Chapter 6. Classification Algorithm
Recommended Readings : Algorithms
1. Overview
2. Type 1. Linear Classifier
3. Type 2. K-Nearest Neighbors Algorithm
4. Type 3. Decision Trees
5. Type 4. Bayes Classifier
6. Type 5. Domain Adaptation Classification
7. Type 6. Deep Learning-Based Classification
8. Other Algorithms
b. Calibrated Classification Model
1. Overview
⑴ Definition
① y ~ X (where |{y}| < ∞, i.e., y takes only finitely many values)
② Belongs to the supervised algorithm category
⑵ Calibrated Classification Model
2. Type 1. Linear Classifier
⑴ Overview
① Utilizes non-linear regression analysis
② Drawback 1. Linear classifiers can learn only one template per class
③ Drawback 2. Linear classifiers can implement only linear decision boundaries
④ (Reference) Hyperplane: a flat surface generalized to more than three dimensions, as used in multiple regression
⑵ 1-1. SVM (Support Vector Machine)
① Overview
○ Definition: A technique for classifying points into two categories by finding the optimal hyperplane (maximum-margin hyperplane) that maximizes the margin in a high-dimensional or infinite-dimensional space
Figure 1. SVM and Hyperplane
○ Multiple hyperplanes are possible: setting several hyperplanes allows points to be classified into more than two sets (multi-class classification)
○ Linearly inseparable classification problems can be solved by mapping from the original space to a higher-dimensional space
○ Advantages: Frequently used method with well-established research. Strong resistance to overfitting compared to other models. High accuracy.
○ Disadvantages: Requires a labeled data set to be provided for supervised learning. Slower than other models.
② Formula
Figure 2. SVM Formula
○ Significance of the final formula: It is easy to apply an optimization algorithm and can be extended to classification problems that are not linearly separable by applying a kernel.
○ By introducing slack variables ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)), some data points that violate the margin constraint are allowed.
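○ (Reference) A standard form of the soft-margin SVM optimization problem, written in common notation (a sketch, not reproduced from the figure above; C is the regularization constant):
\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\text{subject to} \;\; y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n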
③ Kernels used in SVM
○ Linear Kernel: The basic kernel; a first-degree (linear) function, and faster to compute than the other kernels
○ Polynomial Kernel: A generalization of the linear kernel; generally less efficient in terms of effectiveness and accuracy
○ Gaussian Kernel: A commonly used kernel, applied when there is no prior knowledge about the data
○ Gaussian RBF Kernel: The most commonly used kernel, typically applied to nonlinear data. In its formula (see the reference forms after this list), the vector ℓ represents a landmark (e.g., a reference point such as the center of a group of data points).
○ Sigmoid Kernel: Preferred as a kernel in artificial neural networks
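○ (Reference) Standard forms of these kernels, written in common notation (a sketch; γ, r, d, and σ are kernel hyperparameters, and ℓ is the landmark mentioned above):
K_{\text{linear}}(x, y) = x^{\top} y
K_{\text{poly}}(x, y) = (\gamma\, x^{\top} y + r)^{d}
K_{\text{RBF}}(x, \ell) = \exp\!\left( -\frac{\lVert x - \ell \rVert^{2}}{2\sigma^{2}} \right)
K_{\text{sigmoid}}(x, y) = \tanh(\gamma\, x^{\top} y + r)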
④ Implementation in R
## data
print(head(iris))
## training
library(e1071)
train <- sample(1:150, 100)
sv <- svm(Species ~ ., data = iris, subset = train, type = "C-classification")
## testing
pred <- predict(sv, iris[-train, ])
print(head(pred))
## review
tt <- table(iris$Species[-train], pred)
print(tt) # confusion matrix
⑶ 1-2. nu-SVM
⑷ 1-3. Probability Regression Model: LPM, probit, logit, etc.
① R Code
## R code
# Train the logistic regression model (assumes a data frame x_train with a binary response column `species`)
model <- glm(species ~ ., data = x_train, family = binomial())
# Predict class probabilities for the test set
prob_estimates <- predict(model, x_test, type = "response")
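② (Reference) The three models differ in how the linear predictor x^⊤β is mapped to a class probability; in standard notation (Φ is the standard normal CDF):
\text{LPM:} \quad P(y = 1 \mid x) = x^{\top}\beta
\text{probit:} \quad P(y = 1 \mid x) = \Phi(x^{\top}\beta)
\text{logit:} \quad P(y = 1 \mid x) = \frac{1}{1 + e^{-x^{\top}\beta}}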
3. Type 2. K-Nearest Neighbors Algorithm (K-NN)
⑴ Overview
① A method to label a given test data point using K nearest training data points
② Example: YouTube recommendation system
③ Does not work well in high dimensions (roughly beyond 5 dimensions), a consequence of the curse of dimensionality
⑵ Step 1. Data Loading and Architecture Design
import numpy as np
import tensorflow as tf

# the NearestNeighbor class used below is defined in Step 2
(Xtr, Ytr), (Xte, Yte) = tf.keras.datasets.mnist.load_data()
print(Xtr.shape) # (60000, 28, 28)
print(Ytr.shape) # (60000,)
print(Xte.shape) # (10000, 28, 28)
print(Yte.shape) # (10000,)

# flatten the images into row vectors; cast to float32 so differences do not wrap around in uint8
Xtr_rows = Xtr.reshape(Xtr.shape[0], 28*28).astype(np.float32)
Xte_rows = Xte.reshape(Xte.shape[0], 28*28).astype(np.float32)
print(Xtr_rows.shape) # (60000, 784)
print(Xte_rows.shape) # (10000, 784)

nn = NearestNeighbor()
nn.train(Xtr_rows, Ytr)
Yte_predict = nn.predict(Xte_rows, 'L2')
print('accuracy: %f' % np.mean(Yte_predict == Yte))
# L1 : accuracy = 47.29%
# L2 : accuracy = 27.24%
# dot : accuracy = 9.9%
⑶ Step 2. NN Class Design
class NearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, Y):
        # training simply memorizes all of the data
        self.Xtr = X
        self.Ytr = Y

    def predict(self, X, dist_metric='L2', k=1):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.Ytr.dtype)
        for i in range(num_test):
            if dist_metric == 'L1':      # L1 (Manhattan) distance
                distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            elif dist_metric == 'L2':    # L2 (squared Euclidean) distance
                distances = np.sum(np.square(self.Xtr - X[i, :]), axis=1)
            elif dist_metric == 'dot':   # dot product used as a "distance" (not a proper metric,
                                         # which explains its near-random accuracy above)
                distances = np.dot(self.Xtr, X[i, :])
            # take the k nearest training examples and vote on their labels (k=1 reproduces 1-NN)
            nearest = np.argsort(distances)[:k]
            labels, counts = np.unique(self.Ytr[nearest], return_counts=True)
            Ypred[i] = labels[np.argmax(counts)]
            if i % 100 == 0:
                print(i)
        return Ypred
① The training phase is merely a data storage phase
② Factor 1. Designing the distance function
○ Distance function (metric): defines how the distance between two examples is measured
○ 1-1. L1 Distance
○ 1-2. L2 Distance: Euclidean distance using the Pythagorean theorem (standard)
○ 1-3. p-norm
○ 1-4. Dot Product
○ 1-5. KL Distance (Kullback–Leibler divergence)
○ 1-6. Other distance functions
○ Results from L1 and L2 distances are similar: L1 38.6%, L2 35.4%
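○ (Reference) Standard definitions of the distances listed above, for vectors x and y (and probability distributions p, q for the KL divergence):
d_{L1}(x, y) = \sum_{j} \lvert x_j - y_j \rvert
d_{L2}(x, y) = \sqrt{\sum_{j} (x_j - y_j)^{2}}
d_{p}(x, y) = \left( \sum_{j} \lvert x_j - y_j \rvert^{p} \right)^{1/p}
D_{KL}(p \parallel q) = \sum_{j} p_j \log \frac{p_j}{q_j}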
③ Factor 2. 1-NN vs. k-NN
○ The edges of each region become smoother as k increases
○ If k is too small, it is not robust and may respond to outliers or noise when labeling
○ Ambiguous regions appear as k increases
⑷ Step 3. Hyperparameter Tuning: Try various k values and choose the one that yields the best results
Xval_rows = Xtr_rows[:1000, :] # take first 1000 for validation
Yval = Ytr[:1000]
Xtr_rows = Xtr_rows[1000:, :] # keep last 59,000 for train
Ytr = Ytr[1000:]
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
nn = NearestNeighbor()
nn.train(Xtr_rows, Ytr)
Yval_predict = nn.predict(Xval_rows, k = k)
acc = np.mean(Yval_predict == Yval)
print('accuracy: %f' % (acc,))
validation_accuracies.append((k, acc))
⑸ Step 4. Improvement
① Drawbacks
○ Using KNN at the pixel level may not result in true similarity, leading to poor performance
○ Test time is very slow
○ Curse of dimensionality: As the dimension increases, the number of data required to maintain the same distance between examples increases geometrically
○ When examples are too sparse relative to the space, classification results become unreliable
② Improvements (a short sketch combining some of these appears at the end of this step)
○ Data Normalization: Normalize data to have a mean of 0 and a standard deviation of 1
○ Dimensionality Reduction: Techniques like PCA, NCA, random projection, etc.
○ Cross-Validation: Computationally expensive, commonly used methods include 3-fold, 5-fold, and 10-fold cross-validation
○ Approximate Nearest Neighbors: Reduces accuracy but improves speed (e.g., FLANN)
③ Despite these improvements, KNN is rarely used in practice
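④ Example: a minimal scikit-learn sketch (not from the original text) that combines the normalization and dimensionality-reduction improvements above with a k-NN classifier; the dataset and parameter choices are illustrative assumptions:
# illustrative sketch: standardize, reduce dimensionality with PCA, then classify with k-NN
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)                 # small 8x8 digit images (64 features)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = make_pipeline(
    StandardScaler(),                     # normalize features to mean 0, standard deviation 1
    PCA(n_components=30),                 # reduce 64 pixel features to 30 components
    KNeighborsClassifier(n_neighbors=5),  # k-NN on the reduced features
)
knn.fit(X_train, y_train)
print('test accuracy:', knn.score(X_test, y_test))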
⑹ 2-1. Mutual Nearest Neighbors Algorithm
⑺ Comparison between K-NN and K-Means Clustering
Item | K-NN | K-Means Clustering |
---|---|---|
Type | Supervised Learning | Unsupervised Learning |
Meaning of k | Number of Nearest Neighbors | Number of Clusters |
Optimization Techniques | Cross Validation, Confusion Matrix | Elbow Method, Silhouette Method |
Application | Classification | Clustering |
Table. 1. Comparison of K-NN and K-Means Clustering
4. Type 3. Decision Trees
⑴ Overview
① A model for predicting output values from given input values; includes classification tree and regression tree models
② A technique that expresses decision rules as a tree diagram using classification functions and uses it for analysis
③ Splits are chosen so that the purity of the child nodes is higher than the purity of the parent node
④ Calculation results are shown directly in the tree, so the model is easy to read
⑤ Disadvantage: prediction errors can be large near split boundaries because continuous variables are handled through discrete splits
⑵ 3-1. Simple Decision Trees
① Overview
○ Advantage 1. Good representation for given data
○ Advantage 2. Interpretable and fast
○ Advantage 3. Easily handles the following data: categorical, mixed data, missing data, multiple classes
○ Drawback 1. Potential for overfitting, meaning it is vulnerable to outliers.
○ Drawback 2. The tree becomes too deep.
○ Solution: Tree pruning
② Terminology
○ Root Node
○ Internal Node (Node): Forms conditional statements for attributes
○ Branching: Splits from internal nodes based on attribute values
○ Leaf Node (Terminal Node, Decision Node): Outputs (class assignment)
③ Learning Decision Trees
○ Condition 1. Not too big or too small
○ Condition 2. Occam's razor: the simplest model is the best
○ Condition 3. Complexity depends on depth
○ Learning the smallest and simplest decision tree is an NP-complete problem (Hyafil & Rivest (1976))
○ Decision trees can be learned greedily using heuristic methods
### BuildTree(D): Greedy training of a decision tree ###
# Input: a data set D = (X, Y).
# Output: a decision tree.
if LeafCondition(D) then
    fn = FindBestPrediction(D)
else
    jn, tn = FindBestSplit(D)
    DL = {(x(i), y(i)) : x(i)_(jn) < tn}
    DR = {(x(i), y(i)) : x(i)_(jn) ≥ tn}
    leftChild = BuildTree(DL)
    rightChild = BuildTree(DR)
end if
○ Step 1. Score all splits
○ Step 2. Greedily find the condition with the maximum information gain
○ Step 3. Divide the dataset with that condition and recursively perform Step 1 and Step 2 in each subtree
○ Step 4. Stopping criteria
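○ (Reference) Standard definitions of entropy and information gain used to score a split (D is the data at a node, D_v the subset sent to child v, and p_c the fraction of class c in D):
H(D) = -\sum_{c} p_c \log_2 p_c
IG(D, \text{split}) = H(D) - \sum_{v} \frac{\lvert D_v \rvert}{\lvert D \rvert} H(D_v)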
④ R Code
## R code
library(rpart)
# Train the decision tree model (assumes a data frame x_train with a categorical column `species`)
model <- rpart(species ~ ., data = x_train, method = "class")
# Predict class probabilities for the test set
prob_estimates <- predict(model, x_test, type = "prob")
⑤ Application 1: Pruning
⑥ Application 2: Information Gain Ratio
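○ (Reference) The information gain ratio (used by C4.5) normalizes information gain by the split information, penalizing splits with many branches:
SplitInfo(D, \text{split}) = -\sum_{v} \frac{\lvert D_v \rvert}{\lvert D \rvert} \log_2 \frac{\lvert D_v \rvert}{\lvert D \rvert}
GainRatio(D, \text{split}) = \frac{IG(D, \text{split})}{SplitInfo(D, \text{split})}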
⑶ 3-2. Bagging (Bootstrap Aggregation)
① Definition: Trains a classifier on each of several datasets obtained by sampling with replacement (bootstrap samples) and determines the final classification by voting
Figure 3. Bagging Illustration
② Bootstrap: resampling the data with replacement to create multiple datasets; the procedure is not limited to decision trees
③ Advantage 1. Reduces the variance of unstable processes like trees, improving predictability
④ Advantage 2. Smoothes the boundaries of decision trees
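⑤ Example: a minimal scikit-learn sketch of bagging (illustrative only, not from the original text; scikit-learn's default base estimator for BaggingClassifier is a decision tree):
# bagging sketch: each tree is fit on a bootstrap sample; predictions are combined by voting
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(n_estimators=100, random_state=0)  # 100 bootstrap resamples
bag.fit(X_train, y_train)
print('test accuracy:', bag.score(X_test, y_test))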
⑷ 3-3. Random Forest
① Definition: Utilizes m features randomly selected from each sub-dataset for bagging
○ Example: When learning, randomly select only some features such as city, population, average income, number of newborns, and number of newlyweds among various features
② Feature 1. De-correlating: using only a random subset of features reduces the correlation between the individual trees built on each sub-dataset
③ Feature 2. Observations left out of each bootstrap sub-dataset (out-of-bag samples) can be used to estimate out-of-sample error
④ Performance: Random forest > bagging > decision tree
○ Accuracy: Comparable or higher than Adaboost
○ Robustness: Relatively robust against outliers and noise
○ Speed: Faster than bagging and boosting
○ Simplicity and easy parallelization
○ Often regarded as one of the best-performing off-the-shelf algorithms in classification research
⑤ R Code
## R code
library(randomForest)
# Train the random forest model (for classification, the `species` column should be a factor)
model <- randomForest(species ~ ., data = x_train)
# Predict probabilities
prob_estimates <- predict(model, x_test, type = "prob")
⑸ 3-4. C4.5 and C5.0
① Developed by Australian researcher J. Ross Quinlan
② The initial version was developed as ID 3 (Iterative Dichotomizer 3) in 1986
③ Uses training data when using pruning
④ Uses entropy index as a measure of impurity with discrete dependent variables
⑹ 3-5. CHAID
① Uses the chi-squared statistic and employs multiway splits in the decision tree algorithm
⑺ 3-6. CART (Classification and Regression Tree)
① Performs classification by repeatedly bifurcating each independent variable to form a binary tree
⑻ 3-7. AdaBoost
⑼ 3-8. QUEST
5. Type 4. Bayes Classifier
⑴ Naïve Bayes classifier
① Overview
○ An algorithm that assumes conditional independence between variables in the data.
○ An algorithm that combines prior information about classes with information extracted from data, using Bayes’ theorem to classify which class a given data point belongs to.
○ It can be used as a solution for problems in text classification, such as determining which category (e.g., spam, economics, sports, etc.) a document belongs to.
② Formalization
○ Premise: There are K classes C1, C2, ···, CK.
○ Objective: Given a new sample x = (x1, ···, xN) composed of multiple features, determine which class it belongs to.
○ Formalization: Similar to MLE (Maximum Likelihood Estimation).
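○ (Reference) Under the conditional independence assumption, the standard naive Bayes decision rule, in common notation, is:
\hat{C} = \arg\max_{k \in \{1, \dots, K\}} \; P(C_k) \prod_{i=1}^{N} P(x_i \mid C_k)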
③ Limitations
○ Violation of conditional independence: If there are significant correlations between features (e.g., age and history of heart disease), related variables may be multiplied multiple times, leading to an overestimation of their effects.
○ Inability to exclude the influence of irrelevant features.
○ Solution: By using methods such as PCR or mRMR to discard unnecessary variables and retain only the essential ones, these limitations can be mitigated.
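④ Example: a minimal scikit-learn sketch of a Gaussian naive Bayes classifier (illustrative only, not from the original text; the iris data set is an assumed example):
# naive Bayes sketch: each feature is modeled independently within each class
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print('class probabilities for the first test sample:', nb.predict_proba(X_test[:1]))
print('test accuracy:', nb.score(X_test, y_test))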
⑵ Bayesian Network
6. Type 5. Domain Adaptation Classification
⑴ Overview
① Definition : Obtaining universal knowledge from one domain and applying it to another domain.
② One domain : Referred to as the source domain, tangential domain, etc. Large dataset size.
③ Another domain : Referred to as the target domain, specific domain, etc. Small dataset size.
④ Differences from generalization
○ Domain adaptation classification : Differences in format, data structure, etc., between source and target.
○ Generalization : Format, data structure, etc., are the same between source and target.
⑵ Types
① UDA (Unsupervised Domain Adaptation) : When there are no labels in the target domain.
Figure 4. Diagram of UDA
○ In other words, the structure of applying features obtained from the source domain directly to the target domain.
② SDA (Supervised Domain Adaptation) : When there are labels in the target domain.
③ SSDA (Semi-Supervised Domain Adaptation) : When the target domain has labels for only some cases.
Figure 5. Diagram of SSDA
○ Process 1: Augmentation
○ Process 2: Random logit interpolation
○ Process 3: Distribution alignment
○ Process 4: Relative confidence threshold
○ Process 5: Loss function
○ Reason for being state-of-the-art (SOTA): Because of the trade-off, as can be confirmed in the ‘motivation’ section in the link.
⑶ Objective Functions
① Aims to reduce the discrepancy between domains.
② Types of discrepancies
○ MMD (Maximum Mean Discrepancy)
○ DANN (Domain Adversarial Neural Network)
○ MCD (Maximum Classifier Discrepancy)
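○ (Reference) For example, the empirical maximum mean discrepancy compares the mean feature embeddings of the source and target samples (a sketch in standard notation; φ is a feature map into an RKHS ℋ):
\widehat{\text{MMD}}(D_S, D_T) = \left\lVert \frac{1}{n_S} \sum_{i=1}^{n_S} \varphi\!\left(x_i^{S}\right) - \frac{1}{n_T} \sum_{j=1}^{n_T} \varphi\!\left(x_j^{T}\right) \right\rVert_{\mathcal{H}}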
③ Divergence Domain Adaptation
④ Adversarial Domain Adaptation
○ Generative Adversarial Network (GAN)
○ Definition : Two AI systems learning while competing with each other (game learning).
○ Generative Model : Utilizes knowledge learned based on various features of existing data to generate new data.
○ Discriminative Model : Evaluates how similar the generated new data is to the existing data.
○ By iterating between the generative and discriminative models, results close to real data are produced automatically.
⑤ Reconstruction Domain Adaptation
7. Type 6. Deep Learning-Based Classification
⑴ Deep Learning (Part 1, Part 2, Part 3)
⑵ 6-1. 1D CNN
8. Other Algorithms
⑴ Faiss
① A library for similarity search and clustering of dense vectors, developed by Meta (Facebook AI Research) and first released in 2017.
② Fast search algorithm: capable of obtaining the nearest neighbor and the k-th nearest neighbor.
③ Allows for the search of multiple vectors at once (batch processing).
④ There is a trade-off between accuracy and speed.
⑤ Supports not only minimizing Euclidean (L2) distance but also maximum inner product search.
⑥ Returns all elements contained within a given radius.
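⑦ Example: a minimal Python sketch of exact nearest-neighbor search with Faiss (illustrative only; assumes the faiss package is installed, and the data here is random):
# exact L2 nearest-neighbor search with a flat (brute-force) Faiss index
import faiss
import numpy as np

d = 64                                                # vector dimensionality
xb = np.random.random((10000, d)).astype('float32')  # database vectors
xq = np.random.random((5, d)).astype('float32')      # query vectors (batch of 5)

index = faiss.IndexFlatL2(d)   # minimizes L2 distance; IndexFlatIP maximizes the inner product
index.add(xb)                  # add database vectors to the index
D, I = index.search(xq, 4)     # distances and indices of the 4 nearest neighbors per query
print(I)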