Chapter 6. Classification Algorithm
Recommended Readings: a. Algorithms, b. Calibrated Classification Model
1. Overview
2. Type 1. Linear Classifier
3. Type 2. K-Nearest Neighbors Algorithm
4. Type 3. Decision Trees
5. Type 4. Base Classifier
6. Type 5. Domain Adaptation Classification
7. Type 6. Deep Learning-Based Classification
8. Other Algorithms
1. Overview
⑴ Definition
① y ~ X (where |{y}| < ∞, i.e., the response y takes only a finite number of values)
② Belongs to the supervised algorithm category
⑵ Calibrated Classification Model
2. Type 1. Linear Classifier
⑴ Overview
① Utilizes non-linear regression analysis
② Drawback 1. Linear classifiers can learn only one template per class (see the score-function formula below)
③ Drawback 2. Among possible decision boundaries, linear classifiers can implement only linear ones
④ (Reference) Hyperplane: the generalization of a plane to higher-dimensional spaces; in multiple regression it denotes the fitted surface in more than three dimensions
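For reference, a linear classifier computes one score per class, and each row of the weight matrix W acts as a single template for its class, which is why only one template per class can be learned (a standard formulation, added here for context; it is not reproduced from the original figures):
f(x) = Wx + b, \qquad \hat{y} = \arg\max_{k} \, (Wx + b)_k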
⑵ 1-1. SVM (Support Vector Machine)
① Overview
○ Definition: A technique for classifying points into two categories by finding the optimal hyperplane (the maximum-margin hyperplane) that maximizes the margin in a high-dimensional or infinite-dimensional space
Figure 1. SVM and Hyperplane
○ Multiple hyperplanes are possible: setting up several separating hyperplanes allows points to be classified into multiple sets
○ Linearly inseparable classification problems can be solved by mapping from the original space to a higher-dimensional space
○ Advantages: A widely used method with a well-established body of research; relatively resistant to overfitting compared with other models; high accuracy.
○ Disadvantages: Requires labeled data for supervised learning; slower than many other models.
② Formula
Figure 2. SVM Formula
○ Significance of the final formula: It is easy to apply an optimization algorithm and can be extended to classification problems that are not linearly separable by applying a kernel.
○ Introducing the slack variable ξi = max(0, 1 − yi(w·xi + b)) allows some data points to violate the margin constraints.
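For reference, a standard statement of the soft-margin primal problem summarized in Figure 2 is the following, where C is the regularization parameter and n the number of training points (the notation may differ slightly from the figure):
\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0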
③ Kernels used in SVM
○ Linear Kernel: The most basic kernel; corresponds to no feature mapping and is faster than other kernel functions
○ Polynomial Kernel: A generalization of the linear kernel; often less efficient in terms of effectiveness and accuracy
○ Gaussian Kernel: A commonly used kernel, applied when there is no prior knowledge about the data
○ Gaussian RBF Kernel: The most commonly used kernel, typically used for nonlinear data. In the equation given after this list, the vector ℓ represents a landmark (e.g., a reference point such as the center of a group of data points).
○ Sigmoid Kernel: Preferred as a kernel in artificial neural networks
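The Gaussian RBF kernel equation referred to above is, in its standard form (reconstructed here, since the original equation image is not reproduced):
K(x, \ell) = \exp\!\left( -\frac{\lVert x - \ell \rVert^2}{2\sigma^2} \right)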
④ Implementation in R
## data
print(head(iris))
## training
library(e1071)
train <- sample(1:150, 100)
sv <- svm(Species ~., data= iris, subset = train, type = "C-classification")
## testing
pred = predict(sv, iris[-train, ])
print(head(pred))
## review
tt <- table(iris$Species[-train], pred)
print(tt) # confusion matrix
⑶ 1-2. nu-SVM
⑷ 1-3. Probability Regression Model: LPM, probit, logit, etc.
① R Code
## R code
# Train the logistic regression model
model <- glm(species ~ ., data = x_train, family = binomial())
# Predict probabilities
prob_estimates <- predict(model, x_test, type = "response")
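For reference, the three probability-regression models named above differ only in how they map the linear predictor x'β to P(y = 1 | x), where Φ denotes the standard normal CDF (standard definitions, added here for context):
\text{LPM: } P(y = 1 \mid x) = x^{\top}\beta, \qquad \text{probit: } P(y = 1 \mid x) = \Phi(x^{\top}\beta), \qquad \text{logit: } P(y = 1 \mid x) = \frac{1}{1 + e^{-x^{\top}\beta}}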
3. Type 2. K-Nearest Neighbors Algorithm (K-NN)
⑴ Overview
① A method to label a given test data point using K nearest training data points
② Example: YouTube recommendation system
③ Tends to perform poorly in high-dimensional spaces (roughly beyond five dimensions) because of the curse of dimensionality
⑵ Step 1. Data Loading and Architecture Design
import numpy as np
import tensorflow as tf
class NearestNeighbor(object):
    ...  # placeholder; the full class definition is given in Step 2 below
(Xtr, Ytr), (Xte, Yte) = tf.keras.datasets.mnist.load_data()
print(Xtr.shape) # (60000, 28, 28)
print(Ytr.shape) # (60000,)
print(Xte.shape) # (10000, 28, 28)
print(Yte.shape) # (10000,)
Xtr_rows = Xtr.reshape(Xtr.shape[0], 28*28)
Xte_rows = Xte.reshape(Xte.shape[0], 28*28)
print(Xtr_rows.shape) # (60000, 784)
print(Xte_rows.shape) # (10000, 784)
nn = NearestNeighbor()
nn.train(Xtr_rows, Ytr)
Yte_predict = nn.predict(Xte_rows, 'L2')
print('accuracy: %f' % np.mean(Yte_predict == Yte))
# L1 : accuracy = 47.29%
# L2 : accuracy = 27.24%
# dot : accuracy = 9.9%
⑶ Step 2. NN Class Design
class NearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, Y):
        # k-NN "training" simply memorizes the training data
        self.Xtr = X
        self.Ytr = Y

    def predict(self, X, dist_metric='L2', k=1):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.Ytr.dtype)
        for i in range(num_test):
            # (with uint8 pixels, cast to float first, e.g. X.astype(np.float32), to avoid unsigned overflow)
            if dist_metric == 'L1':    ## L1 distance
                distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            elif dist_metric == 'L2':  ## (squared) L2 distance
                distances = np.sum(np.square(self.Xtr - X[i, :]), axis=1)
            elif dist_metric == 'dot': ## dot product (a similarity rather than a distance)
                distances = np.dot(self.Xtr, X[i, :])
            if k == 1:
                # label of the single nearest neighbor
                Ypred[i] = self.Ytr[np.argmin(distances)]
            else:
                # majority vote over the k nearest neighbors (used in Step 3)
                nearest = np.argsort(distances)[:k]
                Ypred[i] = np.bincount(self.Ytr[nearest]).argmax()
            if i % 100 == 0:
                print(i)
        return Ypred
① The training phase is merely a data storage phase
② Factor 1. Designing the distance function (the standard formulas are summarized after this list)
○ Distance function (metric): defines the notion of distance between examples
○ 1-1. L1 Distance (Manhattan distance)
○ 1-2. L2 Distance: Euclidean distance based on the Pythagorean theorem (the standard choice)
○ 1-3. p-norm
○ 1-4. Dot Product
○ 1-5. KL Divergence (Kullback-Leibler divergence)
○ 1-6. Other distance functions
○ Results from L1 and L2 distances are similar: L1 38.6%, L2 35.4%
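The standard formulas for the metrics listed above are as follows (a summary added for reference; the original figures are not reproduced):
d_{L1}(x, y) = \sum_i \lvert x_i - y_i \rvert, \qquad d_{L2}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}, \qquad d_p(x, y) = \Big( \sum_i \lvert x_i - y_i \rvert^p \Big)^{1/p}
s_{\text{dot}}(x, y) = x \cdot y \;(\text{a similarity, not a distance}), \qquad D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}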
③ Factor 2. 1-NN vs. k-NN
○ The edges of each region become smoother as k increases
○ If k is too small, it is not robust and may respond to outliers or noise when labeling
○ Ambiguous regions appear as k increases
⑷ Step 3. Hyperparameter Tuning: Try various k values and choose the one that yields the best results
Xval_rows = Xtr_rows[:1000, :] # take first 1000 for validation
Yval = Ytr[:1000]
Xtr_rows = Xtr_rows[1000:, :] # keep last 59,000 for train
Ytr = Ytr[1000:]
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
    nn = NearestNeighbor()
    nn.train(Xtr_rows, Ytr)
    Yval_predict = nn.predict(Xval_rows, k=k)  # uses the k-aware predict from Step 2
    acc = np.mean(Yval_predict == Yval)
    print('k = %d, accuracy: %f' % (k, acc))
    validation_accuracies.append((k, acc))
⑸ Step 4. Improvement
① Drawbacks
○ Using KNN at the pixel level does not capture true (semantic) similarity, leading to poor performance
○ Test time is very slow
○ Curse of dimensionality: as the dimension increases, the number of examples required to keep the same density of coverage grows exponentially
○ When examples are too sparse in the feature space, classification results become unreliable
② Improvements
○ Data Normalization: Normalize data to have a mean of 0 and a standard deviation of 1
○ Dimensionality Reduction: Techniques like PCA, NCA, random projection, etc.
○ Cross-Validation: Computationally expensive, commonly used methods include 3-fold, 5-fold, and 10-fold cross-validation
○ Approximate Nearest Neighbors: Reduces accuracy but improves speed (e.g., FLANN)
③ Despite these improvements, KNN is rarely used in practice
⑹ 2-1. Mutual Nearest Neighbors Algorithm
⑺ Comparison between K-NN and K-Means Clustering
| Item | K-NN | K-Means Clustering |
| --- | --- | --- |
| Type | Supervised learning | Unsupervised learning |
| Meaning of k | Number of nearest neighbors | Number of clusters |
| Optimization techniques | Cross-validation, confusion matrix | Elbow method, silhouette method |
| Application | Classification | Clustering |
Table 1. Comparison of K-NN and K-Means Clustering
4. Type 3. Decision Trees
⑴ Overview
① Models for predicting output values for given input values, including single trees and regression tree models
② A technique that expresses decision rules as a tree of classification functions and analyzes data by following the tree
③ The purity of child nodes increases compared to the purity of parent nodes
④ Calculation results are directly shown in decision trees
⑤ Disadvantage: Because continuous variables are discretized at the split points, prediction errors can be large near those boundary points
⑵ 3-1. Simple Decision Trees
① Overview
○ Advantage 1. Good representation for given data
○ Advantage 2. Interpretable and fast
○ Advantage 3. Easily handles the following data: categorical, mixed data, missing data, multiple classes
○ Drawback 1. Potential for overfitting, meaning it is vulnerable to outliers.
○ Drawback 2. The tree becomes too deep.
○ Solution: Tree pruning
② Terminology
○ Root Node
○ Internal Node (Node): Forms conditional statements for attributes
○ Branching: Splits from internal nodes based on attribute values
○ Leaf Node (Terminal Node, Decision Node): Outputs (class assignment)
③ Learning Decision Trees
○ Condition 1. Not too big or too small
○ Condition 2. Occam’s razor: the simplest tree that explains the data is best
○ Condition 3. Complexity depends on depth
○ Learning the smallest and simplest decision tree is an NP-complete problem (Hyafil & Rivest (1976))
○ Decision trees can be learned greedily using heuristic methods
### BuildTree(D): greedy training of a decision tree ###
# Input: a data set D = (X, Y).
# Output: a decision tree.
if LeafCondition(D) then
    fn = FindBestPrediction(D)
else
    jn, tn = FindBestSplit(D)
    DL = {(x(i), y(i)) : x(i)_(jn) < tn}
    DR = {(x(i), y(i)) : x(i)_(jn) ≥ tn}
    leftChild = BuildTree(DL)
    rightChild = BuildTree(DR)
end if
○ Step 1. Score all splits
○ Step 2. Greedily find the condition with the maximum information gain
○ Step 3. Divide the dataset with that condition and recursively perform Step 1 and Step 2 in each subtree
○ Step 4. Stopping criteria
④ R Code
## R code
library(rpart)
# Train the decision tree model
model <- rpart(species ~ ., data = x_train, method = "class")
# Predict probabilities
prob_estimates <- predict(model, x_test)
⑤ Application 1: Pruning
⑥ Application 2: Information Gain Ratio
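For reference, with entropy H as the impurity measure, the information gain used in the greedy splitting step and the information gain ratio of ⑥ are defined as follows (standard definitions, added for context):
H(D) = -\sum_k p_k \log_2 p_k, \qquad IG(D, A) = H(D) - \sum_{v} \frac{\lvert D_v \rvert}{\lvert D \rvert} H(D_v)
SplitInfo(D, A) = -\sum_{v} \frac{\lvert D_v \rvert}{\lvert D \rvert} \log_2 \frac{\lvert D_v \rvert}{\lvert D \rvert}, \qquad GainRatio(D, A) = \frac{IG(D, A)}{SplitInfo(D, A)}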
⑶ 3-2. Bagging (Bootstrap Aggregation)
① Definition: Fits a classifier to each of several datasets obtained by sampling with replacement, then determines the final classification by voting
Figure 3. Bagging Illustration
② Bootstrap: Repeatedly drawing samples of the same size from the original data with replacement; the technique is not limited to decision trees
③ Advantage 1. Reduces the variance of unstable processes like trees, improving predictability
④ Advantage 2. Smoothes the boundaries of decision trees
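As an illustration, a minimal Python sketch of bagging with decision trees follows; scikit-learn and the iris data (mirroring the earlier R examples) are assumptions for demonstration, not code from the original post.
## Python sketch (assumption: scikit-learn, iris data)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100, random_state=0)

# each of the 100 base decision trees is fit on a bootstrap sample
# (sampling with replacement); the final class is decided by voting
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))  # test-set accuracy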
⑷ 3-3. Random Forest
① Definition: Performs bagging while using only m randomly selected features in each sub-dataset (tree)
○ Example: When learning, randomly select only some features such as city, population, average income, number of newborns, and number of newlyweds among various features
② Feature 1. De-correlating: Increases individuality of each sub-dataset by using only some features
③ Feature 2. Observations left out of each bootstrap sub-dataset (out-of-bag samples) can be used to estimate the out-of-sample error
④ Performance: Random forest > bagging > decision tree
○ Accuracy: Comparable to or higher than AdaBoost
○ Robustness: Relatively robust against outliers and noise
○ Speed: Faster than bagging and boosting
○ Simplicity and easy parallelization
○ Often among the best-performing off-the-shelf classification algorithms
⑤ R Code
## R code
library(randomForest)
# Train the random forest model
model <- randomForest(species ~ ., data = x_train)
# Predict probabilities
prob_estimates <- predict(model, x_test, type = "prob")
⑸ 3-4. C4.5 and C5.0
① Developed by Australian researcher J. Ross Quinlan
② The initial version was developed as ID3 (Iterative Dichotomizer 3) in 1986
③ Uses the training data itself when pruning (rather than a separate validation set)
④ Uses the entropy index as the impurity measure for discrete (categorical) dependent variables
⑹ 3-5. CHAID
① Uses the chi-squared statistic and employs multiway splits in the decision tree algorithm
⑺ 3-6. CART (Classification and Regression Tree)
① Performs classification by repeatedly bifurcating each independent variable to form a binary tree
⑻ 3-7. AdaBoost
⑼ 3-8. QUEST
⑽ 3-9. K-D Tree
① Definition: A spatial partitioning data structure used to organize points in k-dimensional space.
② Structure of a K-D Tree: The space is split recursively, alternating the splitting axis (e.g., the x-axis, then the y-axis) and dividing at the median point along that axis.
Figure 4. Structure of a K-D Tree
③ Calculating the Shortest Distance Using a K-D Tree (a code sketch follows these steps):
○ Step 1: Start the search from the root node.
○ Step 2: Based on the splitting axis of each partition, choose the child node where the given coordinates (coords) might belong and move down the tree.
○ Step 3: Repeat Step 2 until reaching the leaf node that is most likely to contain the given coordinates.
○ Step 4: Calculate the Euclidean distance between the points in the leaf node and the given coords.
○ Step 5: Based on the shortest distance found in the leaf node, evaluate whether the parent nodes and sibling nodes could contain closer neighbors. If the sibling node is worth exploring, repeat the distance calculation for that node.
○ Step 6: Leverage the spatial partitioning characteristics of the K-D Tree to skip unnecessary nodes (pruning).
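As a sketch of the search procedure in Steps 1-6 above, the following uses SciPy's k-d tree implementation; the random 2-D points and query coordinates are illustrative assumptions, not data from the original post.
## Python sketch (assumption: SciPy's cKDTree)
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))          # 2-D points to index
tree = cKDTree(points)                  # builds the tree by median splits along alternating axes

query = np.array([0.5, 0.5])            # the given coords
dist, idx = tree.query(query, k=1)      # nearest neighbor (Steps 1-6, with pruning handled internally)
print(dist, points[idx])

dists, idxs = tree.query(query, k=5)            # the 5 nearest neighbors
inside = tree.query_ball_point(query, r=0.05)   # indices of all points within radius r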
5. Type 4. Base Classifier
⑴ Naïve Bayes classifier
① Overview
○ An algorithm that assumes conditional independence between variables in the data.
○ An algorithm that combines prior information about classes with information extracted from data, using Bayes’ theorem to classify which class a given data point belongs to.
○ It can be used as a solution for problems in text classification, such as determining which category (e.g., spam, economics, sports, etc.) a document belongs to.
② Formalization
○ Premise: There are K classes C1, C2, ···, CK.
○ Objective: Given a new sample x = (x1, ···, xN) composed of multiple features, determine which class it belongs to.
○ Formalization: Similar to MLE (Maximum Likelihood Estimation).
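Under the conditional-independence assumption, the classifier assigns the class that maximizes the posterior, which can be written as follows (a standard statement of the rule, added for reference):
\hat{C} = \arg\max_{k \in \{1, \dots, K\}} P(C_k) \prod_{i=1}^{N} P(x_i \mid C_k)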
③ Limitations
○ Violation of conditional independence: If there are significant correlations between features (e.g., age and history of heart disease), related variables may be multiplied multiple times, leading to an overestimation of their effects.
○ Inability to exclude the influence of irrelevant features.
○ Solution: By using methods such as PCR or mRMR to discard unnecessary variables and retain only the essential ones, these limitations can be mitigated.
⑵ Bayesian Network
6. Type 5. Domain Adaptation Classification
⑴ Overview
① Definition: Obtaining universal knowledge from one domain and applying it to another domain.
② One domain: Referred to as the source domain, tangential domain, etc. Large dataset size.
③ Another domain: Referred to as the target domain, specific domain, etc. Small dataset size.
④ Differences from generalization
○ Domain adaptation classification: Differences in format, data structure, etc., between source and target.
○ Generalization: Format, data structure, etc., are the same between source and target.
⑵ Types
① UDA (Unsupervised Domain Adaptation): When there are no labels in the target domain.
Figure 5. Diagram of UDA
○ In other words, the structure of applying features obtained from the source domain directly to the target domain.
② SDA (Supervised Domain Adaptation): When there are labels in the target domain.
③ SSDA (Semi-Supervised Domain Adaptation): When the target domain has labels for only some cases.
Figure 6. Diagram of SSDA
○ Process 1: Augmentation
○ Process 2: Random logit interpolation
○ Process 3: Distribution alignment
○ Process 4: Relative confidence threshold
○ Process 5: Loss function
○ Reason for being state-of-the-art (SOTA): Because of the trade-off, as can be confirmed in the ‘motivation’ section in the link.
⑶ Objective Functions
① Aims to reduce the discrepancy between domains.
② Types of discrepancies
○ MMD (Maximum Mean Discrepancy)
○ DANN (Domain Adversarial Neural Network)
○ MCD (Maximum Classifier Discrepancy)
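As an illustration of the first discrepancy listed above, the following is a minimal NumPy sketch of a (biased) MMD estimate with a Gaussian kernel; the toy source and target samples are assumptions for demonstration, not data from the original post.
## Python sketch (assumption: RBF-kernel MMD, toy Gaussian data)
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # pairwise squared Euclidean distances between rows of A and rows of B
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    # biased estimate of the squared maximum mean discrepancy between samples X and Y
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2 * rbf_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
source = rng.normal(0, 1, size=(200, 5))        # "source domain" features
target_same = rng.normal(0, 1, size=(200, 5))   # same distribution -> small MMD
target_shift = rng.normal(2, 1, size=(200, 5))  # shifted distribution -> large MMD
print(mmd2(source, target_same), mmd2(source, target_shift))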
① Divergence Domain Adaptation
② Adversarial Domain Adaptation
○ Generative Adversarial Network (GAN)
○ Definition: Two models that learn by competing against each other (adversarial, game-like learning).
○ Generative Model: Uses the features it has learned from the existing data to generate new data.
○ Discriminative Model: Evaluates how similar the generated data is to the existing data.
○ By iterating this competition between the generative and discriminative models, outputs close to real data are produced automatically.
③ Reconstruction Domain Adaptation
7. Type 6. Deep Learning-Based Classification
⑴ Deep Learning (Part 1, Part 2, Part 3)
⑵ 6-1. 1D CNN
8. Other Algorithms
⑴ Faiss
① A similarity-search and dense-vector clustering library developed by Meta (Facebook AI Research; open-sourced in 2017).
② Fast search algorithm: capable of obtaining the nearest neighbor and the k-th nearest neighbor.
③ Allows for the search of multiple vectors at once (batch processing).
④ There is a trade-off between accuracy and speed.
⑤ In addition to minimizing Euclidean (L2) distance, it also supports maximum inner product search.
⑥ Returns all elements contained within a given radius.
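As a usage illustration, a minimal Faiss sketch follows; the dimension, database size, and random vectors are assumptions for demonstration, not values from the original post.
## Python sketch (assumption: faiss-cpu installed, random vectors)
import numpy as np
import faiss

d = 64                                          # vector dimension
rng = np.random.default_rng(0)
xb = rng.random((10000, d)).astype('float32')   # database vectors
xq = rng.random((5, d)).astype('float32')       # a batch of query vectors

index = faiss.IndexFlatL2(d)                    # exact search by L2 (Euclidean) distance
index.add(xb)
D, I = index.search(xq, 4)                      # the 4 nearest neighbors per query (batched)
lims, Dr, Ir = index.range_search(xq, 8.0)      # all elements within a given (squared-L2) radius

ip_index = faiss.IndexFlatIP(d)                 # maximum inner product search instead of L2
ip_index.add(xb)
D_ip, I_ip = ip_index.search(xq, 4)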
Posted: 2021.12.12 11:37