
Chapter 6. Classification Algorithm

Recommended Readings : Algorithms


1. Overview

2. Type 1. Linear Classifier

3. Type 2. K-Nearest Neighbors Algorithm

4. Type 3. Decision Trees

5. Type 4. Bayes Classifier

6. Type 5. Domain Adaptation Classification

7. Type 6. Deep Learning-Based Classification

8. Other Algorithms


a. Github (R, Python)

b. Calibrated Classification Model



1. Overview

⑴ Definition

① y ~ X (where | { y } | < ∞ )

② Belongs to the supervised algorithm category

③ (Reference) Calibrated Classification Model



2. Type 1. Linear Classifier

⑴ Overview

① Builds on regression analysis with a non-linear link function (e.g., the logit and probit models in 1-3 below)

Drawback 1. Linear classifiers can learn only one template per class

Drawback 2. Linear classifiers can implement only linear decision boundaries among decision boundaries

④ (Reference) Hyperplane: the generalization of a line or plane to higher dimensions; in multiple regression, the fitted surface beyond 3D is a hyperplane

1-1. SVM (Support Vector Machine)

① Overview

○ Definition: A technique for classifying points into two categories by finding the optimal hyperplane (maximum-margin hyperplane) that maximizes the margin in a high-dimensional or infinite-dimensional space


Figure 1. SVM and Hyperplane


○ Multi-class classification is also possible: combining multiple hyperplanes (multiple SVMs) can separate points into more than two sets

○ Linearly inseparable classification problems can be solved by mapping from the original space to a higher-dimensional space

○ Advantages: Frequently used method with well-established research. Strong resistance to overfitting compared to other models. High accuracy.

○ Disadvantages: Requires labeled training data, as with any supervised method. Slower than other models.

② Formula


Figure 2. SVM Formula


minimize (1/2)‖w‖² + C Σi ξi   subject to   yi(w·xi + b) ≥ 1 − ξi,  ξi ≥ 0  (i = 1, …, n)


○ Significance of the final formula: It is easy to apply an optimization algorithm and can be extended to classification problems that are not linearly separable by applying a kernel.

○ By introducing slack variables ξi (with ξi ≥ 1 − yi(w·xi + b) and ξi ≥ 0), data points that violate the margin constraint are allowed.

③ Kernels used in SVM (a Python sketch comparing them follows this list)

○ Linear Kernel: the most basic kernel; equivalent to a degree-1 polynomial and faster to compute than the other kernels

○ Polynomial Kernel: a generalization of the linear kernel to higher-degree terms; often less efficient in terms of speed and accuracy

○ Gaussian Kernel: Generally used kernel, used when there is no prior knowledge of data

○ Gaussian RBF Kernel: The most commonly used kernel, typically applied to nonlinear data. In the following equation, the vector ℓ represents a landmark (e.g., a reference point such as a selected data point or a cluster center).


K(x, ℓ) = exp( −‖x − ℓ‖² / (2σ²) )


○ Sigmoid Kernel: Preferred as a kernel in artificial neural networks
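
As an illustration only (the original workflow below uses R), here is a minimal Python sketch comparing these kernels with scikit-learn's SVC on the iris data; the degree and gamma settings are illustrative assumptions, not tuned choices.

## Python code (sketch)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Illustrative kernel settings; degree/gamma are assumptions, not recommendations
kernels = {
    "linear": SVC(kernel="linear"),
    "poly": SVC(kernel="poly", degree=3),
    "rbf": SVC(kernel="rbf", gamma="scale"),   # Gaussian RBF kernel
    "sigmoid": SVC(kernel="sigmoid"),
}

for name, clf in kernels.items():
    model = make_pipeline(StandardScaler(), clf)   # feature scaling matters for RBF/sigmoid
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))       # held-out accuracy per kernel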

④ Implementation in R


## data
print(head(iris))

## training
library(e1071)
train <- sample(1:150, 100)
sv <- svm(Species ~ ., data = iris, subset = train, type = "C-classification")

## testing
pred <- predict(sv, iris[-train, ])
print(head(pred))

## review 
tt <- table(iris$Species[-train], pred)
print(tt) # confusion matrix


1-2. nu-SVM

1-3. Probability Regression Model: LPM, probit, logit, etc.

① R Code

## R code

# Train the logistic regression model
# (family = binomial() assumes a binary response; for three or more classes use, e.g., nnet::multinom)
model <- glm(species ~ ., data = x_train, family = binomial())

# Predict probabilities
prob_estimates <- predict(model, x_test, type = "response")



3. Type 2. K-Nearest Neighbors Algorithm (K-NN)

⑴ Overview

① A method to label a given test data point using K nearest training data points

② Example: YouTube recommendation system

③ Tends not to work well in high dimensions (roughly beyond 5 dimensions) due to the curse of dimensionality

Step 1. Data Loading and Architecture Design


import numpy as np
import tensorflow as tf

class NearestNeighbor(object):
    ...  # full class definition is given in Step 2 below


(Xtr, Ytr), (Xte, Yte) = tf.keras.datasets.mnist.load_data()

print(Xtr.shape) # (60000, 28, 28)
print(Ytr.shape) # (60000,)
print(Xte.shape) # (10000, 28, 28)
print(Yte.shape) # (10000,)

Xtr_rows = Xtr.reshape(Xtr.shape[0], 28*28)
Xte_rows = Xte.reshape(Xte.shape[0], 28*28)

print(Xtr_rows.shape) # (60000, 784)
print(Xte_rows.shape) # (10000, 784)

nn = NearestNeighbor()
nn.train(Xtr_rows, Ytr)

Yte_predict = nn.predict(Xte_rows, 'L2')
print('accuracy: ' + str(np.mean(Yte_predict == Yte)))

# L1 : accuracy = 47.29%
# L2 : accuracy = 27.24%
# dot : accuracy = 9.9%


Step 2. NN Class Design


class NearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, Y):
        self.Xtr = X
        self.Ytr = Y

    def predict(self, X, dist_metric='L2'):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype = self.Ytr.dtype)
        
        for i in range(num_test):
            if dist_metric=='L1': ## L1 distance
                distances = np.sum(np.abs(self.Xtr - X[i, :]), axis = 1) 
            elif dist_metric=='L2': ## L2 distance
                distances = np.sum(np.square(self.Xtr - X[i, :]), axis = 1) 
            elif dist_metric=='dot': ## dot product (a similarity, not a distance;
                                     ## argmin below therefore picks the least similar example)
                distances = np.dot(self.Xtr, X[i, :])
            
            min_index = np.argmin(distances)
            Ypred[i] = self.Ytr[min_index]
            
            if i%100 == 0:
                print(i)
        return Ypred


① The training phase is merely a data storage phase

Factor 1. Designing the distance function

○ Distance function (metric): defines how the distance between two examples is measured




1-1. L1 Distance: Manhattan distance, d1(x, y) = Σi |xi − yi| (sum of absolute coordinate differences)


1-2. L2 Distance: Euclidean distance using the Pythagorean theorem (standard)

1-3. p-norm

1-4. Dot Product

1-5. KL Divergence (Kullback–Leibler divergence; asymmetric, so not a true metric)

1-6. Other distance functions

○ Results from L1 and L2 distances are similar: L1 38.6%, L2 35.4%

Factor 2. 1-NN vs. k-NN

○ The edges of each region become smoother as k increases

○ If k is too small, it is not robust and may respond to outliers or noise when labeling

○ Ambiguous regions appear as k increases

Step 3. Hyperparameter Tuning: Try various k values and choose the one that yields the best results


Xval_rows = Xtr_rows[:1000, :] # take first 1000 for validation
Yval = Ytr[:1000]
Xtr_rows = Xtr_rows[1000:, :] # keep last 59,000 for train
Ytr = Ytr[1000:]

validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
    nn = NearestNeighbor()
    nn.train(Xtr_rows, Ytr)
    # assumes a k-NN extension whose predict() accepts k
    # (the NearestNeighbor class above is the 1-NN special case)
    Yval_predict = nn.predict(Xval_rows, k = k)
    acc = np.mean(Yval_predict == Yval)
    print('accuracy: %f' % (acc,))
    validation_accuracies.append((k, acc))


Step 4. Improvement

① Drawbacks

○ Using KNN at the pixel level may not result in true similarity, leading to poor performance

○ Test time is very slow

○ Curse of dimensionality: as the dimension increases, the number of training examples required to keep the same spacing between examples grows exponentially

○ If examples are too sparse, the nearest neighbors may actually be far away, leading to unreliable classification results

② Improvements

○ Data Normalization: Normalize data to have a mean of 0 and a standard deviation of 1

○ Dimensionality Reduction: Techniques like PCA, NCA, random projection, etc.

○ Cross-Validation: computationally expensive; commonly used schemes are 3-fold, 5-fold, and 10-fold cross-validation (a code sketch follows below)

○ Approximate Nearest Neighbors: Reduces accuracy but improves speed (e.g., FLANN)

③ Despite these improvements, KNN is rarely used in practice
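
As a minimal sketch of the normalization and cross-validation improvements listed above, the code below uses scikit-learn rather than the NearestNeighbor class from this section; the candidate k values are illustrative assumptions.

## Python code (sketch)

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Try several k values with 5-fold cross-validation;
# standardization (mean 0, std 1) is fit inside each training fold
for k in [1, 3, 5, 10, 20]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print("k =", k, "mean accuracy =", scores.mean())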

2-1. Mutual Nearest Neighbors Algorithm

⑺ Comparison between K-NN and K-Means Clustering


Item | K-NN | K-Means Clustering
Type | Supervised Learning | Unsupervised Learning
Meaning of k | Number of Nearest Neighbors | Number of Clusters
Optimization Techniques | Cross Validation, Confusion Matrix | Elbow Method, Silhouette Method
Application | Classification | Clustering

Table 1. Comparison of K-NN and K-Means Clustering



4. Type 3. Decision Trees

⑴ Overview

① Models that predict an output value for a given input, including classification tree and regression tree models

② Technique of drawing decision trees with decision rules using classification functions for analysis

③ The purity of child nodes increases compared to the purity of parent nodes

④ Calculation results are directly shown in decision trees

⑤ Disadvantage: Prediction error can be large near the split boundaries because continuous variables are handled through discrete splits

3-1. Simple Decision Trees

① Overview

Advantage 1. Good representation for given data

Advantage 2. Interpretable and fast

Advantage 3. Easily handles the following data: categorical, mixed data, missing data, multiple classes

Drawback 1. Potential for overfitting, meaning it is vulnerable to outliers.

Drawback 2. The tree becomes too deep.

○ Solution: Tree pruning

② Terminology

○ Root Node

○ Internal Node (Node): Forms conditional statements for attributes

○ Branching: Splits from internal nodes based on attribute values

○ Leaf Node (Terminal Node, Decision Node): Outputs (class assignment)

③ Learning Decision Trees

Condition 1. Not too big or too small

Condition 2. Occam’s razor: prefer the simplest tree that explains the data

Condition 3. Complexity depends on depth

○ Learning the smallest and simplest decision tree is an NP-complete problem (Hyafil & Rivest (1976))

○ Decision trees can be learned greedily using heuristic methods


### BuildTree(D): Greedy training of a decision tree ###
# Input: a data set D = (X, Y).
# Output: a decision tree.

if LeafCondition(D) then
    fn = FindBestPrediction(D)
else
    jn, tn = FindBestSplit(D)               # best feature index jn and threshold tn
    DL = {(x(i), y(i)) : x(i)_(jn) < tn}    # left subset
    DR = {(x(i), y(i)) : x(i)_(jn) >= tn}   # right subset
    leftChild = BuildTree(DL)
    rightChild = BuildTree(DR)
end if


Step 1. Score all splits

Step 2. Greedily find the split condition with the maximum information gain (a worked sketch follows Step 4)

Step 3. Divide the dataset with that condition and recursively perform Step 1 and Step 2 in each subtree

Step 4. Stopping criteria
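
A worked sketch of the split scoring in Step 1 and Step 2, using entropy-based information gain; the helper names and toy data are illustrative, not from the source.

## Python code (sketch)

import numpy as np

def entropy(y):
    # Shannon entropy of a label vector y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, feature, threshold):
    # Entropy reduction from splitting on feature < threshold
    left, right = y[feature < threshold], y[feature >= threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted

# Greedily score a few candidate thresholds on one toy feature
y = np.array([0, 0, 0, 1, 1, 1])
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
for t in [2.5, 3.5, 4.5]:
    print(t, information_gain(y, feature, t))   # t = 3.5 gives the maximum gain (1 bit)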

④ R Code


## R code

library(rpart)

# Train the decision tree model
model <- rpart(species ~ ., data = x_train, method = "class")

# Predict probabilities
prob_estimates <- predict(model, x_test)


Application 1: Pruning

Application 2: Information Gain Ratio

3-2. Bagging (Bootstrap Aggregation)

① Definition: trains a classifier on each of several datasets obtained by sampling with replacement (bootstrap samples), then determines the final classification by voting (a code sketch follows at the end of this subsection)


Figure 3. Bagging Illustration


② Bootstrap: resampling the original dataset with replacement to create the sub-datasets; the term itself refers to any self-starting process and is not limited to decision trees




Advantage 1. Reduces the variance of unstable processes like trees, improving predictability

Advantage 2. Smooths the decision boundaries compared with a single tree
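
A minimal sketch of bagging with scikit-learn (assuming a recent scikit-learn version); BaggingClassifier's default base estimator is a decision tree, and the number of estimators here is an arbitrary choice.

## Python code (sketch)

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the 100 trees is fit on a bootstrap sample (sampling with replacement),
# and the final class is decided by voting
single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)  # default base estimator is a decision tree

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())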

3-3. Random Forest

① Definition: performs bagging while using only m features randomly selected for each sub-dataset (typically re-sampled at every split)

○ Example: When learning, randomly select only some features such as city, population, average income, number of newborns, and number of newlyweds among various features

Feature 1. De-correlation: using only a subset of the features makes the individual trees less correlated with one another

Feature 2. Observations left out of each bootstrap sample (out-of-bag samples) can be used to estimate the out-of-sample error

④ Performance: Random forest > bagging > decision tree

○ Accuracy: comparable to or higher than AdaBoost

○ Robustness: Relatively robust against outliers and noise

○ Speed: Faster than bagging and boosting

○ Simplicity and easy parallelization

○ Often regarded as one of the best-performing off-the-shelf algorithms in classification research

⑤ R Code


## R code

library(randomForest)

# Train the random forest model
model <- randomForest(species ~ ., data = x_train)

# Predict probabilities
prob_estimates <- predict(model, x_test, type = "prob")


3-4. C4.5 and C5.0

① Developed by Australian researcher J. Ross Quinlan

② The initial version was developed as ID3 (Iterative Dichotomizer 3) in 1986

③ Uses the training data itself for pruning (rather than a separate validation set)

④ Uses entropy index as a measure of impurity with discrete dependent variables

3-5. CHAID

① Uses the chi-squared statistic and employs multiway splits in the decision tree algorithm

3-6. CART (Classification and Regression Tree)

① Performs classification by repeatedly bifurcating each independent variable to form a binary tree

3-7. AdaBoost

3-8. QUEST



5. Type 4. Bayes Classifier

⑴ Naïve Bayes classifier

① Overview

○ An algorithm that assumes conditional independence between variables in the data.

○ An algorithm that combines prior information about classes with information extracted from data, using Bayes’ theorem to classify which class a given data point belongs to.

○ It can be used as a solution for problems in text classification, such as determining which category (e.g., spam, economics, sports, etc.) a document belongs to.

② Formalization

○ Premise: There are K classes C1, C2, ···, CK.

○ Objective: Given a new sample x = (x1, ···, xN) composed of multiple features, determine which class it belongs to.

○ Formalization: Similar to MLE (Maximum Likelihood Estimation); a code sketch follows the formula below.


ŷ = argmax over k ∈ {1, …, K} of P(Ck) · ∏ i=1..N P(xi | Ck)   (Bayes’ theorem combined with the conditional independence assumption)
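
A minimal sketch of this argmax rule using scikit-learn's GaussianNB, which estimates P(Ck) from class frequencies and models each P(xi | Ck) as a Gaussian; the iris data is used only for illustration.

## Python code (sketch)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()                      # prior P(C_k) from class frequencies, Gaussian P(x_i | C_k)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))        # accuracy on held-out data
print(model.predict_proba(X_test[:3]))    # posterior P(C_k | x) for the first three test samples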


③ Limitations

○ Violation of conditional independence: If there are significant correlations between features (e.g., age and history of heart disease), related variables may be multiplied multiple times, leading to an overestimation of their effects.

○ Inability to exclude the influence of irrelevant features.

○ Solution: By using methods such as PCR or mRMR to discard unnecessary variables and retain only the essential ones, these limitations can be mitigated.

⑵ Bayesian Network



6. Type 5. Domain Adaptation Classification

⑴ Overview

① Definition : Obtaining universal knowledge from one domain and applying it to another domain.

② One domain : Referred to as the source domain, tangential domain, etc. Large dataset size.

③ Another domain : Referred to as the target domain, specific domain, etc. Small dataset size.

④ Differences from generalization

○ Domain adaptation classification : Differences in format, data structure, etc., between source and target.

○ Generalization : Format, data structure, etc., are the same between source and target.

⑵ Types

① UDA (Unsupervised Domain Adaptation) : When there are no labels in the target domain.


Figure 4. Diagram of UDA


○ In other words, the structure of applying features obtained from the source domain directly to the target domain.

② SDA (Supervised Domain Adaptation) : When there are labels in the target domain.

③ SSDA (Semi-Supervised Domain Adaptation) : When the target domain has labels for only some cases.


Figure 5. Diagram of SSDA


Process 1: Augmentation

Process 2: Random logit interpolation

Process 3: Distribution alignment

Process 4: Relative confidence threshold

Process 5: Loss function

Reason for being state-of-the-art (SOTA): the trade-off involved, which can be confirmed in the ‘motivation’ section of the linked reference.

⑶ Objective Functions

① Aims to reduce the discrepancy between domains.

② Types of discrepancies

○ MMD (Maximum Mean Discrepancy) (a code sketch follows this list)

○ DANN (Domain Adversarial Neural Network)

○ MCD (Maximum Classifier Discrepancy)
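
A minimal sketch of the MMD discrepancy only (DANN and MCD require full training loops): a biased empirical estimate of squared MMD between source and target feature samples with a Gaussian kernel. The bandwidth and the toy data are assumptions.

## Python code (sketch)

import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of a and rows of b
    sq_dists = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(source, target, sigma=1.0):
    # Biased estimate of squared MMD: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]
    return (gaussian_kernel(source, source, sigma).mean()
            + gaussian_kernel(target, target, sigma).mean()
            - 2 * gaussian_kernel(source, target, sigma).mean())

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 2))   # source-domain features
target = rng.normal(0.5, 1.0, size=(200, 2))   # shifted target-domain features
print(mmd2(source, target))                    # larger values indicate a larger domain gap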

⑷ Applications

① Divergence Domain Adaptation

② Adversarial Domain Adaptation

○ Generative Adversarial Network (GAN)

○ Definition : Two AI systems learning while competing with each other (game learning).

○ Generative Model : Utilizes knowledge learned based on various features of existing data to generate new data.

○ Discriminative Model : Evaluates how similar the generated new data is to the existing data.

○ Produces results close to reality as the generative and discriminative models are repeatedly trained against each other.

③ Reconstruction Domain Adaptation



7. Type 6. Deep Learning-Based Classification

⑴ Deep Learning (Part 1, Part 2, Part 3)

6-1. 1D CNN



8. Other Algorithms

⑴ Faiss

① A similarity-search and dense-vector clustering library developed by Meta (Facebook AI Research); a usage sketch follows this list.

② Fast search algorithm: capable of obtaining the nearest neighbor and the k-th nearest neighbor.

③ Allows for the search of multiple vectors at once (batch processing).

④ There is a trade-off between accuracy and speed.

⑤ In addition to nearest-neighbor search that minimizes Euclidean (L2) distance, it also supports maximum inner product search.

⑥ Returns all elements contained within a given radius.
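
A minimal usage sketch, assuming the faiss-cpu package is installed; the vector dimension, database size, and number of neighbors are arbitrary.

## Python code (sketch)

import faiss
import numpy as np

d = 64                                          # vector dimension
rng = np.random.default_rng(0)
xb = rng.random((10000, d)).astype("float32")   # database vectors
xq = rng.random((5, d)).astype("float32")       # query vectors (searched as one batch)

index = faiss.IndexFlatL2(d)                    # exact search minimizing Euclidean (L2) distance
index.add(xb)                                   # add database vectors to the index
distances, ids = index.search(xq, 4)            # 4 nearest neighbors for each query
print(ids)                                      # row i holds the neighbor indices of query i

# faiss.IndexFlatIP(d) ranks by maximum inner product instead of L2 distance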



Posted : 2021.12.12 11:37
