Chapter 10. Overview of Deep Learning
Recommended Post: 【Algorithm】 Algorithm Index
1. Overview
2. Evaluation of Machine Learning Algorithms
a. Everything About Google AI: From A to Z
1. Overview
⑴ Meaning of Artificial Intelligence
① Data Science ⊃ Artificial Intelligence ⊃ Machine Learning ⊃ Artificial Neural Networks ⊃ Deep Learning
○ Artificial Intelligence (AI): A broad concept referring to computer intelligence that mimics human thinking.
○ Machine Learning: A technique that creates models suitable for solving problems by learning from data.
○ First defined by Arthur Samuel in 1959.
○ Artificial Neural Networks (ANN): A machine learning algorithm modeled after human neural networks.
○ Advantage 1. Can be trained on tasks difficult to specify concretely.
○ Advantage 2. Can compress a large amount of information (Autoencoder).
○ Deep Learning: Artificial neural networks with deep multi-layered structures.
○ Differs from other machine learning techniques in that features are defined by the computer rather than humans.
○ Recently shows superior performance compared to most traditional machine learning methods in various fields.
② Strong AI vs. Weak AI
③ Turing Test
○ A computer passes the test if a human converses with it for 5 minutes without realizing it is a machine.
○ Invented by Alan Turing in 1950.
⑵ Definition of Deep Learning
① Mathematical Formulation: A DNN (Deep Neural Network) can be expressed with the following formula:
○ yn = f(xn-1 · wi) + b
○ yn: Output vector of neurons in the n-th hidden layer.
○ xn-1: Output vector of neurons in the (n-1)-th hidden layer.
○ wi: Weight vector of neuron i.
○ b: Bias vector.
○ f: Activation function.
② Dimension: Refers to the size of the vector. For example, the dimension of v = (a1, …, an) is n.
③ Inputs: Represented as input vectors of data provided as input, expressed as x = (x1, ···, xℓ).
④ Weights: Weights connecting node i and node j, represented as ωij.
○ Represent synapses of the brain in neural networks; ωij form the matrix W.
⑤ Outputs: y = (y1, ···, yn) = y(x, W).
⑥ Targets: t = (t1, ···, tn), essential for supervised learning algorithms as they require true values for training.
⑦ Activation Function g(·): Determines neuron activation based on weighted input values.
○ Threshold Function.
⑧ Error: Calculates the distance between the algorithm’s output (y) and the actual target value (t) through errors.
⑨ Parameters and Hyperparameters
○ Parameters are measured or learned from data.
○ Parameters are often stored as part of the learned model.
○ Hyperparameters are values arbitrarily set for learning.
○ Examples of hyperparameters include learning rate, tree depth in decision trees, and the number of hidden layers in neural networks.
⑶ Characteristics of DNNs
① Approximation of nonlinear transformations.
② Various hidden layers.
③ Deep Learning Feed Forward: Operations proceed from input layer → hidden layer → output layer without backward feedback loops.
④ Backward Learning: Network weights are updated in reverse order, from output layer → hidden layer.
⑷ Feature Selection
① Fewer features reduce computational complexity and the risk of overfitting.
② Method 1. Selection using domain knowledge.
③ Method 2. Reduction via feature combination: PCA, etc.
④ Method 3. Selection via criteria/ranking systems.
○ Filter-based method: Selects features based on statistical criteria. Fast and simple, advantageous for large datasets.
○ Wrapper-based method: Finds optimal feature combinations by training the model multiple times. Computationally expensive and prone to overfitting.
⑸ Example: Applying deep learning models to MNIST (handwritten digit) data.
import tensorflow as tf
from tensorflow.keras import layers, models
# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Normalize the pixel values of the images to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
# Define a simple sequential model
def create_model():
model = models.Sequential([
layers.Flatten(input_shape=(28, 28)), # Flatten the 2D image to a 1D array
layers.Dense(128, activation='relu'), # First hidden layer with 128 neurons and ReLU activation
layers.Dropout(0.2), # Dropout layer to reduce overfitting
layers.Dense(10, activation='softmax') # Output layer with 10 neurons for 10 classes and softmax activation
])
return model
# Create the model
model = create_model()
# Compile the model
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=5)
# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc}')
⑹ Types
① Type 1. Supervised Learning
○ Definition: Learning a map x → y from a problem-answer pair {(xi, yi)}.
○ Examples: Conditional probability, regression functions, supervised classification.
○ 1-1. Semi-supervised Learning
○ Similar to supervised learning but only some answers are labeled.
○ 1-2. Imitation Learning
○ While supervised learning learns the relationship between inputs and outputs, imitation learning uses other example data to learn this relationship.
○ 1-3. Meta Learning
○ Learns the results or learning process of other machine learning algorithms.
② Type 2. Unsupervised Learning
○ Definition: Understanding the underlying structure of a feature vector x provided without labels.
○ Examples: Clustering, dimensionality reduction, topic modeling, DIP (deep image prior), style transfer.
○ The difference from Prior Learning
○ Typical CNN neural networks undergo prior learning with large datasets to create prediction models.
○ Prior learning is necessary when training/prediction happens simultaneously.
○ On the other hand, methods like DIP work in a self-taught manner and do not require prior learning.
③ Type 3. Reinforcement Learning
○ Definition: A method that learns the best action in each state based on rewards (π(action state)) given for actions taken under specific conditions.
○ Clues to the correct answer are implicitly provided in the form of rewards.
④ Type 4. Self-supervised Learning
○ Definition: Machine learning that proceeds by creating problems and answer sheets independently, like a toy example.
○ Similarities with supervised learning: Differentiates between test/training datasets.
○ Similarities with unsupervised learning: The model generates the answer sheet.
○ 4-1. Auto-associative Learning: Also known as autoencoder.
○ Definition: A neural network that compresses input data as much as possible and then restores the compressed data to its original form.
○ Encoder: Neural network from the input layer to the hidden layer.
○ Decoder: Neural network from the hidden layer to the output layer.
○ Encoders and decoders are trained together.
○ Effect 1. Data Compression: If the hidden layer has fewer nodes than the input layer, the network can compress the input data.
○ Effect 2. Data Abstraction: Converts complex data into multidimensional vectors, enabling classification and recombination of input data.
○ These multidimensional vectors are also called hidden layers or features.
○ PCA is a representative recombination algorithm.
○ 4-2. Masking-based
○ Example: iBOT (masking + distillation).
○ 4-3. Distillation-based
○ Example: DINO (distillation), iBOT (masking + distillation).
○ 4-4. Contrastive Learning
○ Definition: A method of learning by comparing target data with control data.
○ Applies information theory’s surprisal to define loss functions as: $-(1/N) × (Σ log Φ_target + Σ log (1-Φ_control))$, where Φ is a probability.
○ Examples: SimCLR, MoCo, BYOL.
○ 4-5. Foundation Model
○ Definition: A machine learning model operating in an inductive way, necessarily self-supervised.
○ Inductive way: The ability to infer (e.g., knowing “apple” is 사과 and “banana” is 바나나, one can guess “melon” is 멜론).
○ Difference with Generative Model: Generative model refers to models that create new example data.
○ Foundation models may or may not be generative.
○ DIP (deep image prior) is a generative model but not a foundation model because it does not work in an inductive way.
○ Foundation models are categorized as LLM, LMM, and diffusion models.
○ Paradigm Shift: Machine Learning (2000s, feature-centric) → Deep Learning (2010s, model-centric) → Foundation Model (2020s, data-centric).
Figure 1. Paradigm Shift on Machine Learning
○ 4-5-1. LLM (Large Language Model) and LMM (Large Multimodal Model).
○ 4-5-2. Diffusion Learning
○ Models that intentionally add noise to an original image and then generate the original image from the noisy image.
○ Generative model: Can generate various forms of AI-generated images.
○ Diffusion learning operates inductively, distinguishing it from DIP.
○ Diffusion models are also used in chemoinformatics (e.g., DiffPack, RFdiffusion).
○ 4-6. Transfer Learning
○ Algorithms that retrain a learned model by modifying its final output layer.
○ Allows the application of pretrained models to new tasks different from the originally intended task (e.g., Geneformer).
○ It does not necessarily have to be a self-supervised learning approach; for example, Step2Heart implemented using supervised learning.
○ Requires self-supervised learning to operate inductively for different tasks.
⑤ There are many other types.
○ Representation Learning
○ Active Learning
○ Online Learning, Incremental Learning, Never-ending Learning
○ Curriculum Learning: A method of learning from easy to difficult stages.
○ Few-shot Learning, One-shot Learning
○ Multi-instance Learning, Multi-label Learning, Distributional Learning
○ Metric Learning, Kernel Learning
○ Federated Learning: Training models on distributed datasets. Advantages include preserving data privacy and efficiently utilizing data from multiple sources.
⑺ Applications
① Games: Chess, AlphaGo, StarCraft.
② Chatbots: Watson, Siri, Alexa, Google Assistant.
○ Watson: The first AI system to defeat human participants on a TV program.
③ Image Analysis: TikTok, Snapchat.
④ Recommendation Systems: Spotify (music), Netflix (movies), YouTube (videos), TikTok (short videos).
⑤ Autonomous Driving: Waymo, Tesla.
⑥ Biological Research: Alphafold, Alphafold2.
2. Evaluation of Machine Learning Algorithms
⑴ Weight Space
Figure 2. Location of two neurons in weight space.
⑵ Curse of Dimensionality
① Hypersphere
○ Definition: A set of points in any n-dimensional space that are a distance of 1 from the origin.
○ The volume of a hypersphere approaches 0 as the dimension increases beyond n > 2π.
○ Vn = (2π/n) Vn-2.
○ Implication: As dimensions increase, more data enters the unit hypercube.
② Curse of Dimensionality
○ Phenomenon where the number of sample data needed for distribution analysis or model estimation increases exponentially with the number of dimensions.
○ Reason: The density of the same amount of data rapidly decreases as dimensions increase.
⑶ Overfitting
① Overtraining
Figure 3. Function generalizing even noise due to overfitting.
○ Overtraining in machine learning is as dangerous as undertraining.
○ Overtraining artificial neural networks causes overfitting, even learning errors and inaccuracies in data.
○ Predictive accuracy decreases.
② Strategies to avoid overfitting
○ Strategies to stop training before neural networks overfit.
○ Strategies to input error terms during each training iteration.
⑷ Neural Network Evaluation
① Definition: Evaluating how well a neural network generalizes using out-of-sample data.
② Dataset
○ Training Data: Data used for training within a given sample.
○ Testing Data: Data not used for training, used to evaluate the trained model.
○ Validation Set: A third dataset from a different population used to check for overfitting.
○ Training and testing data are classified as in-sample data.
○ Validation set is classified as out-of-sample data.
③ Dataset Selection
○ When sufficient data is available: Split into training : testing : validation as 50 : 25 : 25, 60 : 20 : 20, or 50 : 30 : 20.
○ When sufficient data is not available: Reuse certain data for training, testing, and validation. Typically done in three ways.
④ Cross-validation (rotation estimation, out-of-sample testing))
○ Definition: Resampling the given data to allocate training, testing, and validation sets.
○ Type 1. Exhaustive Cross-validation:
○ 1-1. Leave-one-out Cross-validation (LOOCV): Can lead to overfitting compared to m-fold cross-validation.
Figure 4. Leave-one-out Cross-validation.
○ 1-2. Leave-some-out Cross-validation.
○ Type 2. Non-exhaustive Cross-validation:
○ 2-1. Multi-fold (m-fold) Cross-validation:
Figure 5. m-fold Cross-validation.
Figure 6. m-fold Cross-validation.
○ 1st: Split the given sample into m parts.
○ 2nd: Use m-1 parts for parameter calculation.
○ 3rd: Use the remaining 1 part for performance evaluation.
○ 4th: Repeat for all different combinations m times.
○ 5th: Calculate the average to determine the final estimate.
○ Typically, 10-fold cross-validation is used.
○ 2-2. Holdout Method.
○ 2-3. Repeated Random Sub-sampling Validation.
○ Application 1. Early Stopping: Prevents the model from overfitting the given data.
Figure 7. Validation and Early Stopping.
○ Application 2. Hyperparameter Tuning: Hyperparameters are user-defined parameters.
○ **Application 3. Feature Subset Problem.
⑸ Confusion Matrix
Table 1. Confusion Matrix.
① “7” represents the number of times the classifier correctly predicted class C1.
② Accuracy = (7 + 8 + 9) ÷ (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9) = 24/45.
① True Positive (TP): When the actual value is true and the predicted value is true. This means a genuinely positive case.
○ #TP = 7 + 8 + 9 = 24
② False Positive (FP): When the actual value is false and the predicted value is true. This means a falsely positive case.
○ #FP = 1 + 2 + 3 + 4 + 5 + 6 = 21.
③ True Negative (TN): When the actual value is false and the predicted value is false. This means a genuinely negative case.
○ #TN
④ False Negative (FN): When the actual value is true and the predicted value is false. This means a falsely negative case.
○ #FN
⑤ Accuracy = (#TP + #TN) ÷ (#TP + #FP + #TN + #FN).
○ Error Rate = 1 - Accuracy.
⑥ Sensitivity (True Positive Rate, TPR) or Recall = #TP ÷ (#TP + #FN).
○ Reversing the meaning of positive and negative switches sensitivity to specificity.
⑦ Specificity = #TN ÷ (#TN + #FP).
○ Reversing the meaning of positive and negative switches specificity to sensitivity.
⑧ Precision (Positive Predictive Value, PPV) = #TP ÷ (#TP + #FP).
⑨ Negative Predictive Value (NPV): TN ÷ (TN + FN).
⑩ False Positive Rate (FDR, false discovery rate): FP ÷ (TN + FP) = 1 - specificity.
⑪ F1 Score = 2 × Precision × Recall ÷ (Precision + Recall) = #TP ÷ [#TP + (#FN + #FP)/2].
○ Combines precision and recall into a single performance metric.
○ Ranges between 0 and 1.
○ Higher F1 scores indicate better performance with both high precision and recall.
⑫ Kappa Statistic (κ):
○ K = (Pr(a) - Pr(e)) / (1 - Pr(e)).
○ K: Kappa correlation coefficient.
○ Pr(a): Probability of agreement between predictions.
○ Pr(e): Probability of agreement by chance.
○ Measures agreement between two observers on categorical data.
○ Values range from 0 to 1, with 1 indicating perfect agreement and 0 indicating no agreement.
○ Explains that model evaluation results are not random beyond accuracy.
⑬ Matthews Correlation Coefficient (MCC).
① Adjusting the threshold typically shows an inverse relationship between sensitivity and specificity.
Figure 8. Trends of sensitivity and specificity with threshold.
② ROC Curve(receiver operating characteristic curve)
○ A graph visualizing sensitivity on the vertical axis and 1-specificity (FDR) on the horizontal axis.
Figure 9. ROC Curve.
○ The ideal scenario is sensitivity = 1 and specificity = 1.
○ AUC (Area Under the Curve): Ranges from 0 to 1. Higher values indicate better performance.
Figure 10. AUC calculation process
Figure 11. AUC calculation process
③ Concordance Index: Refers to the area under the ROC curve.
④ When the ROC curve is random, the concordance index = 0.5.
⑤ Concordance cannot exceed 1.
⑥ AUPRC (Area Under the Precision-Recall Curve)
○ Calculated using precision and recall instead of sensitivity and specificity, as used in AUROC.
○ Preferred over AUC when the number of positive cases (Class 1) and negative cases (Class 2) is imbalanced.
⑻ Factors Affecting Data Input
① GIGO (Garbage In, Garbage Out): Meaning that poor input data results in poor output.
② Imbalanced Dataset:
③ Outliers and Missing Values
Input: 2018.06.09 10:00.
Modified: 2024.04.02 16:01.