Chapter 10. Overview of Deep Learning

Recommended Post: 【Algorithm】 Algorithm Index

1. Overview

2. Evaluation

3. History

1. Overview

⑴ Meaning of Artificial Intelligence

① Data Science ⊃ Artificial Intelligence ⊃ Machine Learning ⊃ Artificial Neural Networks ⊃ Deep Learning

○ Artificial Intelligence (AI): A broad concept referring to computer intelligence that mimics human thinking.

○ Machine Learning: A technique that creates models suitable for solving problems by learning from data.

○ First defined by Arthur Samuel in 1959.

○ Examples: SVM, adaboost, ANN

○ Artificial Neural Networks (ANN): A machine learning algorithm modeled after human neural networks.

○ Advantage 1. Can be trained on tasks difficult to specify concretely.

○ Advantage 2. Can compress a large amount of information (Autoencoder).

○ Deep Learning: Artificial neural networks with deep multi-layered structures.

○ Differs from other machine learning techniques in that features are defined by the computer rather than humans.

○ Recently shows superior performance compared to most traditional machine learning methods in various fields.

② Strong AI vs. Weak AI

③ Turing Test

○ A computer passes the test if a human converses with it for 5 minutes without realizing it is a machine.

○ Invented by Alan Turing in 1950.

⑵ Definition of Deep Learning

① Mathematical Formulation: A DNN (Deep Neural Network) can be expressed with the following formula:

○ y_n = f(x_n-1 · w_i) + b

○ y_n: Output vector of neurons in the n-th hidden layer.

○ x_n-1: Output vector of neurons in the (n-1)-th hidden layer.

○ w_i: Weight vector of neuron i.

○ b: Bias vector.

○ f: Activation function.

② Dimension: Refers to the size of the vector. For example, the dimension of v = (a₁, …, a_n) is n.

③ Inputs: Represented as input vectors of data provided as input, expressed as x = (x₁, ···, x_ℓ).

④ Weights: Weights connecting node i and node j, represented as ω_ij.

○ Represent synapses of the brain in neural networks; ω_ij form the matrix W.

⑤ Outputs: y = (y₁, ···, y_n) = y(x, W).

⑥ Targets: t = (t₁, ···, t_n), essential for supervised learning algorithms as they require true values for training.

⑦ Activation Function g(·): Determines neuron activation based on weighted input values.

○ Threshold Function.

⑧ Error: Calculates the distance between the algorithm’s output (y) and the actual target value (t) through errors.

⑨ Parameters and Hyperparameters

○ Parameters are measured or learned from data.

○ Parameters are often stored as part of the learned model.

○ Hyperparameters are values arbitrarily set for learning.

○ Examples of hyperparameters include learning rate, tree depth in decision trees, and the number of hidden layers in neural networks.

⑶ Characteristics of DNNs

① Approximation of nonlinear transformations.

② Various hidden layers.

③ Deep Learning Feed Forward: Operations proceed from input layer → hidden layer → output layer without backward feedback loops.

④ Backward Learning: Network weights are updated in reverse order, from output layer → hidden layer.

⑷ Feature Selection

① Fewer features reduce computational complexity and the risk of overfitting.

② Method 1. Selection using domain knowledge.

③ Method 2. Reduction via feature combination: PCA, etc.

④ Method 3. Selection via criteria/ranking systems.

○ Filter-based method: Selects features based on statistical criteria. Fast and simple, advantageous for large datasets.

○ Wrapper-based method: Finds optimal feature combinations by training the model multiple times. Computationally expensive and prone to overfitting.

⑤ Method 4. The method of converting a categorical variable to a continuous variable

○ Score-Function (SF)

○ DARN

○ MuProp

○ Straight-Through (ST)

○ Slope-Annealed ST

○ Gumbel-Softmax

○ ST Gumbel-Softmax

⑸ Example: Applying deep learning models to MNIST (handwritten digit) data.

import tensorflow as tf
from tensorflow.keras import layers, models

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the pixel values of the images to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# Define a simple sequential model
def create_model():
    model = models.Sequential([
        layers.Flatten(input_shape=(28, 28)),  # Flatten the 2D image to a 1D array
        layers.Dense(128, activation='relu'),  # First hidden layer with 128 neurons and ReLU activation
        layers.Dropout(0.2),                   # Dropout layer to reduce overfitting
        layers.Dense(10, activation='softmax')  # Output layer with 10 neurons for 10 classes and softmax activation
    ])
    return model

# Create the model
model = create_model()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=5)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)

print(f'Test accuracy: {test_acc}')

⑹ Types

① Type 1. Supervised Learning

○ Definition: Learning a map x → y from a problem-answer pair {(x_i, y_i)}.

○ Examples: Conditional probability, regression functions, supervised classification.

○ 1-1. Semi-supervised Learning

○ Similar to supervised learning but only some answers are labeled.

○ 1-2. Imitation Learning

○ While supervised learning learns the relationship between inputs and outputs, imitation learning uses other example data to learn this relationship.

○ 1-3. Meta Learning

○ Learns the results or learning process of other machine learning algorithms.

② Type 2. Unsupervised Learning

○ Definition: Understanding the underlying structure of a feature vector x provided without labels.

○ Examples: Clustering, dimensionality reduction, topic modeling, DIP (deep image prior), style transfer.

○ The difference from Prior Learning

○ Typical CNN neural networks undergo prior learning with large datasets to create prediction models.

○ Prior learning is necessary when training/prediction happens simultaneously.

○ On the other hand, methods like DIP work in a self-taught manner and do not require prior learning.

③ Type 3. Reinforcement Learning

○ Definition: A method that learns the best action in each state based on rewards (π(action state)) given for actions taken under specific conditions.

○ Clues to the correct answer are implicitly provided in the form of rewards.

④ Type 4. Self-supervised Learning

○ Definition: Machine learning that proceeds by creating problems and answer sheets independently, like a toy example.

○ Similarities with supervised learning: Differentiates between test/training datasets.

○ Similarities with unsupervised learning: The model generates the answer sheet.

○ 4-1. Auto-associative Learning: Also known as autoencoder.

○ Definition: A neural network that compresses input data as much as possible and then restores the compressed data to its original form.

○ Encoder: Neural network from the input layer to the hidden layer.

○ Decoder: Neural network from the hidden layer to the output layer.

○ Encoders and decoders are trained together.

○ Effect 1. Data Compression: If the hidden layer has fewer nodes than the input layer, the network can compress the input data.

○ Effect 2. Data Abstraction: Converts complex data into multidimensional vectors, enabling classification and recombination of input data.

○ These multidimensional vectors are also called hidden layers or features.

○ PCA is a representative recombination algorithm.

○ 4-2. Masking-based

○ Example: iBOT (masking + distillation).

○ 4-3. Distillation-based

○ Example: DINO (distillation), iBOT (masking + distillation).

○ 4-4. Contrastive Learning

○ Definition: A method of learning by comparing target data with control data.

○ Applies information theory’s surprisal to define loss functions as: $-(1/N) × (Σ log Φ_target + Σ log (1-Φ_control))$, where Φ is a probability.

○ Examples: SimCLR, MoCo, BYOL.

○ 4-5. Foundation Model

○ Definition: A machine learning model operating in an inductive way, necessarily self-supervised.

○ Inductive way: The ability to infer (e.g., knowing “apple” is 사과 and “banana” is 바나나, one can guess “melon” is 멜론).

○ Difference with Generative Model: Generative model refers to models that create new example data.

○ Foundation models may or may not be generative.

○ DIP (deep image prior) is a generative model but not a foundation model because it does not work in an inductive way.

○ Foundation models are categorized as LLM, LMM, and diffusion models.

○ Paradigm Shift: Machine Learning (2000s, feature-centric) → Deep Learning (2010s, model-centric) → Foundation Model (2020s, data-centric).

Figure 1. Paradigm Shift on Machine Learning

○ 4-5-1. LLM (Large Language Model) and LMM (Large Multimodal Model).

○ 4-5-2. Diffusion Learning

○ Models that intentionally add noise to an original image and then generate the original image from the noisy image.

○ Generative model: Can generate various forms of AI-generated images.

○ Diffusion learning operates inductively, distinguishing it from DIP.

○ Diffusion models are also used in chemoinformatics (e.g., DiffPack, RFdiffusion).

○ 4-6. Transfer Learning

○ Algorithms that retrain a learned model by modifying its final output layer.

○ Allows the application of pretrained models to new tasks different from the originally intended task (e.g., Geneformer).

○ It does not necessarily have to be a self-supervised learning approach; for example, Step2Heart implemented using supervised learning.

○ Requires self-supervised learning to operate inductively for different tasks.

⑤ There are many other types.

○ Representation Learning

○ Active Learning

○ Online Learning, Incremental Learning, Never-ending Learning

○ Curriculum Learning: A method of learning from easy to difficult stages.

○ Few-shot Learning, One-shot Learning

○ Multi-instance Learning, Multi-label Learning, Distributional Learning

○ Metric Learning, Kernel Learning

○ Federated Learning: Training models on distributed datasets. Advantages include preserving data privacy and efficiently utilizing data from multiple sources.

⑺ Applications

① Games: Chess, AlphaGo, StarCraft.

② Chatbots: Watson, Siri, Alexa, Google Assistant.

○ Watson: The first AI system to defeat human participants on a TV program.

③ Image Analysis: TikTok, Snapchat.

④ Recommendation Systems: Spotify (music), Netflix (movies), YouTube (videos), TikTok (short videos).

⑤ Autonomous Driving: Waymo, Tesla.

⑥ Biological Research: Alphafold, Alphafold2.

⑻ Related software

① Tensorflow

② PyTorch

③ Hugging Face

④ Keras

⑤ H2O.ai

2. Evaluation

⑴ Weight Space

Figure 2. Location of two neurons in weight space.

⑵ Curse of Dimensionality

① Hypersphere

○ Definition: A set of points in any n-dimensional space that are a distance of 1 from the origin.

○ The volume of a hypersphere approaches 0 as the dimension increases beyond n > 2π.

○ V_n = (2π/n) V_n-2.

○ Implication: As dimensions increase, more data enters the unit hypercube.

② Curse of Dimensionality

○ Phenomenon where the number of sample data needed for distribution analysis or model estimation increases exponentially with the number of dimensions.

○ Reason: The density of the same amount of data rapidly decreases as dimensions increase.

⑶ Overfitting

① Overtraining

Figure 3. Function generalizing even noise due to overfitting.

○ Overtraining in machine learning is as dangerous as undertraining.

○ Overtraining artificial neural networks causes overfitting, even learning errors and inaccuracies in data.

○ Predictive accuracy decreases.

② Strategies to avoid overfitting

○ Strategies to stop training before neural networks overfit.

○ Strategies to input error terms during each training iteration.

⑷ Neural Network Evaluation

① Definition: Evaluating how well a neural network generalizes using out-of-sample data.

② Dataset

○ Training Data: Data used for training within a given sample.

○ Testing Data: Data not used for training, used to evaluate the trained model.

○ Validation Set: A third dataset from a different population used to check for overfitting.

○ Training and testing data are classified as in-sample data.

○ Validation set is classified as out-of-sample data.

③ Dataset Selection

○ When sufficient data is available: Split into training : testing : validation as 50 : 25 : 25, 60 : 20 : 20, or 50 : 30 : 20.

○ When sufficient data is not available: Reuse certain data for training, testing, and validation. Typically done in three ways.

④ Cross-validation (rotation estimation, out-of-sample testing))

○ Definition: Resampling the given data to allocate training, testing, and validation sets.

○ Type 1. Exhaustive Cross-validation:

○ 1-1. Leave-one-out Cross-validation (LOOCV): Can lead to overfitting compared to m-fold cross-validation.

Figure 4. Leave-one-out Cross-validation.

○ 1-2. Leave-some-out Cross-validation.

○ Type 2. Non-exhaustive Cross-validation:

○ 2-1. Multi-fold (m-fold) Cross-validation:

Figure 5. m-fold Cross-validation.

Figure 6. m-fold Cross-validation.

○ 1^st: Split the given sample into m parts.

○ 2^nd: Use m-1 parts for parameter calculation.

○ 3^rd: Use the remaining 1 part for performance evaluation.

○ 4^th: Repeat for all different combinations m times.

○ 5^th: Calculate the average to determine the final estimate.

○ Typically, 10-fold cross-validation is used.

○ 2-2. Holdout Method.

○ 2-3. Repeated Random Sub-sampling Validation.

○ Application 1. Early Stopping: Prevents the model from overfitting the given data.

Figure 7. Validation and Early Stopping.

○ Application 2. Hyperparameter Tuning: Hyperparameters are user-defined parameters.

○ **Application 3. Feature Subset Problem.

⑸ Confusion Matrix

Table 1. Confusion Matrix.

① “7” represents the number of times the classifier correctly predicted class C1.

② Accuracy = (7 + 8 + 9) ÷ (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9) = 24/45.

⑹ Accuracy Metrics

① True Positive (TP): When the actual value is true and the predicted value is true. This means a genuinely positive case.

○ #TP = 7 + 8 + 9 = 24

② False Positive (FP): When the actual value is false and the predicted value is true. This means a falsely positive case.

○ #FP = 1 + 2 + 3 + 4 + 5 + 6 = 21.

③ True Negative (TN): When the actual value is false and the predicted value is false. This means a genuinely negative case.

○ #TN

④ False Negative (FN): When the actual value is true and the predicted value is false. This means a falsely negative case.

○ #FN

⑤ Accuracy = (#TP + #TN) ÷ (#TP + #FP + #TN + #FN).

○ Error Rate = 1 - Accuracy.

⑥ Sensitivity (True Positive Rate, TPR) or Recall = #TP ÷ (#TP + #FN).

○ Reversing the meaning of positive and negative switches sensitivity to specificity.

⑦ Specificity = #TN ÷ (#TN + #FP).

○ Reversing the meaning of positive and negative switches specificity to sensitivity.

⑧ Precision (Positive Predictive Value, PPV) = #TP ÷ (#TP + #FP).

⑨ Negative Predictive Value (NPV): TN ÷ (TN + FN).

⑩ False Positive Rate (FDR, false discovery rate): FP ÷ (TN + FP) = 1 - specificity.

⑪ F1 Score = 2 × Precision × Recall ÷ (Precision + Recall) = #TP ÷ [#TP + (#FN + #FP)/2].

○ Combines precision and recall into a single performance metric.

○ Ranges between 0 and 1.

○ Higher F1 scores indicate better performance with both high precision and recall.

⑫ Kappa Statistic (κ):

○ K = (Pr(a) - Pr(e)) / (1 - Pr(e)).

○ K: Kappa correlation coefficient.

○ Pr(a): Probability of agreement between predictions.

○ Pr(e): Probability of agreement by chance.

○ Measures agreement between two observers on categorical data.

○ Values range from 0 to 1, with 1 indicating perfect agreement and 0 indicating no agreement.

○ Explains that model evaluation results are not random beyond accuracy.

⑬ Matthews Correlation Coefficient (MCC).

⑺ Concordance

① Adjusting the threshold typically shows an inverse relationship between sensitivity and specificity.

Figure 8. Trends of sensitivity and specificity with threshold.

② ROC Curve(receiver operating characteristic curve)

○ A graph visualizing sensitivity on the vertical axis and 1-specificity (FDR) on the horizontal axis.

Figure 9. ROC Curve.

○ The ideal scenario is sensitivity = 1 and specificity = 1.

○ AUC (Area Under the Curve): Ranges from 0 to 1. Higher values indicate better performance.

Figure 10. AUC calculation process

Figure 11. AUC calculation process

③ Concordance Index: Refers to the area under the ROC curve.

④ When the ROC curve is random, the concordance index = 0.5.

⑤ Concordance cannot exceed 1.

⑥ AUPRC (Area Under the Precision-Recall Curve)

○ Calculated using precision and recall instead of sensitivity and specificity, as used in AUROC.

○ Preferred over AUC when the number of positive cases (Class 1) and negative cases (Class 2) is imbalanced.

⑻ Factors Affecting Data Input

① GIGO (Garbage In, Garbage Out): Meaning that poor input data results in poor output.

② Imbalanced Dataset:

③ Outliers and Missing Values

3. History

⑴ 1958: Frank Rosenblatt introduced the Perceptron, a simple neural network for binary classification.

⑵ 1969: Minsky & Papert published Perceptrons, highlighting limitations and sparking the first AI “winter”.

⑶ 1982: John Hopfield introduced the Hopfield network, reviving interest in neural networks.

⑷ 1986: Geoffrey Hinton and colleagues popularized backpropagation, enabling training of deep neural networks.

Input: 2018.06.09 10:00.

Modified: 2024.04.02 16:01.

1138

Chapter 10. Overview of Deep Learning

1. Overview

2. Evaluation

3. History

results matching ""

No results matching ""