
Chapter 15. Multilayer Perceptron (MLP)

Recommended Post: 【Algorithm】 Algorithm Index


1. Universal Approximation Theorem

2. Multi-Layer Perceptron Algorithm

3. MLP Applications



1. Universal Approximation Theorem

⑴ Proposed by Cybenko (1989) and Hornik (1991)

⑵ Details

① Hornik (1991) states that any bounded, continuous function f: ℝ^d → ℝ can be approximated to arbitrary accuracy by a finite-size neural network with a single hidden layer, a fixed activation function, and a single linear output neuron (ref)

② Let φ be a bounded, continuous, and monotonically increasing activation function. Also, let K_d be a compact subset of ℝ^d, and let C(K_d) be the set of continuous functions on K_d. Then, for any f ∈ C(K_d) and any ε > 0, there exist some N ∈ ℕ, real numbers v_i, b_i ∈ ℝ, and vectors ω_i ∈ ℝ^d such that, when F is defined as follows,


F(x) = Σ_{i=1}^{N} v_i φ(ω_iᵀx + b_i)


The following is obtained:


|F(x) − f(x)| < ε for all x ∈ K_d


③ The universal approximation theorem for the special case of sigmoid activation functions was already proven by Cybenko (1989)
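
As a concrete illustration of the form in ②, below is a minimal NumPy sketch of the single-hidden-layer approximator F(x) = Σ v_i φ(ω_iᵀx + b_i); the choice of tanh as φ and the random parameters are assumptions for demonstration only, not a trained approximator.

```python
import numpy as np

def F(x, v, W, b, phi=np.tanh):
    """Single-hidden-layer form from the theorem: F(x) = sum_i v_i * phi(w_i . x + b_i)."""
    return v @ phi(W @ x + b)

# Hypothetical instance: N = 50 hidden units on R^2 with random (untrained) parameters
rng = np.random.default_rng(0)
N, d = 50, 2
W = rng.normal(size=(N, d))   # the omega_i stacked as rows
b = rng.normal(size=N)        # the b_i
v = rng.normal(size=N)        # the v_i
print(F(np.array([0.5, -1.0]), v, W, b))
```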

⑶ Limitations

① The theorem is theoretically interesting, but realizing it with a single hidden layer can require an impractically large number of hidden neurons: the effectiveness of deep learning comes from depth, that is, from the number of hidden layers



2. Multi-Layer Perceptron Algorithm

⑴ Overview

① This section deals with a two-layer perceptron consisting of an input layer, a hidden layer, and an output layer (the input layer is conventionally not counted; see below)



Figure 1. Two-layer perceptron


○ Input Layer: Generally, the input layer is not included in the number of perceptron layers

○ Hidden Layer

○ Output Layer

② Parameter Definition

○ L: Number of neurons in the input layer

○ M: Number of neurons in the hidden layer. Can be set as a hyperparameter

○ N: Number of neurons in the output layer

○ i: Iteration variable representing input layer neurons

○ j: Iteration variable representing hidden layer neurons

○ k: Iteration variable representing output layer neurons

○ x_i: Node value of the i-th input layer neuron, i.e., input value

○ x_0: Defined as −1 (bias input)

○ h_j: Synaptic sum of the j-th hidden layer neuron, also written h_j^hidden to avoid confusion

○ a_j: Node value of the j-th hidden layer neuron (must be less than 1)

○ a_0: Defined as −1 (bias input)

○ h_k: Synaptic sum of the k-th output layer neuron, also written h_k^output to avoid confusion

○ y_k: Node value of the k-th output layer neuron, i.e., output value (must be less than 1)

○ t_k: Target value of the k-th output layer neuron, i.e., the value provided by the training data

○ v_0j: Bias applied to h_j (the weight multiplying x_0 = −1)

○ v_ij: Synaptic weight between the i-th input layer neuron and the j-th hidden layer neuron

○ ω_0k: Bias applied to h_k (the weight multiplying a_0 = −1)

○ ω_jk: Synaptic weight between the j-th hidden layer neuron and the k-th output layer neuron

○ v: Matrix containing all v_ij

○ w: Matrix containing all ω_jk

○ δ_h(j): Error value of the j-th hidden layer neuron

○ δ_o(k): Error value of the k-th output layer neuron

○ g(·): Activation function

③ Note that the node value is the result of applying the activation function to the synaptic sum

④ Meaning of Bias: Reflects the pattern of the class itself, independent of individual datasets
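
To make ③ and the x_0 = a_0 = −1 bias convention concrete, here is a minimal NumPy sketch of the forward pass under the definitions above; the sizes (L = 3, M = 4, N = 2) and the tanh activation are assumptions for illustration.

```python
import numpy as np

def forward(x, v, w, g=np.tanh):
    """Forward pass: each node value is the activation applied to the synaptic sum."""
    x = np.concatenate(([-1.0], x))   # prepend x0 = -1 (bias input)
    h_hidden = v @ x                  # synaptic sums h_j of the hidden layer
    a = g(h_hidden)                   # node values a_j = g(h_j)
    a = np.concatenate(([-1.0], a))   # prepend a0 = -1 (bias input)
    h_output = w @ a                  # synaptic sums h_k of the output layer
    return g(h_output)                # output values y_k = g(h_k)

# Hypothetical sizes L = 3, M = 4, N = 2; weight matrices include the bias column
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3 + 1))       # all v_ij, including v_0j
w = rng.normal(size=(2, 4 + 1))       # all w_jk, including w_0k
print(forward(np.array([0.2, -0.5, 1.0]), v, w))
```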

⑵ Activation Function

① Identity function g(x) = x: a network using it reduces to a linear classifier

② Sigmoid σ(x) = 1 / (1 + e^(−ax))

○ Differentiable, with function values in (0, 1), which is why it has traditionally been used most often as an activation function

○ Similar to biological experimental data

○ The saturation curve can be adjusted by the slope parameter a

○ As a → ∞, it becomes a hard limiter

○ When the absolute value of x becomes large, the slope becomes close to 0, which reduces the learning capability of the model (the vanishing gradient problem)

③ tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

④ arctan(x)

⑤ Hard threshold function: φ_β(x) = 1 if x ≥ β, 0 otherwise

⑥ ReLU (rectified linear unit): max(0, x). Most frequently used

⑦ leaky ReLU: max(0.1x, x)

⑧ maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)

⑨ ELU (exponential linear unit): x if x ≥ 0; α(e^x − 1) if x < 0

⑩ softmax


softmax(x)_i = e^(x_i) / Σ_j e^(x_j)
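
A minimal NumPy sketch of the activation functions listed above; the parameter defaults (a, α, β) are illustrative assumptions.

```python
import numpy as np

identity = lambda x: x                                      # linear classifier
sigmoid  = lambda x, a=1.0: 1.0 / (1.0 + np.exp(-a * x))    # a adjusts saturation
relu     = lambda x: np.maximum(0.0, x)
leaky    = lambda x: np.maximum(0.1 * x, x)
elu      = lambda x, alpha=1.0: np.where(x >= 0, x, alpha * (np.exp(x) - 1))
hard     = lambda x, beta=0.0: (x >= beta).astype(float)    # hard threshold

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # components sum to 1
```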


⑶ Loss Function: Also called error function or cost function

① Definition: A numerical measure of how accurate the predictions of the currently trained model are

Method 1. L2 Loss Function

○ Square the difference between the output and the target values, and use the sum of these squared differences, E, as the error function.

○ The coefficient 1/2 is used for convenience when differentiating; multiplying by 1/2 is not mandatory.

○ The loss function can also be defined as an average rather than a sum, i.e., E/N.

○ Therefore, the error function can be expressed as follows:


E = (1/2) Σ_k (y_k − t_k)²


Method 2. L1 Loss Function

○ Depending on the output and target values, some differences are negative and some are positive.

○ If raw differences were simply summed, they would cancel each other out, causing the naive error E0 to approach zero even for a poor model; the L1 loss therefore sums absolute differences: E = Σ_k |y_k − t_k|.

Method 3. Cross Entropy

○ General Definition


E = −Σ_i t_i log y_i


○ Binary Classification


E = −t log y − (1 − t) log(1 − y)


○ If the target t is represented as a one-hot vector [0, ···, 1, ···, 0], the cross entropy reduces to the following, where c is the index of the 1:


E = −log y_c


Method 4. KLD (Kullback-Leibler divergence)


D_KL(P ‖ Q) = Σ_x P(x) log (P(x) / Q(x))


Method 5. DT (Delaunay triangulation)



Figure 2. Example of Delaunay triangulation


② Other Types of Distance and Similarity
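
A minimal NumPy sketch of Methods 1-4 above; the eps guard against log(0) is an implementation assumption, not part of the definitions.

```python
import numpy as np

def l2_loss(y, t):
    return 0.5 * np.sum((y - t) ** 2)        # the 1/2 cancels when differentiating

def l1_loss(y, t):
    return np.sum(np.abs(y - t))             # absolute values prevent cancellation

def cross_entropy(y, t, eps=1e-12):
    return -np.sum(t * np.log(y + eps))      # reduces to -log y_c for one-hot t

def kld(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

y = np.array([0.7, 0.2, 0.1])   # predicted distribution
t = np.array([1.0, 0.0, 0.0])   # one-hot target
print(l2_loss(y, t), l1_loss(y, t), cross_entropy(y, t), kld(t, y))
```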

⑷ Backpropagation

① Definition : The process of calculating gradients for use in the gradient descent algorithm.

Example 1

Step 1. Adjust each weight ω_jk with fixed j and k values, and update in the direction of −∇E(w).

○ The result of applying the chain rule with the definition of h_k is as follows:

∂E/∂ω_jk = (∂E/∂h_k) · (∂h_k/∂ω_jk)

○ Using the definition h_k = Σ_j ω_jk a_j, the following can be calculated:

∂h_k/∂ω_jk = a_j

○ The ∂E/∂h_k term represents error or variation, which makes it important enough to consider separately: define δ_o(k) = ∂E/∂h_k.

○ This can be solved using the chain rule; with the L2 loss:

δ_o(k) = (∂E/∂y_k) · (∂y_k/∂h_k) = (y_k − t_k) g′(h_k)

○ The output value of neuron k in the output layer is as follows:

y_k = g(h_k)

○ Here, g(·) is treated as a general function, since the activation function can vary, the sigmoid being one choice.

○ Therefore, the update formula for the output-layer weight ω_jk is as follows:

ω_jk ← ω_jk − η δ_o(k) a_j

Step 2. Adjust each weight v_ij with fixed i and j values, and update in the direction of −∇E(v).

○ The result of applying the chain rule with the definition of h_j is as follows:

∂E/∂v_ij = (∂E/∂h_j) · (∂h_j/∂v_ij)

○ Using the definition h_j = Σ_i v_ij x_i, the following can be calculated:

∂h_j/∂v_ij = x_i

○ The ∂E/∂h_j term is also important enough to consider separately: define δ_h(j) = ∂E/∂h_j.

○ This can be solved using the chain rule, summing over every output neuron that the j-th hidden neuron feeds:

δ_h(j) = Σ_k (∂E/∂h_k) · (∂h_k/∂a_j) · (∂a_j/∂h_j)

○ From the definitions of the synaptic sum, activation function, and weights, the following equations emerge:

∂h_k/∂a_j = ω_jk,  ∂a_j/∂h_j = g′(h_j)

○ This directly means the following:

δ_h(j) = g′(h_j) Σ_k δ_o(k) ω_jk

○ If the sigmoid function is adopted as the activation function, g′(h_j) = a_j (1 − a_j), so the equation simplifies as follows:

δ_h(j) = a_j (1 − a_j) Σ_k δ_o(k) ω_jk

○ Therefore, it is derived as follows:

∂E/∂v_ij = δ_h(j) x_i

○ Thus, the update rule for v_ij is as follows:

v_ij ← v_ij − η δ_h(j) x_i

Step 3. If there are more hidden layers between inputs and outputs, the same calculation can be applied.

○ However, it becomes increasingly difficult to track which function should be differentiated.
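
The two update rules above can be written directly in code. Below is a minimal NumPy sketch of one training step (sigmoid activation, L2 loss, learning rate η = 0.1); the layer sizes and data are hypothetical, and the x_0 = a_0 = −1 bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_step(x, t, v, w, eta=0.1):
    """One backpropagation step for the two-layer MLP derived above (biases omitted)."""
    # Forward pass
    a = sigmoid(v @ x)                        # hidden node values a_j
    y = sigmoid(w @ a)                        # output values y_k

    # Backward pass; for the sigmoid, g'(h) = g(h) * (1 - g(h))
    delta_o = (y - t) * y * (1 - y)           # output-layer errors delta_o(k)
    delta_h = a * (1 - a) * (w.T @ delta_o)   # hidden-layer errors delta_h(j)

    # Update rules: w_jk -= eta * delta_o(k) * a_j and v_ij -= eta * delta_h(j) * x_i
    w = w - eta * np.outer(delta_o, a)
    v = v - eta * np.outer(delta_h, x)
    return v, w

# Hypothetical sizes: L = 3 inputs, M = 4 hidden neurons, N = 2 outputs
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))
w = rng.normal(size=(2, 4))
v, w = train_step(np.array([0.5, -0.2, 0.1]), np.array([1.0, 0.0]), v, w)
```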

Example 2



Figure 3. Example of Gradient Descent Algorithm


Example 3

○ Overview

○ Does not require complex differentiation.

○ Works like a recurrence relation.

○ Backpropagation runs in the reverse direction of the feed-forward algorithm.

○ Method



Figure 4. Example of backpropagation algorithm


Step 1. Forward pass

Step 2. Upstream gradient at the very end is 1

Step 3. Calculate local gradient : The gradient generated at the node itself.

Step 4. Calculate downstream gradient : Computed as upstream gradient × local gradient, based on the chain rule of calculus.

Step 5. Gradient implementation

○ Patterns in Gradient Flow



Figure 5. Patterns in Gradient Flow


○ Add gate : Gradient distributor

○ Mul gate : Swap multiplier

○ Copy gate : Gradient adder

○ Max gate : Gradient router
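
As a hand-worked illustration of these patterns, the scalar expression f = max(x, y) · (x + z) below exercises all four gates; the variable values are arbitrary.

```python
# Forward pass on f = max(x, y) * (x + z)
x, y, z = 3.0, 1.0, 2.0
m = max(x, y)        # max gate
s = x + z            # add gate
f = m * s            # mul gate

# Backward pass: the upstream gradient at the very end is 1
df = 1.0
dm, ds = df * s, df * m                 # mul gate: swap multiplier
dx_add, dz = ds, ds                     # add gate: distributes the gradient
dx_max = dm if x >= y else 0.0          # max gate: routes the gradient to the winner
dy     = dm if y > x  else 0.0
dx = dx_add + dx_max                    # copy gate: x is used twice, so gradients add
print(dx, dy, dz)                       # 8.0 0.0 3.0
```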

⑸ Gradient Descent Algorithm

① Overview

○ Definition : Method of updating weights using gradients obtained from backpropagation.

○ Also called the update algorithm.

② The loss function is generally taken to be the L2 loss function.

③ Update equation represented in matrix form:

w ← w − η ∂E/∂w,  v ← v − η ∂E/∂v

④ Update equation for a single parameter:

ω_jk ← ω_jk − η ∂E/∂ω_jk

⑤ Limitations : Still unresolved issues

Problem 1. The loss surface may not be convex; it can contain non-convex regions or saddle points.

Problem 2. The cost function must be differentiable.

○ The 0/1 loss, for example, is not differentiable.

○ This is typically resolved by substituting a differentiable surrogate function.

Problem 3. Learning near a local minimum is not fast.

Application 1. Stochastic Gradient Descent (SGD) : Proposed by Rumelhart et al. (1988)

○ Definition : Considers a randomly sampled subset to compensate for the time-consuming nature of full gradient descent.

Type 1. Mini-Batch SGD : Usually considers minibatches of 32, 64, 128, ···, 8192.

Type 2. SGD : The most extreme case updates parameters based on a single element’s result.

○ Time cost : Mini-Batch SGD < full-batch gradient descent < single-sample SGD

○ Mini-Batch SGD reduces time cost by batching, which reduces the number of samples involved in each gradient calculation.

○ Single-sample SGD, however, incurs extra time cost from generating a batch for every single sample.

○ Optimum loss value : full-batch gradient descent < Mini-Batch SGD < single-sample SGD

○ Creating batches sacrifices some accuracy to save time cost.

○ In theory, batching does not affect the value training converges to, but it does introduce noise during training.

Type 3. Momentum

① Algorithm based on the physical law that an object accelerates when a force is applied in the gradient direction.

② Movement resembles a ball rolling toward the optimal point.

Type 4. Nesterov Momentum (NAG, Nesterov accelerated gradient)

① Calculates the gradient at the position pre-applied in the momentum direction.

Type 5. AdaGrad (adaptive gradient algorithm)

① Initially learns significantly when the loss function gradient is large but reduces the learning rate as it approaches the optimal point.

Type 6. Adam (adaptive moment estimation)

① Combines the advantages of momentum and AdaGrad methods.

② Moves like momentum but with less side-to-side shaking.

Type 7. RMSProp (root mean square prop)

① Uses an exponential moving average instead of simple accumulation to reflect the most recent gradients more strongly.

Type 8. L-BFGS (limited-memory BFGS) algorithm
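
Minimal sketches of the update rules behind the optimizers above (Momentum, AdaGrad, RMSProp, Adam); the hyperparameter values are conventional defaults assumed for illustration, not prescriptions.

```python
import numpy as np

def sgd(w, grad, lr=0.01):
    return w - lr * grad

def momentum(w, grad, vel, lr=0.01, rho=0.9):
    vel = rho * vel - lr * grad                # velocity accumulates like a rolling ball
    return w + vel, vel

def adagrad(w, grad, acc, lr=0.01, eps=1e-8):
    acc = acc + grad ** 2                      # accumulated squares shrink the step over time
    return w - lr * grad / (np.sqrt(acc) + eps), acc

def rmsprop(w, grad, sq, lr=0.01, decay=0.9, eps=1e-8):
    sq = decay * sq + (1 - decay) * grad ** 2  # exponential moving average favors recent gradients
    return w - lr * grad / (np.sqrt(sq) + eps), sq

def adam(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad               # first moment (momentum-like)
    v = b2 * v + (1 - b2) * grad ** 2          # second moment (RMSProp-like)
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)  # bias correction, t >= 1
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```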



3. MLP Applications

Example 1. Classification Algorithm

Example 2. Time-Series Prediction

Example 3. Auto-Associative Learning : Also called an Autoencoder



Posted: 2018.06.09 10:01

Modified: 2021.11.21 22:29
