Chapter 15. Multilayer Perceptron (MLP)
1. Universal Approximation Theorem
2. Multi-Layer Perceptron Algorithm
3. MLP Applications
1. Universal Approximation Theorem
⑴ Proposed by Cybenko (1989) and Hornik (1991)
⑵ Details
① Hornik (1991) showed that any bounded continuous function f: ℝ^d → ℝ can be approximated to arbitrary accuracy by a finite-size neural network with a single hidden layer, the same activation function at every hidden neuron, and a single linear output neuron (ref)
② Let φ be a bounded, continuous, and monotonically increasing activation function. Also, let K_d be a compact subset of ℝ^d, and let C(K_d) denote the space of continuous functions on K_d. Then, for any f ∈ C(K_d) and any ε > 0, there exist some N ∈ ℕ, real numbers v_i and b_i, and vectors ω_i ∈ ℝ^d such that the function
F(x) = Σ_{i=1}^{N} v_i φ(ω_iᵀx + b_i)
satisfies |F(x) − f(x)| < ε for all x ∈ K_d.
③ The universal approximation theorem for the special case of sigmoid activation functions was already proven by Cybenko (1989)
⑶ Limitations
① The theorem is theoretically interesting, but approximating with a single hidden layer can require an impractically large number of hidden neurons: the power of deep learning lies in depth, that is, in the number of hidden layers
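As a numerical illustration of the theorem, the following sketch (my own, not from the original text) fits a single-hidden-layer network of the form F(x) = Σ v_i φ(ω_i x + b_i) to sin(x). As a simplifying assumption, the hidden parameters ω_i and b_i are drawn at random, and only the output weights v_i are solved for by least squares:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 50                                   # number of hidden neurons
x = np.linspace(-np.pi, np.pi, 200)
f = np.sin(x)                            # target function on a compact set

# Random hidden-layer parameters; phi is the sigmoid
omega = rng.normal(0, 3, N)
b = rng.normal(0, 3, N)
phi = 1.0 / (1.0 + np.exp(-(np.outer(x, omega) + b)))   # shape (200, N)

# Solve for the output weights v by least squares
v, *_ = np.linalg.lstsq(phi, f, rcond=None)
F = phi @ v
print("max |F - f| =", np.abs(F - f).max())  # small for moderate N
```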
2. Multi-Layer Perceptron Algorithm
⑴ Overview
① This section deals with a perceptron consisting of an input layer, one hidden layer, and an output layer
Figure 1. Two-layer perceptron
○ Input Layer: Generally, the input layer is not included in the number of perceptron layers
○ Hidden Layer
○ Output Layer
② Parameter Definition
○ L: Number of neurons in the input layer
○ M: Number of neurons in the hidden layer. Can be set as a hyperparameter
○ N: Number of neurons in the output layer
○ i: Iteration variable representing input layer neurons
○ j: Iteration variable representing hidden layer neurons
○ k: Iteration variable representing output layer neurons
○ x_i: Node value of the i-th input layer neuron, i.e., input value
○ x_0: Defined as −1
○ h_j: Synaptic sum of the j-th hidden layer neuron, also written h_j^hidden to avoid confusion
○ a_j: Node value of the j-th hidden layer neuron (lies in (0, 1) for a sigmoid activation)
○ a_0: Defined as −1
○ h_k: Synaptic sum of the k-th output layer neuron, also written h_k^output to avoid confusion
○ y_k: Node value of the k-th output layer neuron, i.e., output value (lies in (0, 1) for a sigmoid activation)
○ t_k: Target value of the k-th output layer neuron, i.e., the value provided by the training data
○ v_0j: Bias applied to h_j
○ v_ij: Synaptic weight between the i-th input layer neuron and the j-th hidden layer neuron
○ ω_0k: Bias applied to h_k
○ ω_jk: Synaptic weight between the j-th hidden layer neuron and the k-th output layer neuron
○ v: Matrix containing all v_ij
○ w: Matrix containing all ω_jk
○ δ_h(j): Error value of the j-th hidden layer neuron
○ δ_o(k): Error value of the k-th output layer neuron
○ g(·): Activation function
③ Note that the node value is the result of applying the activation function to the synaptic sum; the feed-forward equations are written out after this list
④ Meaning of Bias: Reflects the pattern of the class itself, independent of individual datasets
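Written out explicitly with the symbols above (assuming the x_0 = a_0 = −1 bias convention, so that v_0j and ω_0k enter the sums as bias terms), the feed-forward equations referenced in ③ are:
h_j = Σ_{i=0}^{L} v_ij x_i, a_j = g(h_j)
h_k = Σ_{j=0}^{M} ω_jk a_j, y_k = g(h_k)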
⑵ Activation Function
① Identity function g(x) = x: a network using only this activation reduces to a linear classifier
② Sigmoid σ(x) = 1 / (1 + e^(−σx))
○ Differentiable, with function values in (0, 1), which is why it has traditionally been the most commonly used activation function
○ Similar to biological experimental data
○ Saturation curve can be adjusted by σ
○ As σ → ∞, the sigmoid approaches a hard limiter
○ When the absolute value of x becomes large, the slope becomes close to 0, which reduces the learning capability of the model
③ tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
④ arctan(x)
⑤ Hard threshold function: φ_β(x) = 1 if x ≥ β; 0 otherwise
⑥ ReLU (rectified linear unit): max(0, x). Most frequently used
⑦ leaky ReLU: max(0.1x, x)
⑧ maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)
⑨ ELU (exponential linear unit): x if x ≥ 0; α(e^x − 1) if x < 0
⑩ softmax: softmax(x)_k = e^(x_k) / Σ_j e^(x_j); converts a vector of scores into a probability distribution, typically used in the output layer for classification
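As a minimal sketch, the activation functions above can be implemented in NumPy as follows (function names and default parameter values are illustrative choices, not from the original text):

```python
import numpy as np

def sigmoid(x, s=1.0):            # saturation adjustable via s (the σ parameter above)
    return 1.0 / (1.0 + np.exp(-s * x))

def hard_threshold(x, beta=0.0):  # 1 if x >= beta, else 0
    return (x >= beta).astype(float)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract max for numerical stability
    return e / e.sum()
```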
⑶ Loss Function: Also called error function or cost function
① Definition: A numerical measure of how accurate the predictions of the currently trained model are
② Method 1. L2 Loss Function
○ Square the difference between each output value y_k and target value t_k, and use the sum of these squared differences, E, as the error function.
○ The coefficient 1/2 is included for convenience when differentiating; multiplying by 1/2 is not mandatory.
○ The loss function can also be defined as an average rather than a sum, i.e., E/N.
○ Therefore, the error function can be expressed as follows:
E = ½ Σ_{k=1}^{N} (y_k − t_k)²
③ Method 2. L1 Loss Function
○ If the raw differences y_k − t_k were simply summed, some would be negative and some positive depending on the magnitudes of the output and target values.
○ These would cancel each other out, so the total error could approach zero even for a poor model; the L1 loss avoids this by summing absolute differences:
E = Σ_{k=1}^{N} |y_k − t_k|
④ Method 3. Cross Entropy
○ General definition: E = −Σ_k t_k log y_k
○ Binary classification: E = −[t log y + (1 − t) log(1 − y)]
○ If the target is a one-hot vector [0, ···, 1, ···, 0] with the 1 at position c, it reduces to E = −log y_c
⑤ Method 4. KLD (Kullback-Leibler divergence): D_KL(t ‖ y) = Σ_k t_k log(t_k / y_k), which equals the cross entropy minus the entropy of the target distribution
⑥ Method 5. DT (Delaunay triangulation)
Figure 2. Example of Delaunay triangulation
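A minimal NumPy sketch of the loss functions above (the function names and the eps guard against log 0 are my own choices):

```python
import numpy as np

def l2_loss(y, t):
    # Sum of squared differences; the 1/2 simplifies the derivative to y - t
    return 0.5 * np.sum((y - t) ** 2)

def l1_loss(y, t):
    # Absolute differences prevent positive and negative errors from canceling
    return np.sum(np.abs(y - t))

def cross_entropy(y, t, eps=1e-12):
    # General form -sum t_k log y_k; eps guards against log(0)
    return -np.sum(t * np.log(y + eps))

def kl_divergence(t, y, eps=1e-12):
    # D_KL(t || y) = sum t_k log(t_k / y_k)
    t = np.clip(t, eps, 1.0)
    return np.sum(t * np.log(t / np.clip(y, eps, 1.0)))
```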
⑷ Backpropagation
① Definition: the process of calculating gradients for use in the gradient descent algorithm.
② Example 1
○ Step 1. Adjust each weight ω_jk with fixed j and k values, updating in the direction of the negative gradient −∂E/∂ω_jk.
○ Applying the chain rule gives:
∂E/∂ω_jk = (∂E/∂h_k)(∂h_k/∂ω_jk)
○ Using the definition h_k = Σ_j ω_jk a_j, the following can be calculated:
∂h_k/∂ω_jk = a_j
○ The ∂E/∂h_k term represents the error (variation) at output neuron k, which makes it important enough to consider separately; denote it δ_o(k).
○ It can be expanded using the chain rule:
δ_o(k) = ∂E/∂h_k = (∂E/∂y_k)(∂y_k/∂h_k)
○ The output value of neuron k is y_k = g(h_k), so with the L2 loss, δ_o(k) = (y_k − t_k) g′(h_k).
○ Here, g(·) is kept as a general function, since the activation function can vary (sigmoid or otherwise).
○ Therefore, the update formula for the output-layer weight ω_jk is (with learning rate η):
ω_jk ← ω_jk − η (y_k − t_k) g′(h_k) a_j
○ Step 2. Adjust each weight v_ij with fixed i and j values, updating in the direction of the negative gradient −∂E/∂v_ij.
○ Applying the chain rule gives:
∂E/∂v_ij = (∂E/∂h_j)(∂h_j/∂v_ij)
○ Using the definition h_j = Σ_i v_ij x_i, the following can be calculated:
∂h_j/∂v_ij = x_i
○ The ∂E/∂h_j term is also important enough to consider separately; denote it δ_h(j).
○ It can be expanded using the chain rule, noting that h_j influences the error through every output neuron:
δ_h(j) = ∂E/∂h_j = Σ_k (∂E/∂h_k)(∂h_k/∂a_j)(∂a_j/∂h_j)
○ From the definitions of the synaptic sum, activation function, and weights, ∂h_k/∂a_j = ω_jk and ∂a_j/∂h_j = g′(h_j).
○ This directly means the following:
δ_h(j) = g′(h_j) Σ_k δ_o(k) ω_jk
○ If the sigmoid function is adopted as the activation function, g′(h_j) = a_j(1 − a_j), so the equation simplifies as follows:
δ_h(j) = a_j(1 − a_j) Σ_k δ_o(k) ω_jk
○ Therefore, the gradient is derived as follows:
∂E/∂v_ij = δ_h(j) x_i
○ Thus, the update rule for v_ij is:
v_ij ← v_ij − η δ_h(j) x_i
○ Step 3. If there are more hidden layers between inputs and outputs, the same calculation can be applied.
○ However, it becomes increasingly difficult to track which function should be differentiated.
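As a concrete illustration of Steps 1 and 2, below is a minimal NumPy sketch of one training step for the two-layer network, assuming a sigmoid activation and the L2 loss. Variable names mirror the symbols above, the learning rate η is written lr, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, t, v, w, lr=0.1):
    """One backpropagation step for an L-M-N MLP.
    x: (L,) input, t: (N,) target, v: (L, M) input->hidden weights,
    w: (M, N) hidden->output weights."""
    # Forward pass
    h_hidden = x @ v          # synaptic sums of the hidden layer
    a = sigmoid(h_hidden)     # hidden node values
    h_out = a @ w             # synaptic sums of the output layer
    y = sigmoid(h_out)        # output values

    # Backward pass, following the derivation above
    delta_o = (y - t) * y * (1 - y)        # delta_o(k) = (y_k - t_k) g'(h_k)
    delta_h = a * (1 - a) * (w @ delta_o)  # delta_h(j) = g'(h_j) sum_k delta_o(k) w_jk

    # Gradient descent updates
    w -= lr * np.outer(a, delta_o)         # dE/dw_jk = delta_o(k) a_j
    v -= lr * np.outer(x, delta_h)         # dE/dv_ij = delta_h(j) x_i
    return v, w, y
```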
③ Example 2
Figure 3. Example of Gradient Descent Algorithm
④ Example 3
○ Overview
○ Does not require complicated symbolic differentiation; gradients are built from simple local derivatives.
○ Works like a recurrence relation, passing gradients backward node by node.
○ Backpropagation traverses the computational graph in the direction opposite to the feed-forward pass.
○ Method
Figure 4. Example of backpropagation algorithm
○ Step 1. Forward pass
○ Step 2. The upstream gradient at the final output node is initialized to 1
○ Step 3. Calculate the local gradient: the gradient of the node's own operation with respect to its inputs.
○ Step 4. Calculate the downstream gradient: computed as upstream gradient × local gradient, based on the chain rule of calculus.
○ Step 5. Gradient implementation
○ Patterns in Gradient Flow
Figure 5. Patterns in Gradient Flow
○ Add gate: gradient distributor; passes the upstream gradient unchanged to every input
○ Mul gate: swap multiplier; each input receives the upstream gradient multiplied by the other input's value
○ Copy gate: gradient adder; upstream gradients from all copies are summed
○ Max gate: gradient router; the entire upstream gradient is routed to the input that achieved the max, while the others receive 0
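A small numerical sketch of these patterns on the toy graph f = max(x, y) · (x + z), with arbitrarily chosen input values:

```python
# Forward pass on f = max(x, y) * (x + z)
x, y, z = 3.0, 1.0, 2.0
m = max(x, y)        # max gate
s = x + z            # add gate
f = m * s            # mul gate

# Backward pass: the upstream gradient at f is 1
df = 1.0
dm = df * s          # mul gate: swap in the other factor
ds = df * m
dx_from_max = dm if x >= y else 0.0  # max gate routes to the winner
dy = 0.0 if x >= y else dm
dx_from_add = ds     # add gate distributes unchanged
dz = ds
dx = dx_from_max + dx_from_add       # copy gate: x feeds two branches, gradients add

print(dx, dy, dz)    # 8.0 0.0 3.0
```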
⑸ Gradient Descent Algorithm
① Overview
○ Definition: method of updating weights using the gradients obtained from backpropagation.
○ Also called the update algorithm.
② In what follows, the L2 loss is generally assumed as the loss function.
③ Update equation in matrix form: w ← w − η(∂E/∂w), v ← v − η(∂E/∂v), where η is the learning rate
④ Update equation for a single parameter: ω_jk ← ω_jk − η(∂E/∂ω_jk)
⑤ Limitations: still unresolved issues
○ Problem 1. The loss surface may not be convex; it can be locally concave or contain saddle points.
○ Problem 2. The cost function must be differentiable.
○ The 0/1 loss, for example, is not differentiable everywhere.
○ This is typically resolved by substituting a differentiable surrogate function.
○ Problem 3. Learning near a local minimum is slow, since the gradient there is close to zero.
⑥ Application 1. Stochastic Gradient Descent (SGD): proposed by Rumelhart et al. (1988)
○ Definition: considers a randomly sampled subset of the data at each step to compensate for the time-consuming nature of full gradient descent.
○ Type 1. Mini-Batch SGD: usually uses minibatches of 32, 64, 128, ···, 8192 samples.
○ Type 2. SGD: the most extreme case, updating parameters based on a single example at a time.
○ Time cost: Mini-Batch SGD < Full-Batch GD < single-sample SGD
○ Mini-Batch SGD reduces time cost by forming batches, which reduces the amount of data involved in each gradient calculation while still allowing vectorized computation.
○ Single-sample SGD, however, incurs overhead from generating and processing a separate batch for every example.
○ Optimum loss value: Full-Batch GD < Mini-Batch SGD < single-sample SGD
○ Creating smaller batches sacrifices some accuracy per update to save time cost.
○ In expectation, mini-batch gradient descent does not change the convergence target, but it introduces noise during training.
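A minimal sketch of a mini-batch SGD training loop; grad_fn is an assumed callable returning the gradient of the loss averaged over a batch (e.g., computed by backpropagation), and the hyperparameter defaults are illustrative:

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD sketch. data: NumPy array of samples;
    grad_fn(params, batch): gradient of the loss averaged over the batch."""
    n = len(data)
    for _ in range(epochs):
        idx = np.random.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            grad = grad_fn(params, batch)       # gradient from backpropagation
            params = params - lr * grad         # update step
    return params
```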
⑹ Type 3. Momentum
① An algorithm based on the physical analogy that an object accelerates when a force, here the negative gradient, is applied to it.
② The resulting movement resembles a ball rolling toward the optimal point.
⑺ Type 4. Nesterov Momentum (NAG, Nesterov accelerated gradient)
① Calculates the gradient at the look-ahead position reached by first applying the momentum step.
⑻ Type 5. AdaGrad (adaptive gradient algorithm)
① Learns quickly at first, when the loss gradients are large, then shrinks the effective learning rate as it approaches the optimal point by dividing by the accumulated squared gradients.
⑼ Type 6. Adam (adaptive moment estimation)
① Combines the advantages of momentum and AdaGrad methods.
② Moves like momentum but with less oscillation.
⑽ Type 7. RMSProp (root mean square prop)
① Uses an exponential moving average instead of simple accumulation to reflect the most recent gradients more strongly.
⑾ Type 8. L-BFGS (limited-memory BFGS) algorithm
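The update rules of the main optimizer types above, as a minimal NumPy sketch; the state variables (vel, cache, m, v) are assumed to be initialized to zeros of the parameter shape, and the hyperparameter defaults are common conventions rather than values prescribed by the text:

```python
import numpy as np

# Momentum: the velocity accumulates the gradient like a rolling ball
def momentum_step(w, grad, vel, lr=0.01, rho=0.9):
    vel = rho * vel - lr * grad
    return w + vel, vel

# AdaGrad: accumulated squared gradients shrink the effective learning rate
def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    cache = cache + grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

# RMSProp: exponential moving average weights recent gradients more strongly
def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

# Adam: combines momentum (first moment) with RMSProp-style scaling (second moment)
def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)      # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```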
3. MLP Applications
⑴ Example 1. Classification Algorithm
⑵ Example 2. Time-Series Prediction
⑶ Example 3. Auto-Associative Learning: also called an autoencoder; the network is trained to reproduce its input at the output, so a narrow hidden layer learns a compressed representation
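As a sketch, an auto-associative MLP is the two-layer network above trained with each input as its own target. The example below reuses the hypothetical train_step from the backpropagation sketch, with a hidden layer narrower than the input:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 8))            # 100 samples with 8 features in [0, 1)
v = rng.normal(0, 0.1, (8, 3))      # encoder: compress 8 inputs to 3 hidden units
w = rng.normal(0, 0.1, (3, 8))      # decoder: reconstruct the 8 inputs

for epoch in range(100):
    for x in X:
        v, w, y = train_step(x, x, v, w, lr=0.5)   # target t is the input x itself
```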
Posted: 2018.06.09 10:01
Modified: 2021.11.21 22:29