
Lecture 18. Regularization in Regression Analysis (regularization, penalization)

Recommended post: 【Statistics】 Statistics Table of Contents


1. Overview

2. MSPE

3. Technique 1. Ridge regression

4. Technique 2. Lasso regression

5. Technique 3. Elastic net

6. Technique 4. SelectFromModel



1. Overview

⑴ Problems in regression analysis: Mainly prominent when there are many regression variables

Multicollinearity

Underfitting: The model lacks flexibility and cannot properly learn the given data

Overfitting

○ In standard regression like OLS estimation, the model learns the noise in the sample, reducing predictive power

○ Accepting some bias during training (rather than fitting the noise) can actually improve predictive power

⑵ Regularization (penalization)

① To address these problems, a penalty term on the parameters is added to the objective function

② Note that not applying regularization causes overfitting, while too much causes underfitting

③ Standardization of data must be performed

○ Without standardization, a feature measured on a large numerical scale gets a small coefficient and is effectively under-penalized

○ Conversely, a feature measured on a small numerical scale gets a large coefficient, which may be over-penalized and shrunk too much

④ Often includes tuning hyperparameters (e.g., the penalty weight λ) on a validation set, as in the sketch below
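
The following sketch is not part of the original lecture; it assumes scikit-learn and illustrates ③ and ④: features are standardized first, and the penalty weight is chosen by cross-validation, here with Ridge regression as the example.

```python
# A minimal sketch (assumes scikit-learn): standardize the features and
# tune the penalty weight by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# Standardization first, so every coefficient is penalized on a comparable scale
pipe = make_pipeline(StandardScaler(), Ridge())

# The penalty weight (lambda) is called `alpha` in scikit-learn
grid = GridSearchCV(pipe, {"ridge__alpha": np.logspace(-3, 3, 13)}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```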

⑤ Expected results of regularization


image

Figure 1. Expected results of regularization



2. MSPE (Mean Squared Prediction Error)

⑴ Overview

① Error: Assuming e is squared error, h is hypothesis, and f is true function

Type 1. In-sample error: Also called training error. Similar to bias


image


Type 2. Out-of-sample error: Also called generalization error, MSPE. Similar to variance


image


Step 1. Build prediction model using given sample

Step 2. Compare predicted and actual values using data outside the sample (X^OOS, Y^OOS)

○ Note: ŷ refers to prediction obtained using in-sample data

④ (Reference) bias-variance tradeoff

⑤ Best predictor: called the oracle predictor, E(Y^OOS | X^OOS)

○ The prediction error in the MSPE decomposes into the following two parts

○ Fundamental error: cannot be reduced. Y^OOS − E(Y^OOS | X^OOS)
○ Estimation error: Ŷ(X^OOS) − E(Y^OOS | X^OOS)
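
As a reference, the decomposition implied by the two error terms above can be written out explicitly. This is a standard identity, assuming the out-of-sample observation is drawn independently of the training sample (so the cross term vanishes):

$$\text{MSPE} = E\Big[\big(Y^{OOS} - \hat{Y}(X^{OOS})\big)^2\Big]
= \underbrace{E\Big[\big(Y^{OOS} - E(Y^{OOS}\mid X^{OOS})\big)^2\Big]}_{\text{fundamental error}}
+ \underbrace{E\Big[\big(\hat{Y}(X^{OOS}) - E(Y^{OOS}\mid X^{OOS})\big)^2\Big]}_{\text{estimation error}}$$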

⑵ MSPE estimator


image


① If β were known, MSPE = σ_u² would hold

② With many regressors, k/n may be large, so the estimation error can be substantial
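
The formula in the image above is not legible here; a plausible reconstruction, based on the standard result for the OLS predictor with standardized regressors (the exact expression in the original may differ), is:

$$\text{MSPE}_{\text{OLS}} \approx \Big(1 + \frac{k}{n}\Big)\,\sigma_u^2$$

This is consistent with ① and ②: if β were known the k/n term would vanish, and when the number of regressors k is large relative to n the estimation error inflates the MSPE.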

⑶ Assumptions

Assumption 1. No multicollinearity

Assumption 2. (X^OOS, Y^OOS) are randomly drawn from the same population

⑷ Transformations

① Standardization

○ (X*_1i, ···, X*_ki, Y*_i) denote the values in the original (untransformed) sample

○ Define X_ji as (X*_ji − μ_Xj) / σ_Xj

○ The dependent variable is demeaned: Y_i = Y*_i − μ_Y*

② Principle of shrinkage


image


○ Can reduce MSPE

○ Bias occurs instead: Tradeoff

○ Most famous example is James-Stein estimator

⑸ In-sample MSPE calculation: m-fold cross validation is commonly used

① 1st. Split given sample into m parts

② 2nd. Use m-1 parts to estimate parameters: Training data

③ 3rd. Use the remaining part to evaluate performance: Testing data

④ 4th. Repeat m times with different combinations

⑤ 5th. Take average to determine final estimator


image


⑥ Typically 10-fold cross validation is used
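
A minimal sketch of the m-fold procedure above (it assumes scikit-learn and a Ridge model with a fixed λ; the dataset and settings are illustrative):

```python
# Estimate the MSPE of a Ridge model by m-fold cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)

m = 10                                                           # typically 10 folds
kf = KFold(n_splits=m, shuffle=True, random_state=0)
fold_mspe = []
for train_idx, test_idx in kf.split(X):                          # 1st: split into m parts
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])     # 2nd: fit on m-1 parts
    resid = y[test_idx] - model.predict(X[test_idx])             # 3rd: evaluate on the held-out part
    fold_mspe.append(np.mean(resid ** 2))                        # 4th: repeat m times
print("estimated MSPE:", np.mean(fold_mspe))                     # 5th: average over the folds
```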

⑹ Out-of-sample root MSPE calculation

① Use model trained on in-sample data to evaluate performance on different sample

② This different sample is called validation set



3. Technique 1. Ridge regression

⑴ Overview

① Definition: penalizes the squared values of the coefficients to control model complexity; the penalty is a function of the weights

② Also called L2 regularization

③ Introduced in 1962 by A. E. Hoerl to solve non-invertibility of the regression matrix


image


Equivalent to MAP estimation with a Gaussian prior (see Application 3 below)

⑵ Objective function

① Simple form


image


② PRSS (penalized residual sum of squares)


image
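
A standard form of the penalized objective consistent with ① and ② above (a reconstruction; the notation in the original images may differ):

$$S_{\text{Ridge}}(b) \;=\; \sum_{i=1}^{n}\big(Y_i - b_1 X_{1i} - \cdots - b_k X_{ki}\big)^2 \;+\; \lambda_{\text{Ridge}} \sum_{j=1}^{k} b_j^2$$

The first term is the usual residual sum of squares and the second is the penalty, so the whole expression is the PRSS.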


Case 1. Regression variables are uncorrelated

① Simple form: can be expressed as a shrunken version of the OLS estimate β̂_j obtained when λ = 0


image


② Matrix form: Ridge objective function is convex, allowing easy solution via differentiation


image
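
Standard closed-form expressions consistent with ① and ② (reconstructions; the scaling convention for λ_Ridge in the original images may differ, e.g. some texts divide λ by n):

$$\hat{\beta}^{\text{Ridge}} = \big(X^{\top}X + \lambda_{\text{Ridge}} I\big)^{-1} X^{\top} Y,
\qquad
\hat{\beta}^{\text{Ridge}}_{j} = \frac{\hat{\beta}^{\text{OLS}}_{j}}{1 + \lambda_{\text{Ridge}}} \;\;\text{(uncorrelated, standardized regressors)}$$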


Case 2. Regression variables are correlated: must examine the MSPE as a function of λ_Ridge

① Bias-variance trade-off


image

Figure 2. General bias-variance trade-off


② λ_Ridge is chosen via cross-validation

③ λ_Ridge = 0 fits best in-sample but not out-of-sample


image

Figure 3. Square root of MSPE as a function of λ_Ridge


⑸ Characteristics of Ridge regression solution

① Even when XᵀX is not invertible, adding λI makes XᵀX + λI invertible

② Each λ gives one estimator

③ λ → 0: Overfitting. Reaches linear regression (OLS) solution

④ λ → ∞: Underfitting. Coefficients w approach 0 (large coefficients are heavily penalized)

Application 1. Soft-order constraints: turns an inequality constraint of the form ‖w‖² ≤ C into an equivalent penalized problem, with the constraint binding as an equality at the optimum


image


Application 2. Weight decay: add ‖w‖² to the error (the augmented error) and apply the standard neural-network update rule, as written out below

① Standard gradient descent: w(t+1) = w(t) − η∇E_in(w(t))


image
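
A standard way to write the resulting update rule (a reconstruction; some texts scale the penalty by 1/N): with the augmented error E_aug(w) = E_in(w) + λ wᵀw, gradient descent becomes

$$w(t+1) = w(t) - \eta \nabla E_{\text{aug}}(w(t)) = (1 - 2\eta\lambda)\,w(t) - \eta \nabla E_{\text{in}}(w(t))$$

so each step first decays the weights by the factor (1 − 2ηλ) and then applies the usual gradient step.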


Application 3. MAP (maximum a posteriori)

① Bayes rule


image


② General MAP learning: recall that in Bayes' rule P(D) does not depend on w, so it can be treated as a constant


image


○ Assuming a normal prior: the weights other than the intercept w0 are assumed a priori to be centered at 0 and small


image


③ MAP learning in Ridge regression


image
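
A sketch of the derivation behind ③, under the usual assumptions (Gaussian noise with variance σ² and an independent Gaussian prior w ~ N(0, τ²I) on the weights):

$$\hat{w}_{\text{MAP}} = \arg\max_{w} P(w \mid D)
= \arg\min_{w} \left[\sum_{i=1}^{n}\big(y_i - w^{\top}x_i\big)^2 + \frac{\sigma^2}{\tau^2}\,\lVert w \rVert_2^2\right]$$

which is exactly Ridge regression with λ = σ²/τ².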


Application 4. Comparison with other methods


image

Figure 4. Comparison of predictive performance



4. Technique 2. Lasso regression

⑴ Overview

① Definition: penalizes the absolute values of the coefficients to control model complexity; the penalty is a function of the weights

② Also called L1 regularization

Equivalent to MAP estimation with a Laplacian (Laplace) prior, whose density is shown in Figure 5


image

Figure 5. Laplace probability density function


⑵ Objective function

① Simple form


image


② Matrix form


image
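
A standard form of the Lasso objective consistent with the two expressions above (a reconstruction; the notation in the original images may differ):

$$S_{\text{Lasso}}(b) = \sum_{i=1}^{n}\big(Y_i - b_1 X_{1i} - \cdots - b_k X_{ki}\big)^2 + \lambda_{\text{Lasso}} \sum_{j=1}^{k} \lvert b_j \rvert
\;=\; \lVert Y - Xb \rVert_2^2 + \lambda_{\text{Lasso}} \lVert b \rVert_1$$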


⑶ Solution of the objective function: compute the MSPE as a function of λ_Lasso


image

Figure 6. Square root of MSPE as a function of λ_Lasso


① λ_Lasso is chosen via cross-validation

② Unlike Ridge, no general closed-form solution

⑷ Characteristics

① Useful when model has sparsity property: i.e., many coefficients are 0

② λ → 0: Reaches linear regression (OLS) solution. Best in-sample but poor out-of-sample

③ λ → ∞: Coefficients w approach 0 (large coefficients are heavily penalized)

Application 1. Principle of sparsity

① The Laplace prior drives the coefficients of unimportant variables exactly to 0, effectively removing those variables from the model


image

Figure 7. Changes in coefficients with shrinkage factor


Diagram of sparsity principle


image

Figure 8. Intuitive understanding of Lasso regression’s sparsity


○ Red ellipses connect points of equal MSE (mean squared error)

○ Blue region connects points with equal penalty

○ As λ increases, penalty increases, both Lasso and Ridge shrink

○ In Ridge, the optimum occurs where a red ellipse just touches the circular blue region: otherwise a point closer to the origin would achieve a smaller penalty for the same MSE

○ In Lasso, when the blue region is small the optimum tends to occur at a corner, where some coefficients are exactly 0: moving along the edge away from this corner leaves the red ellipse (→ higher MSE)

○ Unlike Ridge, Lasso induces sparsity
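
A minimal sketch of this sparsity property (it assumes scikit-learn; the dataset is synthetic and illustrative): the L1 penalty sets many coefficients exactly to zero, whereas the L2 penalty only shrinks them toward zero.

```python
# Compare the number of exactly-zero coefficients under L1 vs. L2 penalties.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge

# 50 candidate features, but only 5 actually influence y
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients equal to zero:", int(np.sum(lasso.coef_ == 0)))   # many exact zeros
print("Ridge coefficients equal to zero:", int(np.sum(ridge.coef_ == 0)))   # typically none
```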

Application 2. Comparison with other methods


image

Figure 9. Comparison of predictive performance



5. Technique 3. Elastic Net

⑴ A linear combination of lasso and ridge. Adds both sum of absolute values and squared values of weights as penalty terms


image
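
A common way to write the objective in the image above, consistent with the parameter descriptions that follow (a reconstruction; some formulations put a factor of 1/2 on the L2 term):

$$S_{\text{EN}}(b) = \sum_{i=1}^{n}\big(Y_i - X_i^{\top} b\big)^2
+ \lambda \Big[\alpha \sum_{j=1}^{k} \lvert b_j \rvert + (1-\alpha) \sum_{j=1}^{k} b_j^2\Big]$$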


Parameter 1. Alpha (α): Controls mix ratio of L1 and L2 penalties. α = 1 is Lasso, α = 0 is Ridge

Parameter 2. Lambda (λ): Controls strength of penalty. Multiplies the entire regularization term
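
A minimal usage sketch (assumes scikit-learn). Note the naming difference: scikit-learn's `alpha` plays the role of λ (overall penalty strength) and `l1_ratio` plays the role of α (the L1/L2 mix) described above.

```python
# Cross-validate both the mix ratio and the penalty strength of an elastic net.
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=40, n_informative=8,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("selected l1_ratio:", enet.l1_ratio_, "selected lambda:", enet.alpha_)
```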



6. Technique 4. SelectFromModel

⑴ A method that selects variables based on feature-importance scores from a fitted model, typically a tree-based (decision tree) algorithm
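
A minimal sketch (assumes scikit-learn): SelectFromModel keeps the features whose importance in a fitted model exceeds a threshold, here a random forest's feature_importances_ with the median importance as the cutoff.

```python
# Select features whose random-forest importance is above the median importance.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0),
                           threshold="median")
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)   # about half of the 30 features are kept
```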



Input: 2019.12.08 12:35

Edit: 2024.09.27 08:47
