Lecture 18. Regularization in Regression Analysis (regularization, penalization)
Recommended post : 【Statistics】 Statistics Table of Contents
1. Overview
2. MSPE
3. Technique 1. Ridge regression
4. Technique 2. Lasso regression
5. Technique 3. Elastic net
6. Technique 4. SelectFromModel
1. Overview
⑴ Problems in regression analysis : Mainly prominent when there are many regression variables
② Underfitting : The model lacks flexibility and cannot properly learn the given data
③ Overfitting
○ In standard regression like OLS estimation, the model learns the noise in the sample, reducing predictive power
○ However, deliberately accepting some bias during training (shrinkage) can actually improve predictive power
⑵ Regularization (penalization)
① To mitigate these problems, a penalty term on the parameters (coefficients) is added to the objective function
② Note that not applying regularization causes overfitting, while too much causes underfitting
③ Standardization of data must be performed
○ Features measured on a small scale tend to have large coefficients, which may be over-penalized and shrink too much
○ Conversely, features measured on a large scale tend to have small coefficients and may be under-penalized
④ Often includes tuning hyperparameters (e.g. the penalty weight λ) using a validation set
⑤ Expected results of regularization
Figure. 1. Expected results of regularization
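A minimal sketch of the idea above using scikit-learn (the simulated data, variable names, and penalty values are illustrative assumptions, not from the original post): an OLS fit and Ridge fits with a small and a very large penalty are compared on held-out data, showing how too little regularization overfits and too much underfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, k = 100, 50                       # small sample, many regressors
X = rng.normal(size=(n, k))
beta = np.zeros(k); beta[:5] = 1.0   # only 5 variables truly matter
y = X @ beta + rng.normal(scale=2.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)               # standardize using training data only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

for name, model in [("OLS (no penalty)", LinearRegression()),
                    ("Ridge, moderate lambda", Ridge(alpha=1.0)),
                    ("Ridge, huge lambda", Ridge(alpha=1e4))]:
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: out-of-sample root MSPE = {rmse:.2f}")
```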
2. MSPE
⑴ Overview
① Error : With hypothesis h, true function f, and squared error e, the error is e(h(x), f(x)) = (h(x) − f(x))²
② Type 1. In-sample error : Also called training error. Similar to bias
③ Type 2. Out-of-sample error : Also called generalization error, MSPE. Similar to variance
○ Step 1. Build prediction model using given sample
○ Step 2. Compare predicted and actual values using data outside the sample (X^OOS, Y^OOS)
○ Note: ŷ refers to prediction obtained using in-sample data
④ (Reference) bias-variance tradeoff
⑤ Best predictor : Called the oracle predictor, E(Y^OOS | X^OOS)
○ The prediction error entering the MSPE decomposes into two parts, as follows
○ Fundamental error : Y^OOS − E(Y^OOS | X^OOS). Cannot be improved upon
○ Estimation error : Ŷ(X^OOS) − E(Y^OOS | X^OOS)
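Written out, the decomposition is the standard one below; the cross term vanishes because the fundamental error has conditional mean zero and Ŷ is built only from the in-sample data.

```latex
\mathrm{MSPE} = E\!\left[\bigl(Y^{OOS} - \hat{Y}(X^{OOS})\bigr)^{2}\right]
= \underbrace{E\!\left[\bigl(Y^{OOS} - E(Y^{OOS}\mid X^{OOS})\bigr)^{2}\right]}_{\text{fundamental error}}
+ \underbrace{E\!\left[\bigl(E(Y^{OOS}\mid X^{OOS}) - \hat{Y}(X^{OOS})\bigr)^{2}\right]}_{\text{estimation error}}
```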
⑵ MSPE estimator
① If β were known, MSPE = σ_u² (the variance of the regression error) would hold
② With estimated coefficients, estimation error is added; for OLS the MSPE is approximately σ_u²(1 + k/n), which is substantially larger than σ_u² when k/n is large
⑶ Assumptions
① Assumption 1. No multicollinearity
② Assumption 2. (X^OOS, Y^OOS) are randomly drawn from the same population distribution as the in-sample data
⑷ Transformations
① Standardization
○ (X*_1i, ···, X*_ki, Y*_i) are the values from the original sample
○ Each regressor is standardized : X_ji = (X*_ji − μ_X*j) / σ_X*j
○ The dependent variable is demeaned : Y_i = Y*_i − μ_Y*
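A short numpy sketch of this transformation (the function name and arguments are illustrative): each regressor column is standardized and the dependent variable is demeaned, as described above.

```python
import numpy as np

def standardize_for_regularization(X_star, y_star):
    """Standardize regressors and demean the dependent variable."""
    X = (X_star - X_star.mean(axis=0)) / X_star.std(axis=0)  # X_ji = (X*_ji - mean) / sd
    y = y_star - y_star.mean()                                # Y_i = Y*_i - mean
    return X, y
```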
② Principle of shrinkage
○ Can reduce MSPE
○ Bias occurs instead : Tradeoff
○ Most famous example is James-Stein estimator
⑸ Estimating MSPE from the in-sample data : m-fold cross validation is commonly used
① 1st. Split given sample into m parts
② 2nd. Use m-1 parts to estimate parameters : Training data
③ 3rd. Use the remaining part to evaluate performance : Testing data
④ 4th. Repeat m times with different combinations
⑤ 5th. Average the m performance estimates to obtain the final MSPE estimate
⑥ Typically 10-fold cross validation is used
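A sketch of the m-fold procedure above with scikit-learn, assuming a Ridge model and a grid of candidate penalties (the data and grid are illustrative): the λ with the smallest cross-validated MSPE would be selected.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# X, y: standardized in-sample data (see the transformation above); simulated here
rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 20)), rng.normal(size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # m = 10 folds
for lam in [0.01, 0.1, 1.0, 10.0]:
    # cross_val_score returns negative MSE; average over the 10 folds
    mse = -cross_val_score(Ridge(alpha=lam), X, y,
                           scoring="neg_mean_squared_error", cv=cv).mean()
    print(f"lambda = {lam:>5}: CV estimate of MSPE = {mse:.3f}")
```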
⑹ Out-of-sample root MSPE calculation
① Use model trained on in-sample data to evaluate performance on different sample
② This different sample is called validation set
3. Technique 1. Ridge regression
⑴ Overview
① Definition : Penalizes the sum of squared coefficients (weights) to control model complexity; the penalty is a function of the weights
② Also called L2 regularization
③ Introduced in 1962 by A. E. Hoerl to solve non-invertibility of the regression matrix
⑵ Objective function
① Simple form
② PRSS (penalized residual sum of squares)
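The PRSS can be written in the standard summation form below (b denotes candidate coefficients and λ_Ridge the penalty weight):

```latex
S^{Ridge}(b;\ \lambda_{Ridge})
  = \sum_{i=1}^{n}\bigl(Y_i - b_1 X_{1i} - \cdots - b_k X_{ki}\bigr)^{2}
  + \lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}
```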
⑶ Case 1. Regression variables are uncorrelated
① Simple form : Can be expressed relative to the OLS estimate β̂_j obtained when λ = 0 (see the expressions below)
② Matrix form : Ridge objective function is convex, allowing easy solution via differentiation
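Writing the objective in matrix form and setting its gradient to zero gives the standard closed-form solution; under the additional assumption of standardized, uncorrelated regressors with XᵀX = I (Case 1 above), each Ridge coefficient is the OLS coefficient shrunk by a constant factor.

```latex
S^{Ridge}(b;\lambda_{Ridge}) = (Y - Xb)^{\mathsf T}(Y - Xb) + \lambda_{Ridge}\, b^{\mathsf T} b
\ \Rightarrow\
\hat{\beta}^{Ridge} = \bigl(X^{\mathsf T}X + \lambda_{Ridge} I_k\bigr)^{-1} X^{\mathsf T} Y,
\qquad
\hat{\beta}^{Ridge}_j = \frac{\hat{\beta}^{OLS}_j}{1 + \lambda_{Ridge}}
\ \ \text{(when } X^{\mathsf T}X = I\text{)}
```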
⑷ Case 2. Regression variables are correlated : Must examine MSPE with respect to λRidge
① Bias-variance trade-off
Figure. 2. General bias-variance trade-off
② λRidge is calculated via cross validation
③ λRidge = 0 fits best in-sample but not out-of-sample
Figure. 3. Square root of MSPE according to λRidge
⑸ Characteristics of Ridge regression solution
① Even when XᵀX is not invertible, adding λI makes XᵀX + λI invertible, so the solution can still be computed
② Each λ gives one estimator
③ λ → 0 : Overfitting. Reaches linear regression (OLS) solution
④ λ → ∞ : Underfitting. Coefficients w approach 0 (∵ penalty on large coefficients)
⑹ Application 1. Soft order constraints : The inequality constraint wᵀw ≤ C on the weights is equivalent, for a suitable λ, to adding the penalty λwᵀw to the objective; at the optimum the constraint is active (holds with equality)
⑺ Application 2. Weight decay : Treat the penalty term λwᵀw as part of the error (the augmented error) and apply the standard neural-network update rule
① Standard gradient descent : w(t+1) = w(t) − η∇E_in(w(t))
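Assuming the penalty λwᵀw is simply added to the in-sample error (the augmented error), the same gradient-descent step yields the familiar weight-decay update:

```latex
E_{aug}(\mathbf{w}) = E_{in}(\mathbf{w}) + \lambda\,\mathbf{w}^{\mathsf T}\mathbf{w}
\ \Rightarrow\
\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\nabla E_{aug}(\mathbf{w}(t))
               = (1 - 2\eta\lambda)\,\mathbf{w}(t) - \eta\nabla E_{in}(\mathbf{w}(t))
```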
⑻ Application 3. MAP (maximum a posteriori)
① Bayes rule : P(w | D) = P(D | w) P(w) / P(D)
② General MAP learning : Since P(D) does not depend on w, it can be treated as a constant in Bayes rule, so maximizing the posterior is equivalent to maximizing P(D | w) P(w)
○ Assuming a normal prior : the weights w (typically excluding the intercept w0) are given a zero-mean Gaussian prior, i.e., they are expected a priori to be small
③ MAP learning in Ridge regression
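A sketch of the standard Gaussian-prior derivation (σ and σ_w below denote the assumed noise and prior standard deviations; they are not symbols from the original post): taking the negative log of P(D | w)P(w) shows that the MAP estimate solves a Ridge problem with λ = σ²/σ_w².

```latex
\hat{\mathbf{w}}_{MAP}
 = \arg\max_{\mathbf{w}} P(D\mid \mathbf{w})\,P(\mathbf{w})
 = \arg\min_{\mathbf{w}} \sum_{i=1}^{n}\bigl(y_i - \mathbf{w}^{\mathsf T}\mathbf{x}_i\bigr)^{2}
   + \frac{\sigma^{2}}{\sigma_w^{2}}\,\lVert\mathbf{w}\rVert^{2}
```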
⑼ Application 4. Comparison with other methods
Figure. 4. Comparison of predictive performance
4. Technique 2. Lasso regression
⑴ Overview
① Definition : Penalizes the sum of the absolute values of the coefficients (weights) to control model complexity; the penalty is a function of the weights
② Also called L1 regularization
Figure. 5. Laplace probability density function
⑵ Objective function
① Simple form
② Matrix form
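The standard simple and matrix forms of the Lasso objective, in the same notation as the Ridge objective above:

```latex
S^{Lasso}(b;\ \lambda_{Lasso})
  = \sum_{i=1}^{n}\bigl(Y_i - b_1 X_{1i} - \cdots - b_k X_{ki}\bigr)^{2}
  + \lambda_{Lasso}\sum_{j=1}^{k}\lvert b_j\rvert
  = (Y - Xb)^{\mathsf T}(Y - Xb) + \lambda_{Lasso}\,\lVert b\rVert_{1}
```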
⑶ Solution of objective function : Compute MSPE with respect to λLasso
Figure. 6. Square root of MSPE according to λLasso
① λLasso is calculated via cross validation
② Unlike Ridge, no general closed-form solution
⑷ Characteristics
① Useful when model has sparsity property : i.e., many coefficients are 0
② λ → 0 : Reaches linear regression (OLS) solution. Best in-sample but poor out-of-sample
③ λ → ∞ : Coefficients w approach 0 (∵ penalty on large coefficients)
⑸ Application 1. Principle of sparsity
① The Laplace prior (equivalently, the L1 penalty) shrinks the coefficients of unimportant variables exactly to 0 : This effectively removes them from the model
Figure. 7. Changes in coefficients with shrinkage factor
② Diagram of sparsity principle
Figure. 8. Intuitive understanding of Lasso regression’s sparsity
○ Red ellipses connect points of equal MSE (mean squared error)
○ Blue region connects points with equal penalty
○ As λ increases, the penalty grows and both Lasso and Ridge coefficients shrink toward 0
○ In Ridge, the optimum occurs where a red ellipse just touches the circular blue region : if it did not, a solution nearer the origin would achieve a smaller penalty
○ In Lasso, when the diamond-shaped blue region is small enough, the optimum typically falls on a corner where some coefficients are exactly 0 : at this sharp point, moving along an edge leaves the red ellipse (→ higher MSE)
○ Unlike Ridge, Lasso induces sparsity
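A small scikit-learn sketch of the sparsity property (the simulated data and penalty values are illustrative): with the same penalty weight, Lasso sets most irrelevant coefficients exactly to 0 while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, k = 200, 30
X = rng.normal(size=(n, k))
beta = np.zeros(k); beta[:3] = [3.0, -2.0, 1.5]   # sparse true model: only 3 nonzero coefficients
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("Lasso: coefficients set exactly to 0 =", int(np.sum(lasso.coef_ == 0)), "of", k)
print("Ridge: coefficients set exactly to 0 =", int(np.sum(ridge.coef_ == 0)), "of", k)
```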
⑹ Application 2. Comparison with other methods
Figure. 9. Comparison of predictive performance
5. Technique 3. Elastic Net
⑴ A linear combination of the Lasso and Ridge penalties : Both the sum of absolute values and the sum of squared values of the weights are added as penalty terms
⑵ Parameter 1. Alpha (α) : Controls mix ratio of L1 and L2 penalties. α = 1 is Lasso, α = 0 is Ridge
⑶ Parameter 2. Lambda (λ) : Controls strength of penalty. Multiplies the entire regularization term
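A scikit-learn sketch (reusing the simulated X, y from the Lasso sketch above; parameter values are illustrative). Note that scikit-learn names the parameters differently from the α/λ convention used here: `l1_ratio` plays the role of α (the L1/L2 mix) and `alpha` plays the role of λ (the overall penalty strength).

```python
from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 -> pure Lasso penalty, l1_ratio = 0.0 -> pure Ridge penalty
model = ElasticNet(alpha=0.1, l1_ratio=0.5)   # alpha ~ lambda above, l1_ratio ~ alpha above
model.fit(X, y)                               # X, y: simulated data from the Lasso sketch
print(model.coef_)
```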
6. Technique 4. SelectFromModel
⑴ A scikit-learn method that selects variables based on the feature importances of a fitted model, commonly a tree-based model such as a random forest
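A scikit-learn sketch (the simulated data are illustrative): `SelectFromModel` keeps the variables whose importances from the fitted estimator, here a tree-based random forest, exceed a threshold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=300)   # only columns 0 and 3 matter

selector = SelectFromModel(RandomForestRegressor(n_estimators=200, random_state=0))
selector.fit(X, y)                        # fits the forest, then thresholds feature_importances_
print("selected columns:", np.where(selector.get_support())[0])
```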
Input: 2019.12.08 12:35
Edit: 2024.09.27 08:47