
Lecture 18. Regularization in Regression Analysis (regularization, penalization)

Recommended post : 【Statistics】 Statistics Table of Contents


1. Overview

2. MSPE

3. Technique 1. Ridge regression

4. Technique 2. Lasso regression

5. Technique 3. Elastic net

6. Technique 4. SelectFromModel



1. Overview

⑴ Problems in regression analysis : These arise mainly when there are many regression variables

① Multicollinearity

② Underfitting : The model lacks flexibility and cannot properly learn the given data

③ Overfitting

○ In standard regression like OLS estimation, the model learns the noise in the sample, reducing predictive power

○ Accepting some bias during training (which is what regularization does) can actually improve predictive power

⑵ Regularization (penalization)

① To address these problems, a penalty term on the parameters is added to the objective function

② Too little regularization leaves overfitting uncorrected, while too much regularization causes underfitting

③ Standardization of the data must be performed first

○ A feature measured on a small scale needs a large coefficient, which may be over-penalized and shrink too much

○ Conversely, a feature measured on a large scale gets a small coefficient and may be under-penalized

④ Often includes a step of tuning hyperparameters (e.g. the penalty weight) on a validation set

⑤ Expected results of regularization

Figure. 1. Expected results of regularization



2. MSPE

⑴ Overview

① Error : Assume the squared error e(h(x), f(x)) = (h(x) − f(x))², where h is the hypothesis and f is the true function

Type 1. In-sample error : Also called training error. Similar to bias

Type 2. Out-of-sample error : Also called generalization error or MSPE (mean squared prediction error). Similar to variance

Step 1. Build prediction model using given sample

Step 2. Compare predicted and actual values using data outside the sample (X^OOS, Y^OOS)

○ Note: ŷ refers to prediction obtained using in-sample data

④ (Reference) bias-variance tradeoff

⑤ Best predictor : Called the oracle predictor, E(Y^OOS | X^OOS)

○ The prediction error decomposes as Y^OOS − Ŷ(X^OOS) = [Y^OOS − E(Y^OOS | X^OOS)] − [Ŷ(X^OOS) − E(Y^OOS | X^OOS)]

○ Fundamental error : Y^OOS − E(Y^OOS | X^OOS). Cannot be improved

○ Estimation error : Ŷ(X^OOS) − E(Y^OOS | X^OOS). Can be reduced with a better estimator

⑵ MSPE estimator

① If β is known, MSPE = σ_u² holds (σ_u² is the variance of the error term)

② When β must be estimated, MSPE exceeds σ_u², and the inflation grows with k/n, so it matters when the number of regressors k is large relative to the sample size n

⑶ Assumptions

Assumption 1. No multicollinearity

Assumption 2. (X^OOS, Y^OOS) are randomly drawn from the same population

⑷ Transformations

① Standardization

○ (X*_1i, ···, X*_ki, Y*_i) are the values drawn from the original sample

○ Each regressor is standardized as X_ji = (X*_ji − μ_X*j) / σ_X*j

○ The dependent variable is only demeaned : Y_i = Y*_i − μ_Y* (see the code sketch at the end of this subsection)

② Principle of shrinkage

○ Can reduce MSPE

○ Bias occurs instead : Tradeoff

○ Most famous example is James-Stein estimator
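As a minimal sketch of the standardization in ① above, assuming scikit-learn and some toy data (both my own additions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the original sample (X*_1i, ..., X*_ki, Y*_i)
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))
y_raw = 3.0 * X_raw[:, 0] - 0.01 * X_raw[:, 1] + rng.normal(size=100)

# X_ji = (X*_ji - mu) / sigma : each regressor gets mean 0 and standard deviation 1
X = StandardScaler().fit_transform(X_raw)

# The dependent variable is only demeaned, so the intercept drops out of the penalty
y = y_raw - y_raw.mean()
```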

⑸ In-sample MSPE calculation : m-fold cross-validation is commonly used (see the sketch after this list)

① 1st. Split given sample into m parts

② 2nd. Use m-1 parts to estimate parameters : Training data

③ 3rd. Use the remaining part to evaluate performance : Testing data

④ 4th. Repeat m times with different combinations

⑤ 5th. Average the m results to obtain the final MSPE estimate

⑥ Typically 10-fold cross validation is used
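A minimal sketch of the 10-fold procedure above, assuming scikit-learn, a ridge model as the predictor, and synthetic data from make_regression:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=15.0, random_state=0)

# 1st-4th: KFold splits the sample into 10 parts; each part is held out once for testing
cv = KFold(n_splits=10, shuffle=True, random_state=0)
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                          scoring="neg_mean_squared_error")

# 5th: average the 10 held-out errors to get the final MSPE estimate
mspe_hat = -neg_mse.mean()
print(f"CV estimate of MSPE: {mspe_hat:.2f} (root MSPE: {np.sqrt(mspe_hat):.2f})")
```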

⑹ Out-of-sample root MSPE calculation

① Use the model trained on the in-sample data to evaluate performance on a different sample

② This different sample is called the validation set
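A minimal sketch of this out-of-sample evaluation, assuming a held-out validation split and the same kind of synthetic data as above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=0)

# Fit on the in-sample data, evaluate on a sample the model has never seen
X_in, X_oos, y_in, y_oos = train_test_split(X, y, test_size=0.3, random_state=0)
model = Ridge(alpha=1.0).fit(X_in, y_in)

root_mspe = np.sqrt(mean_squared_error(y_oos, model.predict(X_oos)))
print(f"Out-of-sample root MSPE: {root_mspe:.2f}")
```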



3. Technique 1. Ridge regression

⑴ Overview

① Definition : Penalizes the squared values of the coefficients to control model complexity. The penalty is a function of the weights

② Also called L2 regularization

③ Introduced in 1962 by A. E. Hoerl to solve non-invertibility of the regression matrix

④ Equivalent to MAP learning with a Gaussian prior on the weights

⑵ Objective function

① Simple form : Minimize Σᵢ (Yᵢ − b₁X₁ᵢ − ··· − b_kX_kᵢ)² + λ Σⱼ bⱼ² over b

② PRSS (penalized residual sum of squares) : In matrix form, PRSS(β; λ) = (Y − Xβ)ᵀ(Y − Xβ) + λβᵀβ

Case 1. Regression variables are uncorrelated

① Simple form : Each ridge coefficient can be expressed relative to the OLS estimate β̂_j obtained when λ = 0, shrunk toward 0 by a factor that grows with λ

② Matrix form : The ridge objective function is convex, so differentiating and setting the gradient to 0 gives the closed-form solution β̂_Ridge = (XᵀX + λI)⁻¹XᵀY
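A minimal numerical check of this closed form, assuming NumPy and scikit-learn (fit_intercept=False so both solve the same objective):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
lam = 2.0

# Closed form from differentiating the convex ridge objective
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimizes ||y - Xb||^2 + alpha * ||b||^2, so alpha plays the role of lambda
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))  # True up to numerical tolerance
```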

Case 2. Regression variables are correlated : Must examine MSPE as a function of λ_Ridge

① Bias-variance trade-off

Figure. 2. General bias-variance trade-off

② λ_Ridge is calculated via cross validation

③ λ_Ridge = 0 fits best in-sample but not out-of-sample

Figure. 3. Square root of MSPE according to λ_Ridge
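A minimal sketch of choosing λ_Ridge by cross-validation, assuming scikit-learn's RidgeCV and synthetic data (RidgeCV's alpha corresponds to λ_Ridge here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=20.0, random_state=1)
X = StandardScaler().fit_transform(X)          # standardize before penalizing

# Evaluate a grid of candidate penalties by 10-fold CV and keep the best one
lambdas = np.logspace(-3, 3, 25)
model = RidgeCV(alphas=lambdas, cv=10).fit(X, y)

print(f"lambda_Ridge chosen by 10-fold CV: {model.alpha_:.4f}")
```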

⑸ Characteristics of Ridge regression solution

① Even when XᵀX is not invertible, adding λI makes XᵀX + λI invertible, so a solution can always be computed

② Each λ gives one estimator

③ λ → 0 : Overfitting. Reaches linear regression (OLS) solution

④ λ → ∞ : Underfitting. Coefficients w approach 0 (the penalty on nonzero coefficients dominates)

Application 1. Soft order constraints : Effectively turns an inequality constraint like ‖w‖² ≤ C into a penalty term; at the optimum the constraint is active (holds with equality)

Application 2. Weight decay : Treat λ‖w‖² as part of the error and apply the standard neural-net update methods

① Standard gradient descent : w(t+1) = w(t) − η∇E_in(w(t)); with the penalty λwᵀw added, the update becomes w(t+1) = (1 − 2ηλ)w(t) − η∇E_in(w(t)), so the weights "decay" before each gradient step
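A minimal NumPy sketch of the weight-decay update above, assuming E_in(w) = ‖Xw − y‖² and synthetic data; the iterates converge to the ridge solution:

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

eta, lam = 1e-3, 2.0                     # learning rate and penalty weight
w = np.zeros(X.shape[1])

for _ in range(5000):
    grad_Ein = 2.0 * X.T @ (X @ w - y)   # gradient of E_in(w) = ||Xw - y||^2
    w = (1.0 - 2.0 * eta * lam) * w - eta * grad_Ein   # decay the weights, then step

# The fixed point is the ridge closed-form solution (X'X + lam*I)^(-1) X'y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(np.allclose(w, w_ridge))
```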

Application 3. MAP (maximum a posteriori)

① Bayes rule : P(w | D) = P(D | w) P(w) / P(D)

② General MAP learning : Maximize P(w | D) ∝ P(D | w) P(w), treating P(D) as a constant in Bayes' rule

○ Assuming a normal prior : The weights other than the intercept w0 are assumed to be small and centered at 0

③ MAP learning in Ridge regression
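A sketch of the standard derivation, assuming Gaussian noise with variance σ² and an independent zero-mean Gaussian prior with variance τ² on each weight:

```latex
% Negative log-posterior (up to constants) under
%   y_i | x_i, w ~ N(w^T x_i, \sigma^2)  and  w_j ~ N(0, \tau^2):
-\log P(w \mid D)
  = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i\bigr)^2
  + \frac{1}{2\tau^2}\lVert w \rVert_2^2 + \text{const.}
% Minimizing this is the ridge objective with penalty weight
\lambda_{\text{Ridge}} = \sigma^2 / \tau^2
```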

Application 4. Comparison with other methods

Figure. 4. Comparison of predictive performance



4. Technique 2. Lasso regression

⑴ Overview

① Definition : Penalizes the absolute values of the coefficients to control model complexity. The penalty is a function of the weights

② Also called L1 regularization

③ Equivalent to MAP learning with a Laplacian prior on the weights

Figure. 5. Laplace probability density function

⑵ Objective function

① Simple form : Minimize Σᵢ (Yᵢ − b₁X₁ᵢ − ··· − b_kX_kᵢ)² + λ Σⱼ |bⱼ| over b

② Matrix form : PRSS(β; λ) = (Y − Xβ)ᵀ(Y − Xβ) + λ‖β‖₁

⑶ Solution of objective function : Examine MSPE as a function of λ_Lasso

Figure. 6. Square root of MSPE according to λ_Lasso

① λ_Lasso is calculated via cross validation

② Unlike Ridge, no general closed-form solution
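A minimal sketch, assuming scikit-learn's LassoCV (coordinate descent over a path of penalties, with λ_Lasso picked by 10-fold CV) and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=20.0, random_state=1)
X = StandardScaler().fit_transform(X)

# No closed form: the path of solutions is computed numerically,
# and the penalty with the best 10-fold CV score is kept
model = LassoCV(cv=10, random_state=0).fit(X, y)

print(f"lambda_Lasso chosen by CV: {model.alpha_:.4f}")
print(f"Nonzero coefficients: {np.sum(model.coef_ != 0)} of {len(model.coef_)}")
```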

⑷ Characteristics

① Useful when model has sparsity property : i.e., many coefficients are 0

② λ → 0 : Reaches linear regression (OLS) solution. Best in-sample but poor out-of-sample

③ λ → ∞ : Coefficients w approach 0 (the penalty on nonzero coefficients dominates)

Application 1. Principle of sparsity

① The Laplace prior sets the coefficients of unimportant variables exactly to 0 : Effectively removes unimportant variables from the model

Figure. 7. Changes in coefficients with shrinkage factor

② Diagram of the sparsity principle

Figure. 8. Intuitive understanding of Lasso regression’s sparsity

○ Red ellipses connect points of equal MSE (mean squared error)

○ Blue region connects points with equal penalty

○ As λ increases, the penalty grows, so both the Lasso and Ridge solutions shrink toward the origin

○ In Ridge, the optimum occurs where a red ellipse touches the circular blue region : Otherwise a solution closer to the origin, with a smaller penalty, would still be available

○ In Lasso, when the blue region is small, the optimal solution tends to occur at a corner where some coefficients are exactly 0 : Moving away from this sharp point along an edge exits the red ellipse (→ higher MSE)

○ Unlike Ridge, Lasso induces sparsity
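To make this contrast concrete, a minimal sketch comparing Lasso and Ridge fits with the penalty weight set to 25 in both cases, assuming scikit-learn and synthetic data in which only 5 of 30 regressors matter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# 30 regressors, only 5 of which truly enter the model
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=25.0).fit(X, y)
ridge = Ridge(alpha=25.0).fit(X, y)

# Lasso drives many coefficients exactly to 0; Ridge only shrinks them toward 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0), "of 30")
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0), "of 30")
```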

Application 2. Comparison with other methods

Figure. 9. Comparison of predictive performance



5. Technique 3. Elastic Net

⑴ A linear combination of Lasso and Ridge : Adds both the sum of absolute values and the sum of squared values of the weights as penalty terms

Parameter 1. Alpha (α) : Controls mix ratio of L1 and L2 penalties. α = 1 is Lasso, α = 0 is Ridge

Parameter 2. Lambda (λ) : Controls strength of penalty. Multiplies the entire regularization term
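A minimal sketch, assuming scikit-learn's ElasticNet; note that scikit-learn's l1_ratio plays the role of α above (the L1/L2 mix) and its alpha plays the role of λ (overall strength):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=15.0, random_state=0)
X = StandardScaler().fit_transform(X)

# l1_ratio = 1 reduces to Lasso, l1_ratio = 0 reduces to Ridge
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

print("Nonzero coefficients:", (model.coef_ != 0).sum(), "of", X.shape[1])
```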



6. Technique 4. SelectFromModel

⑴ A scikit-learn method that selects variables based on the importance weights of a fitted model, most commonly the feature importances of decision-tree-based algorithms
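A minimal sketch, assuming scikit-learn and a random-forest importance ranking as the tree-based selector (SelectFromModel also accepts any estimator exposing coef_ or feature_importances_, e.g. Lasso):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# Fit the tree ensemble, then keep only features above the importance threshold
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=200, random_state=0),
    threshold="median",            # keep features with above-median importance
)
X_selected = selector.fit_transform(X, y)

print("Selected features:", selector.get_support().sum(), "of", X.shape[1])
```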



Input: 2019.12.08 12:35

Edit: 2024.09.27 08:47
