Lecture 18. Regularization in Regression Analysis (regularization, penalization)
Recommended post : 【Statistics】 Statistics Table of Contents
1. Overview
2. MSPE
3. Technique 1. Ridge regression
4. Technique 2. Lasso regression
5. Technique 3. Elastic net
6. Technique 4. SelectFromModel
1. Overview
⑴ Problems in regression analysis : Mainly prominent when there are many regression variables
② Underfitting : The model lacks flexibility and cannot properly learn the given data
③ Overfitting
○ In standard regression like OLS estimation, the model learns the noise in the sample, reducing predictive power
○ However, deliberately accepting some bias during training (shrinkage) can actually improve predictive power
⑵ Regularization (penalization)
① To mitigate these problems, a penalty term on the parameters (coefficients) is added to the objective function
② Note that not applying regularization causes overfitting, while too much causes underfitting
③ Standardization of data must be performed
○ Features measured on a small scale tend to have large coefficients, which may be over-penalized and shrink too much
○ Conversely, features measured on a large scale tend to have small coefficients and may be under-penalized
④ Often includes tuning hyperparameters (e.g. the penalty weight λ) using a validation set
⑤ Expected results of regularization
Figure. 1. Expected results of regularization
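A minimal sketch of the idea above using scikit-learn (the simulated data, variable names, and penalty values are illustrative assumptions, not from the original post): an OLS fit and Ridge fits with a small and a very large penalty are compared on held-out data, showing how too little regularization overfits and too much underfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, k = 100, 50                       # small sample, many regressors
X = rng.normal(size=(n, k))
beta = np.zeros(k); beta[:5] = 1.0   # only 5 variables truly matter
y = X @ beta + rng.normal(scale=2.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)               # standardize using training data only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

for name, model in [("OLS (no penalty)", LinearRegression()),
                    ("Ridge, moderate lambda", Ridge(alpha=1.0)),
                    ("Ridge, huge lambda", Ridge(alpha=1e4))]:
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: out-of-sample root MSPE = {rmse:.2f}")
```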
2. MSPE
⑴ Overview
① Error : With hypothesis h, true function f, and squared error e, the error is e(h(x), f(x)) = (h(x) − f(x))²
② Type 1. In-sample error : Also called training error. Similar to bias
③ Type 2. Out-of-sample error : Also called generalization error, MSPE. Similar to variance
○ Step 1. Build prediction model using given sample
○ Step 2. Compare predicted and actual values using data outside the sample (X^OOS, Y^OOS)
○ Note: ŷ refers to prediction obtained using in-sample data
④ (Reference) bias-variance tradeoff
⑤ Best predictor : Called the oracle predictor, E(Y^OOS | X^OOS)
○ The prediction error entering the MSPE decomposes into two parts, as follows
○ Fundamental error : Y^OOS − E(Y^OOS | X^OOS). Cannot be improved upon
○ Estimation error : Ŷ(X^OOS) − E(Y^OOS | X^OOS)
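Written out, the decomposition is the standard one below; the cross term vanishes because the fundamental error has conditional mean zero and Ŷ is built only from the in-sample data.

```latex
\mathrm{MSPE} = E\!\left[\bigl(Y^{OOS} - \hat{Y}(X^{OOS})\bigr)^{2}\right]
= \underbrace{E\!\left[\bigl(Y^{OOS} - E(Y^{OOS}\mid X^{OOS})\bigr)^{2}\right]}_{\text{fundamental error}}
+ \underbrace{E\!\left[\bigl(E(Y^{OOS}\mid X^{OOS}) - \hat{Y}(X^{OOS})\bigr)^{2}\right]}_{\text{estimation error}}
```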
⑵ MSPE estimator
① If β were known, MSPE = σ_u² (the variance of the regression error) would hold
② With estimated coefficients, estimation error is added; for OLS the MSPE is approximately σ_u²(1 + k/n), which is substantially larger than σ_u² when k/n is large
⑶ Assumptions
① Assumption 1. No multicollinearity
② Assumption 2. (X^OOS, Y^OOS) are randomly drawn from the same population distribution as the in-sample data
⑷ Transformations
① Standardization
○ (X*_1i, ···, X*_ki, Y*_i) are the values from the original sample
○ Each regressor is standardized : X_ji = (X*_ji − μ_X*j) / σ_X*j
○ The dependent variable is demeaned : Y_i = Y*_i − μ_Y*
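A short numpy sketch of this transformation (the function name and arguments are illustrative): each regressor column is standardized and the dependent variable is demeaned, as described above.

```python
import numpy as np

def standardize_for_regularization(X_star, y_star):
    """Standardize regressors and demean the dependent variable."""
    X = (X_star - X_star.mean(axis=0)) / X_star.std(axis=0)  # X_ji = (X*_ji - mean) / sd
    y = y_star - y_star.mean()                                # Y_i = Y*_i - mean
    return X, y
```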
② Principle of shrinkage
○ Can reduce MSPE
○ Bias occurs instead : Tradeoff
○ Most famous example is James-Stein estimator
⑸ Estimating MSPE from the in-sample data : m-fold cross validation is commonly used
① 1st. Split given sample into m parts
② 2nd. Use m-1 parts to estimate parameters : Training data
③ 3rd. Use the remaining part to evaluate performance : Testing data
④ 4th. Repeat m times with different combinations
⑤ 5th. Average the m performance estimates to obtain the final MSPE estimate
⑥ Typically 10-fold cross validation is used
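A sketch of the m-fold procedure above with scikit-learn, assuming a Ridge model and a grid of candidate penalties (the data and grid are illustrative): the λ with the smallest cross-validated MSPE would be selected.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# X, y: standardized in-sample data (see the transformation above); simulated here
rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 20)), rng.normal(size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # m = 10 folds
for lam in [0.01, 0.1, 1.0, 10.0]:
    # cross_val_score returns negative MSE; average over the 10 folds
    mse = -cross_val_score(Ridge(alpha=lam), X, y,
                           scoring="neg_mean_squared_error", cv=cv).mean()
    print(f"lambda = {lam:>5}: CV estimate of MSPE = {mse:.3f}")
```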
⑹ Out-of-sample root MSPE calculation
① Use model trained on in-sample data to evaluate performance on different sample
② This different sample is called validation set
3. Technique 1. Ridge regression
⑴ Overview
① Definition : Penalizes the sum of squared coefficients (weights) to control model complexity; the penalty is a function of the weights
② Also called L2 regularization
③ Introduced in 1962 by A. E. Hoerl to solve non-invertibility of the regression matrix
⑵ Objective function
① Simple form
② PRSS (penalized residual sum of squares)
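The PRSS can be written in the standard summation form below (b denotes candidate coefficients and λ_Ridge the penalty weight):

```latex
S^{Ridge}(b;\ \lambda_{Ridge})
  = \sum_{i=1}^{n}\bigl(Y_i - b_1 X_{1i} - \cdots - b_k X_{ki}\bigr)^{2}
  + \lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}
```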
⑶ Case 1. Regression variables are uncorrelated
① Simple form : Can be expressed relative to the OLS estimate β̂_j obtained when λ = 0 (see the expressions below)
② Matrix form : Ridge objective function is convex, allowing easy solution via differentiation
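Writing the objective in matrix form and setting its gradient to zero gives the standard closed-form solution; under the additional assumption of standardized, uncorrelated regressors with XᵀX = I (Case 1 above), each Ridge coefficient is the OLS coefficient shrunk by a constant factor.

```latex
S^{Ridge}(b;\lambda_{Ridge}) = (Y - Xb)^{\mathsf T}(Y - Xb) + \lambda_{Ridge}\, b^{\mathsf T} b
\ \Rightarrow\
\hat{\beta}^{Ridge} = \bigl(X^{\mathsf T}X + \lambda_{Ridge} I_k\bigr)^{-1} X^{\mathsf T} Y,
\qquad
\hat{\beta}^{Ridge}_j = \frac{\hat{\beta}^{OLS}_j}{1 + \lambda_{Ridge}}
\ \ \text{(when } X^{\mathsf T}X = I\text{)}
```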
⑷ Case 2. Regression variables are correlated : Must examine MSPE with respect to λRidge
① Bias-variance trade-off
Figure. 2. General bias-variance trade-off
② λRidge is calculated via cross validation
③ λRidge = 0 fits best in-sample but not out-of-sample
Figure. 3. Square root of MSPE according to λRidge
⑸ Characteristics of Ridge regression solution
① Even when XᵀX is not invertible, adding λI makes XᵀX + λI invertible, so the solution can still be computed
② Each λ gives one estimator
③ λ → 0 : Overfitting. Reaches linear regression (OLS) solution
④ λ → ∞ : Underfitting. Coefficients w approach 0 (∵ penalty on large coefficients)
⑹ Application 1. Soft order constraints : The inequality constraint wᵀw ≤ C on the weights is equivalent, for a suitable λ, to adding the penalty λwᵀw to the objective; at the optimum the constraint is active (holds with equality)
⑺ Application 2. Weight decay : Treat the penalty term λwᵀw as part of the error (the augmented error) and apply the standard neural-network update rule
① Standard gradient descent : w(t+1) = w(t) − η∇E_in(w(t))
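Assuming the penalty λwᵀw is simply added to the in-sample error (the augmented error), the same gradient-descent step yields the familiar weight-decay update:

```latex
E_{aug}(\mathbf{w}) = E_{in}(\mathbf{w}) + \lambda\,\mathbf{w}^{\mathsf T}\mathbf{w}
\ \Rightarrow\
\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\nabla E_{aug}(\mathbf{w}(t))
               = (1 - 2\eta\lambda)\,\mathbf{w}(t) - \eta\nabla E_{in}(\mathbf{w}(t))
```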
⑻ Application 3. MAP (maximum a posteriori)
① Bayes rule : P(w | D) = P(D | w) P(w) / P(D)
② General MAP learning : Since P(D) does not depend on w, it can be treated as a constant in Bayes rule, so maximizing the posterior is equivalent to maximizing P(D | w) P(w)
○ Assuming a normal prior : the weights w (typically excluding the intercept w0) are given a zero-mean Gaussian prior, i.e., they are expected a priori to be small
③ MAP learning in Ridge regression
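A sketch of the standard Gaussian-prior derivation (σ and σ_w below denote the assumed noise and prior standard deviations; they are not symbols from the original post): taking the negative log of P(D | w)P(w) shows that the MAP estimate solves a Ridge problem with λ = σ²/σ_w².

```latex
\hat{\mathbf{w}}_{MAP}
 = \arg\max_{\mathbf{w}} P(D\mid \mathbf{w})\,P(\mathbf{w})
 = \arg\min_{\mathbf{w}} \sum_{i=1}^{n}\bigl(y_i - \mathbf{w}^{\mathsf T}\mathbf{x}_i\bigr)^{2}
   + \frac{\sigma^{2}}{\sigma_w^{2}}\,\lVert\mathbf{w}\rVert^{2}
```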
⑼ Application 4. Comparison with other methods
Figure. 4. Comparison of predictive performance
4. Technique 2. Lasso regression
⑴ Overview
① Definition : Penalizes the sum of the absolute values of the coefficients (weights) to control model complexity; the penalty is a function of the weights
② Also called L1 regularization
Figure. 5. Laplace probability density function
⑵ Objective function
① Simple form
② Matrix form
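The standard simple and matrix forms of the Lasso objective, in the same notation as the Ridge objective above:

```latex
S^{Lasso}(b;\ \lambda_{Lasso})
  = \sum_{i=1}^{n}\bigl(Y_i - b_1 X_{1i} - \cdots - b_k X_{ki}\bigr)^{2}
  + \lambda_{Lasso}\sum_{j=1}^{k}\lvert b_j\rvert
  = (Y - Xb)^{\mathsf T}(Y - Xb) + \lambda_{Lasso}\,\lVert b\rVert_{1}
```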
⑶ Solution of objective function : Compute MSPE with respect to λLasso
Figure. 6. Square root of MSPE according to λLasso
① λLasso is calculated via cross validation
② Unlike Ridge, no general closed-form solution
⑷ Characteristics
① Useful when model has sparsity property : i.e., many coefficients are 0
② λ → 0 : Reaches linear regression (OLS) solution. Best in-sample but poor out-of-sample
③ λ → ∞ : Coefficients w approach 0 (∵ penalty on large coefficients)
⑸ Application 1. Principle of sparsity
① The Laplace prior (equivalently, the L1 penalty) shrinks the coefficients of unimportant variables exactly to 0 : This effectively removes them from the model
Figure. 7. Changes in coefficients with shrinkage factor
② Diagram of sparsity principle
Figure. 8. Intuitive understanding of Lasso regression’s sparsity
○ Red ellipses connect points of equal MSE (mean squared error)
○ Blue region connects points with equal penalty
○ As λ increases, the penalty grows and both Lasso and Ridge coefficients shrink toward 0
○ In Ridge, the optimum occurs where a red ellipse just touches the circular blue region : if it did not, a solution nearer the origin would achieve a smaller penalty
○ In Lasso, when the diamond-shaped blue region is small enough, the optimum typically falls on a corner where some coefficients are exactly 0 : at this sharp point, moving along an edge leaves the red ellipse (→ higher MSE)
○ Unlike Ridge, Lasso induces sparsity
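A small scikit-learn sketch of the sparsity property (the simulated data and penalty values are illustrative): with the same penalty weight, Lasso sets most irrelevant coefficients exactly to 0 while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, k = 200, 30
X = rng.normal(size=(n, k))
beta = np.zeros(k); beta[:3] = [3.0, -2.0, 1.5]   # sparse true model: only 3 nonzero coefficients
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("Lasso: coefficients set exactly to 0 =", int(np.sum(lasso.coef_ == 0)), "of", k)
print("Ridge: coefficients set exactly to 0 =", int(np.sum(ridge.coef_ == 0)), "of", k)
```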
⑹ Application 2. Comparison with other methods
Figure. 9. Comparison of predictive performance
5. Technique 3. Elastic Net
⑴ A linear combination of the Lasso and Ridge penalties : Both the sum of absolute values and the sum of squared values of the weights are added as penalty terms
⑵ Parameter 1. Alpha (α) : Controls mix ratio of L1 and L2 penalties. α = 1 is Lasso, α = 0 is Ridge
⑶ Parameter 2. Lambda (λ) : Controls strength of penalty. Multiplies the entire regularization term
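A scikit-learn sketch (reusing the simulated X, y from the Lasso sketch above; parameter values are illustrative). Note that scikit-learn names the parameters differently from the α/λ convention used here: `l1_ratio` plays the role of α (the L1/L2 mix) and `alpha` plays the role of λ (the overall penalty strength).

```python
from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 -> pure Lasso penalty, l1_ratio = 0.0 -> pure Ridge penalty
model = ElasticNet(alpha=0.1, l1_ratio=0.5)   # alpha ~ lambda above, l1_ratio ~ alpha above
model.fit(X, y)                               # X, y: simulated data from the Lasso sketch
print(model.coef_)
```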
6. Technique 4. SelectFromModel
⑴ A scikit-learn method that selects variables based on the feature importances of a fitted model, commonly a tree-based model such as a random forest
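A scikit-learn sketch (the simulated data are illustrative): `SelectFromModel` keeps the variables whose importances from the fitted estimator, here a tree-based random forest, exceed a threshold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=300)   # only columns 0 and 3 matter

selector = SelectFromModel(RandomForestRegressor(n_estimators=200, random_state=0))
selector.fit(X, y)                        # fits the forest, then thresholds feature_importances_
print("selected columns:", np.where(selector.get_support())[0])
```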
Input: 2019.12.08 12:35
Edit: 2024.09.27 08:47