
Chapter 16. Linear Regression Analysis



1. regression analysis

2. simple linear regression model

3. multiple linear regression model



1. regression analysis 

⑴ regression analysis: expressing a particular variable as a function of one or more other variables

① more precisely, y ~ X (assuming y ∈ ℝ)

○ a form of supervised learning

(note) classification : y ~ X (assuming | { y } | < ∞ )

② The variable being explained: most commonly called the dependent variable, but it has several other names

○ Response variable

○ Outcome variable

○ Target variable

○ Output variable

○ Predicted variable

③ The explaining variables: most commonly called independent variables, but they have several other names

○ Experimental variables

○ Explanatory variable

○ Predictor variable

○ Regressor

○ Covariate

○ Controlled variable

○ Managed variable

○ Exposure variable

○ Risk factor

○ Input variable

○ Feature

⑵ (comparison) cross-tabulation analysis and analysis of variance

① regression analysis: the independent variables are quantitative (measured) variables, and the dependent variable is a quantitative variable

○ regression analysis models how the dependent variable changes with the independent variables

○ proof of actual causality is not required, because the purpose is prediction itself

○ example : the length of uncalcified bone in a 6-year-old child predicts how much more the child will grow, but it does not cause that growth

② cross-tabulation analysis: the independent variables are categorical variables, and the dependent variable is a categorical variable

○ cross-tabulation analysis simply describes the association between variables

③ analysis of variance: the independent variables are categorical variables, and the dependent variable is a quantitative variable

⑶ simple regression analysis and multiple regression analysis

① simple regression analysis: regression with one independent variable

② multiple regression analysis: regression with two or more independent variables

⑷ Variable Selection Methods

① Forward Selection

○ Step 1. Start with a constant model that only includes the intercept

○ Step 2. Sequentially add independent variables considered important to the model

② Backward Elimination

○ Step 1. Start with a model that includes all candidate independent variables

○ Step 2. Remove variables one by one, starting with the one that has the least impact based on the sum of squares

○ Step 3. Continue removing independent variables until there are no more statistically insignificant variables

○ Step 4. Select the model at this stage

③ Stepwise Method

○ Stepwise Addition: Add variables as in forward selection, but if an existing variable loses its importance once a new variable is added, remove that variable

○ Stepwise Elimination: After each step, re-examine which variables can be removed, and stop when no variable remains to be added or removed
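The forward selection steps above can be sketched as follows. This is a minimal illustration with NumPy, not a reference implementation: the helper names `sse` and `forward_selection`, the simulated data, and the stopping rule (stop when the relative SSE reduction drops below `min_improvement`) are all assumptions made for the example. In practice the stopping rule is often an F-test, AIC, or BIC.

```python
import numpy as np

def sse(X, y):
    """Sum of squared residuals of an OLS fit of y on X (X already holds the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_selection(X, y, min_improvement=0.05):
    """Greedy forward selection starting from the intercept-only model.

    At each step, add the candidate column that lowers SSE the most, and stop
    when the relative SSE reduction is below min_improvement (hypothetical rule).
    """
    n, k = X.shape
    intercept = np.ones((n, 1))
    selected = []
    current = sse(intercept, y)              # Step 1: constant model
    while len(selected) < k:
        candidates = [j for j in range(k) if j not in selected]
        scores = {
            j: sse(np.hstack([intercept, X[:, selected + [j]]]), y)
            for j in candidates
        }
        best = min(scores, key=scores.get)   # Step 2: most useful variable
        if current - scores[best] < min_improvement * current:
            break
        selected.append(best)
        current = scores[best]
    return selected

# Simulated data: only columns 1 and 3 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
selected = forward_selection(X, y)
print(selected)
```

Backward elimination follows the same pattern in reverse: start from all columns and repeatedly drop the one whose removal raises SSE the least.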

⑸ Model Selection Criteria

① Overview

○ Method of penalizing the complexity of the model

○ Calculate AIC and BIC for all candidate models and select the model with the minimum value

② AIC (Akaike Information Criterion)

○ AIC = -2 ln (L) + 2p (where ln (L) is model fit, L is the likelihood function, p is the number of parameters)

○ An indicator showing the difference between the actual data distribution and the distribution predicted by the model

○ Lower values indicate better model fit

○ Becomes less accurate as the sample size increases

③ BIC (Bayesian Information Criterion)

○ BIC = -2 ln (L) + p ln n (where ln (L) is model fit, L is the likelihood function, p is the number of parameters, n is the number of data points)

○ Compensates for the inaccuracy of AIC as sample size increases

○ Penalizes more complex models more strongly as sample size increases
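A small numerical sketch of the two criteria, assuming a linear model with normal errors so that the maximized log-likelihood can be written in terms of SSE. The candidate models and their SSE values are hypothetical, and whether p should also count the error variance is a convention that differs between textbooks and software.

```python
import math

def gaussian_log_likelihood(sse, n):
    """Maximized log-likelihood of a normal-error linear model, with variance estimated by SSE / n."""
    sigma2 = sse / n
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)

def aic(log_l, p):
    return -2 * log_l + 2 * p

def bic(log_l, p, n):
    return -2 * log_l + p * math.log(n)

n = 100
# Hypothetical candidates: (name, SSE, number of parameters p)
candidates = [("small", 120.0, 2), ("medium", 100.0, 3), ("large", 92.0, 5)]

scores = {}
for name, model_sse, p in candidates:
    log_l = gaussian_log_likelihood(model_sse, n)
    scores[name] = (aic(log_l, p), bic(log_l, p, n))

best_aic = min(scores, key=lambda m: scores[m][0])
best_bic = min(scores, key=lambda m: scores[m][1])
print(best_aic, best_bic)
```

With these numbers AIC prefers the largest model while BIC prefers the medium one, because ln n exceeds 2 once n is larger than about 7, so BIC charges more per parameter.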



2. simple linear regression model 

⑴ definition: a simple regression analysis in which the dependence is modeled as a linear function

⑵ representation of data


drawing

Figure. 1. simple linear regression model


Yi = β0 + β1Xi + ui, i = 1, ···, n

① β0 : y intercept 

② β1 : slope or coefficient on X 

○ also called parameter, regression coefficient, weight, etc 

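○ intuitively, the relationship is more elastic the larger the absolute value of the slope

○ in microeconomics, elasticity is a ratio of percentage changes (the slope scaled by X/Y), and the price elasticity of demand is often multiplied by (-1) so that it is reported as a positive number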

③ types of regression lines

○ population regression line: regression line obtained from the characteristics of the population

○ fitted regression line: regression line obtained from the characteristics of the sample

④ ui : residual

⑤ difference between residual and error


drawing

○ in the rest of this chapter, "error" in practice refers to the residual

⑥ characteristic of variance 

homoscedasticity: VAR(ui | Xi) does not depend on Xi. Often an unrealistic assumption, yet it is the default setting in many statistical programs

heteroscedasticity: VAR(ui | Xi) depends on Xi

○ a model whose errors are homoscedastic is considered a good model

⑶ assumptions 

assumption 1. Xi does not provide any information about the error


drawing

○ if the residual plot shows a pattern, the model is not a good model

assumption 2. (Xi, Yi) are i.i.d.

assumption 3. finite fourth moments exist


drawing

⑷ derivation of the fitted regression line

method 1. method of moments (MOM) estimator, also called sample analog estimation

○ calculation process


drawing

method 2. method of least squares or ordinary least squares (OLS)

○ definition: choose the coefficients that minimize the sum of squared errors (SSE)

○ available in virtually all statistical software

○ calculation process : if Xi is one-dimensional


drawing

○ the method of least squares coincides with maximum likelihood estimation when the errors are homoscedastic and normally distributed


drawing

○ the regression of Y on X and the regression of X on Y are generally not the same

○ E(X²), E(XY), E(X), etc. are involved in the regression of Y on X

○ E(Y²), E(XY), E(Y), etc. are involved in the regression of X on Y

○ the asymmetry comes from E(X²) and E(Y²)
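The closed-form least squares solution for one regressor, and the asymmetry between the two regression directions, can be checked with a few lines of plain Python. The function name `ols_simple` and the five data points are made up for illustration.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx                  # sample covariance / sample variance of x
    return mean_y - slope * mean_x, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0_yx, b1_yx = ols_simple(x, y)   # regression of y on x: divides by the variation of x
b0_xy, b1_xy = ols_simple(y, x)   # regression of x on y: divides by the variation of y

print(b0_yx, b1_yx)
print(b0_xy, b1_xy)
print(1 / b1_yx)                  # slope of the y-on-x line rewritten as x = f(y)
```

The two fitted lines differ because one slope divides the cross term by the variation of x and the other by the variation of y. Their product equals the squared sample correlation, so the lines coincide only when the correlation is ±1.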

method 3. cross entropy 

○ general definition 


drawing

○ binary classification


drawing

○ if y is represented as a one-hot vector [0, ···, 1, ···, 0], the following holds


drawing
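A minimal sketch of the cross entropy definitions above, assuming the usual form H(p, q) = −Σ p_k log q_k with natural logarithms. The function names and the probability vectors are illustrative only.

```python
import math

def cross_entropy(p_true, q_pred):
    """H(p, q) = -sum_k p_k * log(q_k); terms with p_k == 0 contribute nothing."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

def binary_cross_entropy(y, q):
    """Cross entropy for one binary label y in {0, 1} and predicted probability q = P(y = 1)."""
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

one_hot = [0, 0, 1, 0]                 # the true class is index 2
predicted = [0.1, 0.2, 0.6, 0.1]       # predicted class probabilities

loss = cross_entropy(one_hot, predicted)
print(loss, -math.log(0.6))
```

With a one-hot target the sum collapses to the single term −log q_k for the true class k, which is why the two printed values agree.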

⑸ characteristics of the regression line

① unbiasedness


drawing

② efficiency

○ Gauss-Markov theorem: under homoscedasticity, OLS is the best linear unbiased estimator (BLUE), i.e., it has the smallest variance among linear unbiased estimators

③ consistency


drawing

④ asymptotic normality

○ slope


drawing

○ y intercept - heteroscedasticity-robust standard error


drawing

○ y intercept - homoscedasticity-only standard error


drawing

⑹ evaluation of the regression line

criterion 1. linearity 

criterion 2. homoscedasticity : residual terms having equal variances

criterion 3. normality : residual terms following normal distribution

○ Box-Cox : In cases where it is difficult to assume normality in linear regression models, this method transforms the dependent variable to be closer to a normal distribution.
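The Box-Cox transformation mentioned above has a simple closed form for a given parameter λ. The sketch below assumes the standard one-parameter definition, (y^λ − 1)/λ for λ ≠ 0 and ln y for λ = 0, and does not estimate λ. In practice λ is usually chosen by maximum likelihood, for example with `scipy.stats.boxcox`.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)           # limiting case as lambda -> 0
    return (y ** lam - 1) / lam

values = [1.0, 4.0, 9.0, 16.0]
print([box_cox(v, 0.5) for v in values])   # square-root-type transform
print([box_cox(v, 0.0) for v in values])   # log transform
```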

⑺ coefficient of determination : also called R-squared

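① coefficient of determination R²

○ definition


SST = Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)² + 2Σ(Ŷi − Ȳ)(Yi − Ŷi) = SSR + SSE

R² = SSR / SST = 1 − SSE / SST

○ SST : total variation

○ SSR : variation explained by the regression equation

○ SSE : variation due to error

○ SSE is also called the residual sum of squares (RSS) or the sum of squared residuals (in that usage the abbreviation SSR collides with the regression sum of squares above)

○ reason why the cross term is 0: under OLS, the fitted values and the residuals are uncorrelated in the sample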

○ meaning

meaning 1. the proportion of the variance of Y that X can describe (no units)

meaning 2. the sum of squares described by the regression line ÷ the total sum of squares

② in simple linear regression, the coefficient of determination is the same as the square of the correlation coefficient


drawing

③ fraction of variance unexplained (FVU)


FVU = SSE / SST = 1 − R²

④ characteristics 

○ 0 ≤ R² ≤ 1

○ the closer R² is to 1, the better the regression line fits the data

○ estimator of β1 = 0 ⇒ R² = 0

○ R² = 0 ⇒ estimator of β1 = 0 or Xi = constant


drawing
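The decomposition of total variation and the identity between R² and the squared correlation can be verified numerically. The sketch below uses plain Python and a made-up five-point data set. `variation_parts` is a hypothetical helper name.

```python
def variation_parts(x, y):
    """Fit y = b0 + b1 * x by least squares and return (SST, SSR, SSE)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * a for a in x]
    sst = sum((b - mean_y) ** 2 for b in y)                 # total variation
    ssr = sum((f - mean_y) ** 2 for f in fitted)            # explained by the line
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))      # left in the residuals
    return sst, ssr, sse

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

sst, ssr, sse = variation_parts(x, y)
r2 = ssr / sst

# Squared sample correlation, computed independently of the fit.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
corr2 = s_xy ** 2 / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

print(sst, ssr, sse, r2, corr2)
```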

⑻ standard error of the regression

① formula for SSE 


drawing

② the expected value of SSE


drawing

○ total degrees of freedom = residual degrees of freedom + regression degrees of freedom

○ total degrees of freedom = n-1

○ regression degrees of freedom = 1 (there is only one regressor)

○ residual degrees of freedom = n-2

③ mean squared error (MSE)


MSE = SSE / (n − 2)

④ standard error of the regression (SER)


SER = √MSE = √(SSE / (n − 2))
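As a quick check of the degrees-of-freedom bookkeeping, the sketch below computes MSE and SER from a given SSE. The numbers (SSE = 2.4 from n = 5 observations) are illustrative, and the function name is an assumption. The parameter k generalizes the divisor to n − k − 1 for k regressors.

```python
import math

def standard_error_of_regression(sse, n, k=1):
    """SER = sqrt(SSE / (n - k - 1)); k = 1 gives the n - 2 divisor of simple regression."""
    dof = n - k - 1
    if dof <= 0:
        raise ValueError("not enough observations for the residual degrees of freedom")
    return math.sqrt(sse / dof)

sse, n = 2.4, 5
mse = sse / (n - 2)                              # residual degrees of freedom = n - 2
ser = standard_error_of_regression(sse, n)
print(mse, ser)
```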

⑼ example 1. the etymology of regression


drawing

Figure. 2. the etymology of regression


① X : a father’s height

② Y : a son’s height

③ E(X) = 67.7, E(Y) = 68.7, σX = 2.7, σY = 2.7, ρXY = 0.5 

④ E(Y | X = 80) = 74.85


drawing

⑤ E(Y | X = 60) = 64.85


drawing

⑥ conclusion

○ the son of a tall father tends to be shorter than his father

○ the son of a short father tends to be taller than his father

○ as a result, the sons' heights tend to regress toward the mean

○ however, since this tendency concerns only the expected value, the variance of height in the sons' generation is not necessarily lower than the variance in the fathers' generation
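The two conditional expectations in this example follow from the population regression line E(Y | X = x) = E(Y) + ρXY (σY / σX)(x − E(X)), which is the conditional mean when (X, Y) is assumed to be bivariate normal. A short check with the numbers given above (the function name is illustrative):

```python
def conditional_mean(x, mean_x, mean_y, sd_x, sd_y, rho):
    """E(Y | X = x) along the population regression line of Y on X."""
    slope = rho * sd_y / sd_x
    return mean_y + slope * (x - mean_x)

# Values from the example: E(X) = 67.7, E(Y) = 68.7, sd 2.7 each, correlation 0.5
tall_father = conditional_mean(80, 67.7, 68.7, 2.7, 2.7, 0.5)
short_father = conditional_mean(60, 67.7, 68.7, 2.7, 2.7, 0.5)
print(round(tall_father, 2), round(short_father, 2))
```

Because the slope ρXY σY / σX is 0.5 here, a father's deviation from the mean is on average only half inherited by his son, which is the regression toward the mean described in the conclusion.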

⑽ example 2. predicting Y-values outside the range of independent variables: also called extrapolation

① generally, extrapolation is inadvisable


drawing

Figure 3. problems of extrapolation


② extrapolation methodology is not always wrong

○ example: research on biological evolution

⑾ Example problems for linear regression and bivariate normal distribution

⑿ Python code


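from sklearn import linear_model

# Fit ordinary least squares on three observations with two features.
# The two feature columns are identical (perfectly collinear), so the
# reported coefficients split the total slope of 1.0 evenly between them.
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
# LinearRegression()
reg.coef_
# array([0.5, 0.5])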



3. multiple linear regression model  

⑴ definition: a multiple regression analysis in which the dependence is modeled as a linear function

⑵ omitted variable bias 

① definition : a phenomenon in which the expected value of the error is not zero due to omitted variables


drawing

○ endogenous variable: variables correlated with ui  

○ exogenous variable: variables uncorrelated with ui  

② condition 1. the omitted variable is correlated with a regressor (e.g., Xi)

③ condition 2. the omitted variable is a determinant of Y

④ the convergent value of the slope


drawing

○ ρXu > 0 : upward bias

○ ρXu < 0 : downward bias

⑤ if the value of a coefficient changes significantly when a new variable is added, omitted variable bias is likely to be present
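Omitted variable bias can be illustrated with a small simulation in plain Python. The data-generating process below (true coefficient 1.0 on x1, an omitted variable x2 with coefficient 2.0 that is positively correlated with x1) is entirely hypothetical. With these settings the slope of the short regression should converge to 1.0 + 2.0 · Cov(x1, x2) / Var(x1) = 3.0 rather than to 1.0.

```python
import random

def slope(x, y):
    """Least squares slope of y on x (single regressor with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    return s_xy / s_xx

random.seed(0)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 1) for a in x1]          # omitted variable, correlated with x1
y = [1.0 * a + 2.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

short_slope = slope(x1, y)                          # regression that omits x2
print(short_slope)
```

Both conditions hold here: x2 is correlated with x1 and it is a determinant of y. The correlation between x1 and the error of the short regression is positive, so the bias is upward.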

⑶ representation of data


drawing

① the estimators above are unbiased, consistent, and asymptotically jointly normal

② robustness : adding a new regressor does not significantly change the slope of any existing regressor

③ sensitivity : adding a new regressor significantly changes the slope of a particular regressor

⑷ assumptions

assumption 1. the error is not explained by X1i, ···, Xki

assumption 2. (X1i, ···, Xki, Yi) are i.i.d.

assumption 3. finite fourth moments exist


drawing

④ assumption 4. no perfect multicollinearity 

○ multicollinearity : a situation in which one independent variable is highly correlated with a linear combination of the other independent variables

○ (note) a multiple linear regression model expects the independent variables to be independent of one another

○ perfect multicollinearity: one regressor is an exact linear function of the other regressors; the determinant of XᵀX is then 0

○ perfect multicollinearity is not the nature of a variable, but the nature of a data set


drawing

○ when regression is attempted on perfectly multicollinear data, infinitely many coefficient vectors fit equally well, so the regression analysis cannot be carried out

○ imperfect multicollinearity : two or more regressors are highly, but not perfectly, correlated

○ it does not by itself prevent estimation

○ the variance of the slope estimator becomes large → the slope estimator is difficult to trust


drawing

Figure. 4. the reason why multi-collinearity increases the variance of the slope estimator
⒝ a variety of planes may exist within the significance interval


○ in general, a pair of variables should not have a correlation of more than 0.9

○ solution

○ draw pairwise plots of all combinations and remove highly correlated variables

○ PCA, weighted sum, etc. may be attempted, but each has its own shortcomings

○ (note) when given perfectly multicollinear data, R (e.g., lm in RStudio) drops one of the problematic terms, typically the last one, and reports its coefficient as NA
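The pairwise screening rule above (flag pairs whose correlation exceeds 0.9) can be sketched in plain Python. The column names and values are made up, and pairwise correlation only catches collinearity between two variables, not between one variable and a combination of several others.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical regressors: the same height recorded in two units, plus age.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 25, 41, 38, 22],
}
flagged = highly_correlated_pairs(data)
print(flagged)
```

Here only the two height columns are flagged, so one of them would be dropped before fitting.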

⑸ OLS estimator: determine the coefficients by solving the following simultaneous equations


drawing
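In matrix form the simultaneous (normal) equations are XᵀXβ = XᵀY. The sketch below solves them with NumPy on simulated data and compares the result with NumPy's least squares routine. The true coefficients (1, 2, −3) and the noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 1.0 + 2.0 * X1 - 3.0 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), X1, X2])

# Solve the normal equations (X'X) beta = X'y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same fit through a numerically more stable least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Solving the normal equations directly requires XᵀX to be invertible, which is exactly the no-perfect-multicollinearity assumption. `lstsq` is usually preferred in practice because it also copes with nearly singular designs.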

⑹ characteristics of the regression line


drawing

① unbiasedness

② consistency 

③ asymptotic joint normality

④ Frisch-Waugh theorem 


drawing
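The Frisch-Waugh theorem (also known as Frisch-Waugh-Lovell) says that the coefficient on X1 in the full regression equals the slope obtained after partialling the other regressors out of both Y and X1. The following NumPy sketch checks this on simulated data. The variable names and the data-generating process are assumptions.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 300
X2 = rng.normal(size=n)
X1 = 0.7 * X2 + rng.normal(size=n)          # X1 is correlated with X2
y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(size=n)

const = np.ones(n)

# Full regression of y on (1, X1, X2).
full = ols(np.column_stack([const, X1, X2]), y)

# Step 1: remove the part of X1 and of y explained by (1, X2).
Z = np.column_stack([const, X2])
x1_tilde = X1 - Z @ ols(Z, X1)
y_tilde = y - Z @ ols(Z, y)

# Step 2: regress residual on residual (both have mean zero, so no intercept).
fwl_slope = (x1_tilde @ y_tilde) / (x1_tilde @ x1_tilde)

print(full[1], fwl_slope)
```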

⑺ adjusted R²

① drawbacks of R² : it does not reflect the quality of fit well in a multiple regression model

drawback 1. R² never decreases when a new regressor is added, because the minimized SSE cannot increase

drawback 2. a high R² does not rule out omitted variable bias

drawback 3. a high R² does not show that the current set of regressors is optimal

○ adjusted R² is introduced to resolve drawback 1

② formula


adjusted R² = 1 − [(n − 1) / (n − k − 1)] × (SSE / SST)

③ characteristic 

○ adjusted R² ≤ R²

○ adjusted R² can be negative

○ the value decreases when inappropriate variables are added
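The three characteristics can be checked with a one-line function, assuming the common definition adjusted R² = 1 − (n − 1)/(n − k − 1) · (1 − R²) with k regressors excluding the intercept. The R² values below are invented for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k regressors (intercept not counted)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

base = adjusted_r2(0.500, n=21, k=2)          # two useful regressors
with_noise = adjusted_r2(0.501, n=21, k=3)    # a third regressor barely raises R^2
weak = adjusted_r2(0.05, n=11, k=3)           # poor fit, small sample

print(base, with_noise, weak)
```

The second value is lower than the first even though R² itself rose slightly, and the third is negative.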

⑻ standard error of the regression (SER): k is the number of independent variables in the regression equation


SER = √(SSE / (n − k − 1))

⑼ joint hypothesis: a hypothesis that imposes two or more restrictions

① idea 1. t1 and t2 are independent


drawing

② idea 2. t1 and t2 are correlated (the regressors are multicollinear)


drawing

③ general case


drawing

○ in general, the heteroscedasticity-robust F-statistic is used

○ many statistical programs use the homoscedasticity-only F-statistic as the default setting

④ null hypothesis 


drawing
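For reference, the homoscedasticity-only F-statistic that many programs report by default has a simple closed form based on the restricted and unrestricted fits. The sketch below assumes that textbook form, with q restrictions and k regressors in the unrestricted model. It is not the heteroscedasticity-robust version, and all numbers are hypothetical.

```python
def f_statistic(sse_restricted, sse_unrestricted, q, n, k):
    """Homoscedasticity-only F: ((SSE_r - SSE_u) / q) / (SSE_u / (n - k - 1))."""
    return ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))

def f_statistic_from_r2(r2_restricted, r2_unrestricted, q, n, k):
    """Equivalent form written with R^2 when both models share the same SST."""
    return ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / (n - k - 1))

# Hypothetical fits: dropping q = 2 regressors raises SSE from 100 to 120 (SST = 200).
f_from_sse = f_statistic(120.0, 100.0, q=2, n=103, k=2)
f_from_r2 = f_statistic_from_r2(0.4, 0.5, q=2, n=103, k=2)
print(f_from_sse, f_from_r2)
```

Under the null hypothesis and homoscedastic normal errors this statistic follows an F distribution with (q, n − k − 1) degrees of freedom.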

⑽ rewriting the multiple linear regression model to test a restriction


drawing

① to test H0 : β1 = β2,


drawing

② to test H0 : β1 + β2 = 1,


drawing

⑾ conditional mean independence

① definition


drawing

② given X2i, X1i is uncorrelated with ui

③ the estimator of β2 may be inconsistent, but this is not important because X2i only serves as a control variable

⑿ matrix notation

① linear regression model

○ for a scalar Y, column vector X, and β, 


drawing

○ generalization


drawing

② assumption

assumption 1. E(ui | Xi) = 0

assumption 2. (Xi, Yi), i = 1, ···, n is i.i.d. 

assumption 3. Xi and ui have nonzero finite fourth moment 

assumption 4. 0 < E(XiXiᵀ) < ∞, no perfect multicollinearity

③ OLS modeling - a simple version


drawing

④ OLS modeling


drawing

⑤ consistency


drawing

⑥ multivariate central limit theorem


drawing

⑦ asymptotic normality


drawing

⑧ robust standard error (Eicker-Huber-White standard error)


drawing

⑨ robust F


drawing
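The matrix-form OLS estimator and the robust standard error of ⑧ can be sketched with NumPy. The sketch assumes the plain sandwich form (XᵀX)⁻¹ Xᵀ diag(ûi²) X (XᵀX)⁻¹, often labeled HC0, without any finite-sample correction. Statistical packages frequently apply a degrees-of-freedom adjustment on top of it. The simulated data make the error variance depend on x on purpose.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Return OLS coefficients with homoscedasticity-only and HC0 robust standard errors."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Homoscedasticity-only: one common error variance for all observations.
    sigma2 = resid @ resid / (n - p)
    se_homo = np.sqrt(np.diag(sigma2 * xtx_inv))

    # Eicker-Huber-White sandwich: each observation keeps its own squared residual.
    meat = X.T @ (X * resid[:, None] ** 2)
    se_robust = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, se_homo, se_robust

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
u = rng.normal(size=n) * np.abs(x)      # heteroscedastic: Var(u | x) = x^2
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
beta, se_homo, se_robust = ols_with_standard_errors(X, y)
print(beta, se_homo, se_robust)
```

In this design the homoscedasticity-only formula understates the uncertainty of the slope, which is why robust standard errors are the safer default when heteroscedasticity cannot be ruled out.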



Input: 2019.06.20 23:26
