Chapter 18. Advanced Regression Analysis
1. validity
2. panel data
3. instrumental variable
4. randomized controlled experiment
5. quasi-experiment
6. heterogeneous population
1. validity
⑴ internal validity
① definition : a qualitative evaluation of whether each coefficient obtained from the regression analysis is credibly estimated
② threat 1. omitted variable bias
○ definition: if an omitted variable satisfies the following two conditions, the conditional expectation of the error term is not zero
○ condition 1. the omitted variable is correlated with one or more of the included regressors
○ condition 2. the omitted variable is a determinant of Y
○ example of the expected value of the error term
○ solution
○ include the omitted variable in the regression analysis
○ if no data on the omitted variable are available, the following three methods exist:
○ method 1. panel data: removes characteristics that do not change over time
○ method 2. instrumental variable regression: extracts only the exogenous variation in the regressor through instrumental variables
○ method 3. collecting new data from a randomized controlled experiment
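○ (supplement) a minimal LaTeX sketch of the standard omitted variable bias result (the Stock & Watson form), writing ρXu for corr(Xi, ui):
$$\hat{\beta}_1 \xrightarrow{\;p\;} \beta_1 + \rho_{Xu}\frac{\sigma_u}{\sigma_X}$$
○ the bias term does not shrink as n grows, which is why the remedies above are needed rather than more data alone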
③ threat 2. wrong functional form bias
○ definition: bias arising from fitting a linear regression to a nonlinear relationship
○ a kind of omitted variable bias
④ threat 3. errors-in-variable bias or measurement error in the regressors
○ definition : a regressor measured with error, X̃i, can be correlated with the error term vi
○ formula : with X̃i = Xi + wi, the model Yi = β0 + β1Xi + ui becomes Yi = β0 + β1X̃i + vi, where vi = ui − β1wi
○ issue 1. the iron law of econometrics: the OLS slope estimator is biased toward zero (attenuated) relative to the true value
○ issue 2. the OLS estimator is not consistent
○ issue 3. statistical inference is inaccurate
○ solution
○ method 1. improve the accuracy of the measuring instruments
○ method 2. instrumental variable regression: extracts only the error-free variation in the regressor through instrumental variables
○ method 3. error correction: correction is possible if the error follows a known pattern
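○ (supplement) a LaTeX sketch of the classical attenuation result, assuming the measurement error wi is uncorrelated with Xi and ui:
$$\hat{\beta}_1 \xrightarrow{\;p\;} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1$$
○ since σX²/(σX² + σw²) < 1, the slope is pulled toward zero, which is the "iron law" of issue 1 above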
○ (note) if there is a measurement error in the dependent variable
○ formula : with Ỹi = Yi + wi, the model Yi = β0 + β1Xi + ui becomes Ỹi = β0 + β1Xi + vi, where vi = ui + wi
○ the estimator of the slope does not change
○ satisfying the three major assumptions of a simple linear regression model
○ assumption 1. Xi provides no information about vi : E(vi | Xi) = 0
○ assumption 2. Xi and Ỹi are i.i.d.
○ since Yi and wi are i.i.d. and mutually independent, Ỹi is i.i.d.
○ since Xi is independent of Yj and wj for i ≠ j, Xi and Ỹi are independent across observations
○ therefore, assumption 2 is satisfied
○ assumption 3. existence of 4th order moment
○ because ui and wi have finite 4th order moments and are mutually independent, vi = ui + wi has a finite 4th order moment
○ thus, (Xi, vi) has nonzero finite 4th order moments
○ there are three differences from the errors-in-variables (regressor measurement error) case
○ difference 1. the OLS estimator is consistent
○ difference 2. statistical inference is accurate
○ difference 3. it increases the variance of the regression error → increases the variance of the OLS estimator
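○ (supplement) the contrast between measurement error in X and in Y can be checked with a short simulation; the following R sketch uses hypothetical parameters and is only illustrative:
# measurement error in X attenuates the slope; measurement error in Y only adds noise
set.seed(1)
n  <- 10000
x  <- rnorm(n)                   # true regressor, variance 1
u  <- rnorm(n)                   # regression error
y  <- 2 + 3 * x + u              # true slope = 3
wx <- rnorm(n)                   # measurement error in X, variance 1
wy <- rnorm(n)                   # measurement error in Y
coef(lm(y ~ I(x + wx)))[2]       # near 3 * 1/(1 + 1) = 1.5 (attenuated)
coef(lm(I(y + wy) ~ x))[2]       # still near 3, but with a larger standard error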
⑤ threat 4. sample selection bias
○ bias that occurs when the sample is selected by a process related to the value of the dependent variable
○ in other words, bias generated by inferring the characteristics of the whole population from a non-random part
○ example 1. recruitment rate as a function of factors A and B
○ assume that the recruitment rate increases as A and B increase
○ people with a low A factor do not want to apply
○ among those with a low A factor, only those with a high B factor apply
○ as a result, the regression of the employment rate on factor A underestimates the true effect of factor A
⑥ threat 5. simultaneous causality bias
○ it is natural that there is a causal link from the independent variable to the dependent variable
○ if there is also a causal link from the dependent variable back to the independent variable, the coefficient of the independent variable is biased
○ it behaves like a feedback loop, which makes the estimated relationship misleadingly strong or weak
○ positive feedback loop : increases the absolute value of the coefficients
○ negative feedback loop : decreases the absolute value of the coefficients
○ example : birth rate and mortality rate have a mutual causal relationship, similar to a positive feedback loop
○ solution
○ method 1. instrumental variable regression: extracts only the variation in the regressor that is free of the reverse causal link
○ method 2. randomized controlled experiment: eliminates the reverse causality by assigning the treatment randomly
⑵ external validity
① definition : a qualitative evaluation of whether the coefficients for each independent variable obtained from regression analysis are applicable to other populations
② threat 1. non-representative sample: difference in populations themselves
③ threat 2. non-representative program or policy: difference in system
○ different systems can violate external validity even if the population is the same
○ example : difference in educational environment, difference in laws and institutions, difference in physical environment, etc
④ threat 3. general equilibrium effect
○ definition: treatment changes the overall environment, which can amplify or suppress the effectiveness of treatment
○ similar to simultaneous causality bias
○ example : effect of the existence of oil fields on income
○ existence of oil fields → increase in workers’ income
○ increase in workers’ income → increase in the inflow of new workers
○ increase in home purchases → increase in housing prices due to housing shortage → factor decreasing real income
○ increased traffic congestion → factor decreasing real income
○ higher income raises demand for restaurant quality → increase in dining-out costs → factor decreasing real income
⑤ solutions
○ adjust the conclusions of the regression relationship to the target population and setting
○ meta-analysis: comparing conclusions of similar but not identical populations
2. panel data
⑴ overview
① panel data refers to data of the form (Xit, Yit), i = 1, ···, n, t = 1, ···, T, in which each entity i is observed in multiple time periods t
② balanced panel data : all entities are observed in all time periods
③ unbalanced panel data : if it is not balanced panel data
④ (comparison) repeated cross-sectional data
○ panel data tracks the same individuals over time
○ repeated cross-sectional data draws a new sample in each period
○ a repeated cross-section may happen to include the same person in the before and after data, and it is cheaper to collect
⑵ before and after regression model
① formula (see the supplement at the end of this subsection)
○ this model can remove factors that are constant over time
○ Z differs from the intercept because it takes different values depending on i
② a kind of fixed effect regression model
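○ (supplement) a LaTeX sketch of the differencing idea: under the model Yit = β0 + β1Xit + β2Zi + uit, with Zi constant over time, subtracting the two periods removes Zi:
$$Y_{i2} - Y_{i1} = \beta_1 (X_{i2} - X_{i1}) + (u_{i2} - u_{i1})$$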
⑶ fixed effect regression model
① major assumptions
○ assumption 1. E(uit | Xi1, ···, XiT, αi) = 0 : it is not sufficient that E(uit | Xit, αi) = 0 (∵ the time averages of Y and u use information from all periods)
○ assumption 2. (Xi1, ···, XiT, ui1, ···, uiT) is i.i.d. across entities under the joint distribution : this does not mean that cov(uit, uis) = 0 for t ≠ s
○ assumption 3. existence of 4th order moments
○ assumption 4. no perfect multicollinearity : Xit must vary with t
○ under the major assumptions, the fixed effects estimator satisfies consistency and asymptotic normality
○ even if n increases to infinity, the time average of Y for each entity does not become a consistent estimator (∵ its accuracy depends on T, which is unrelated to n)
② formula
○ the data should be viewed as a table indexed along the axes i and t
○ in the case of T = 2, the situation is the same as the before and after regression model
○ the standard error of the slope = clustered standard error = heteroskedasticity- and autocorrelation-consistent (HAC) standard error
○ there are not T separate regression lines for t = 1, ···, T; there is just one regression line
○ the coefficient is β1, not β1,t
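○ (supplement) a LaTeX sketch of the within (entity-demeaning) transformation that the algorithm below implements, with $\bar{Y}_i = \frac{1}{T}\sum_{t=1}^{T} Y_{it}$ (and likewise for X and u):
$$Y_{it} - \bar{Y}_i = \beta_1 (X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i)$$
○ OLS on the demeaned data then estimates β1, since αi is removed along with the entity mean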
③ an example of the algorithm (R)
data <- read.csv("C:/Users/sun/Desktop/Guns.csv", header = T)
attach(data)

# definition of the variables (column indices follow the layout of Guns.csv)
y <- data[, 2]
y <- log(y)
x1 <- data[, 13]
x2 <- data[, 5]
x3 <- data[, 11]
x4 <- data[, 10]
x5 <- data[, 9]
x6 <- data[, 6]
x7 <- data[, 7]
x8 <- data[, 8]

# entity (state) means; state ids 3, 7, 14, 43, 52 do not occur in the data
state_y <- array(dim = 56)
state_x1 <- array(dim = 56)
state_x2 <- array(dim = 56)
state_x3 <- array(dim = 56)
state_x4 <- array(dim = 56)
state_x5 <- array(dim = 56)
state_x6 <- array(dim = 56)
state_x7 <- array(dim = 56)
state_x8 <- array(dim = 56)
for(i in 1:56){
  if(i != 3 && i != 7 && i != 14 && i != 43 && i != 52){
    data_sub <- data[stateid == i, ]
    state_y[i] <- mean(log(data_sub[, 2]))  # mean of the logged variable, to match y
    state_x1[i] <- mean(data_sub[, 13])
    state_x2[i] <- mean(data_sub[, 5])
    state_x3[i] <- mean(data_sub[, 11])
    state_x4[i] <- mean(data_sub[, 10])
    state_x5[i] <- mean(data_sub[, 9])
    state_x6[i] <- mean(data_sub[, 6])
    state_x7[i] <- mean(data_sub[, 7])
    state_x8[i] <- mean(data_sub[, 8])
  }
}

# within transformation: subtract each entity's mean (1173 = 51 states x 23 years)
Y <- array(dim = 1173)
X1 <- array(dim = 1173)
X2 <- array(dim = 1173)
X3 <- array(dim = 1173)
X4 <- array(dim = 1173)
X5 <- array(dim = 1173)
X6 <- array(dim = 1173)
X7 <- array(dim = 1173)
X8 <- array(dim = 1173)
for(i in 1:dim(data)[1]){
  j <- data[i, 12]  # stateid
  Y[i] <- y[i] - state_y[j]
  X1[i] <- x1[i] - state_x1[j]
  X2[i] <- x2[i] - state_x2[j]
  X3[i] <- x3[i] - state_x3[j]
  X4[i] <- x4[i] - state_x4[j]
  X5[i] <- x5[i] - state_x5[j]
  X6[i] <- x6[i] - state_x6[j]
  X7[i] <- x7[i] - state_x7[j]
  X8[i] <- x8[i] - state_x8[j]
}

# OLS on the demeaned data gives the fixed effects estimates
RELATION <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8)
summary(RELATION)
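○ (note) the same within estimator can be obtained more concisely with the plm package; the sketch below assumes the column names of the Stock & Watson Guns.csv file (vio, shall, stateid, year, ...) and is only illustrative:
# fixed effects ("within") estimation with plm instead of manual demeaning
library(plm)
fe <- plm(log(vio) ~ shall + incarc_rate + density + avginc + pop +
            pb1064 + pw1064 + pm1029,
          data = data, index = c("stateid", "year"), model = "within")
summary(fe)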
④ when applying the fixed effect regression model yields a conclusion significantly different from the original result
○ this is a strong indication that the original model suffered from omitted variable bias
⑤ even if the main assumptions are satisfied, there can be autocorrelation
○ autocorrelation: uit and uit* (t ≠ t*) can be serially correlated; this is why HAC standard errors are used
○ the case without autocorrelation
○ the case with autocorrelation
○ the proof of cov(vit, vis) = 0 (assuming t ≠ s) in the case without autocorrelation
⑷ matrix notation of fixed effect regression model
① modeling
② assumption
③ fixed effects estimator
④ consistency
⑤ asymptotic normality
⑸ least squares dummy variables model (LSDV)
① formula
② the reason why D1i is not included: to avoid perfect multicollinearity
○ formula (see the supplement below)
○ if D1i were included together with the intercept, γ1 could not be identified
○ perfect multicollinearity caused by dummy variables is also called the dummy variable trap
③ the regression cannot be performed if the range of i (i.e., n) is too large, because there are too many regressors
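○ (supplement) a LaTeX sketch of the LSDV model for n entities, where Dji = 1 if observation i belongs to entity j and 0 otherwise:
$$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \cdots + \gamma_n Dn_i + u_{it}$$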
⑹ time effect
① time effect term: marked as λt
② modeling
③ an example of the algorithm (R)
data <- read.csv("C:/Users/sun/Desktop/Guns.csv", header = T)
attach(data)

# definition of the variables (column indices follow the layout of Guns.csv)
y <- data[, 2]
y <- log(y)
x1 <- data[, 13]
x2 <- data[, 5]
x3 <- data[, 11]
x4 <- data[, 10]
x5 <- data[, 9]
x6 <- data[, 6]
x7 <- data[, 7]
x8 <- data[, 8]

# elimination of fixed state effects (entity means; ids 3, 7, 14, 43, 52 unused)
state_y <- array(dim = 56)
state_x1 <- array(dim = 56)
state_x2 <- array(dim = 56)
state_x3 <- array(dim = 56)
state_x4 <- array(dim = 56)
state_x5 <- array(dim = 56)
state_x6 <- array(dim = 56)
state_x7 <- array(dim = 56)
state_x8 <- array(dim = 56)
for(i in 1:56){
  if(i != 3 && i != 7 && i != 14 && i != 43 && i != 52){
    data_sub <- data[stateid == i, ]
    state_y[i] <- mean(log(data_sub[, 2]))  # mean of the logged variable, to match y
    state_x1[i] <- mean(data_sub[, 13])
    state_x2[i] <- mean(data_sub[, 5])
    state_x3[i] <- mean(data_sub[, 11])
    state_x4[i] <- mean(data_sub[, 10])
    state_x5[i] <- mean(data_sub[, 9])
    state_x6[i] <- mean(data_sub[, 6])
    state_x7[i] <- mean(data_sub[, 7])
    state_x8[i] <- mean(data_sub[, 8])
  }
}
Y <- array(dim = 1173)
X1 <- array(dim = 1173)
X2 <- array(dim = 1173)
X3 <- array(dim = 1173)
X4 <- array(dim = 1173)
X5 <- array(dim = 1173)
X6 <- array(dim = 1173)
X7 <- array(dim = 1173)
X8 <- array(dim = 1173)
for(i in 1:dim(data)[1]){
  j <- data[i, 12]  # stateid
  Y[i] <- y[i] - state_y[j]
  X1[i] <- x1[i] - state_x1[j]
  X2[i] <- x2[i] - state_x2[j]
  X3[i] <- x3[i] - state_x3[j]
  X4[i] <- x4[i] - state_x4[j]
  X5[i] <- x5[i] - state_x5[j]
  X6[i] <- x6[i] - state_x6[j]
  X7[i] <- x7[i] - state_x7[j]
  X8[i] <- x8[i] - state_x8[j]
}

# elimination of fixed time effects (year means, net of the grand mean)
time_Y <- array(dim = 23)
time_X1 <- array(dim = 23)
time_X2 <- array(dim = 23)
time_X3 <- array(dim = 23)
time_X4 <- array(dim = 23)
time_X5 <- array(dim = 23)
time_X6 <- array(dim = 23)
time_X7 <- array(dim = 23)
time_X8 <- array(dim = 23)
for(t in 77:99){
  data_sub2 <- data[year == t, ]
  time_Y[t - 76] <- mean(log(data_sub2[, 2])) - mean(state_y, na.rm = TRUE)  # logged, to match y
  time_X1[t - 76] <- mean(data_sub2[, 13]) - mean(state_x1, na.rm = TRUE)
  time_X2[t - 76] <- mean(data_sub2[, 5]) - mean(state_x2, na.rm = TRUE)
  time_X3[t - 76] <- mean(data_sub2[, 11]) - mean(state_x3, na.rm = TRUE)
  time_X4[t - 76] <- mean(data_sub2[, 10]) - mean(state_x4, na.rm = TRUE)
  time_X5[t - 76] <- mean(data_sub2[, 9]) - mean(state_x5, na.rm = TRUE)
  time_X6[t - 76] <- mean(data_sub2[, 6]) - mean(state_x6, na.rm = TRUE)
  time_X7[t - 76] <- mean(data_sub2[, 7]) - mean(state_x7, na.rm = TRUE)
  time_X8[t - 76] <- mean(data_sub2[, 8]) - mean(state_x8, na.rm = TRUE)
}
YY <- array(dim = 1173)
XX1 <- array(dim = 1173)
XX2 <- array(dim = 1173)
XX3 <- array(dim = 1173)
XX4 <- array(dim = 1173)
XX5 <- array(dim = 1173)
XX6 <- array(dim = 1173)
XX7 <- array(dim = 1173)
XX8 <- array(dim = 1173)
for(i in 1:dim(data)[1]){
  j <- data[i, 1]  # year
  YY[i] <- Y[i] - time_Y[j - 76]
  XX1[i] <- X1[i] - time_X1[j - 76]
  XX2[i] <- X2[i] - time_X2[j - 76]
  XX3[i] <- X3[i] - time_X3[j - 76]
  XX4[i] <- X4[i] - time_X4[j - 76]
  XX5[i] <- X5[i] - time_X5[j - 76]
  XX6[i] <- X6[i] - time_X6[j - 76]
  XX7[i] <- X7[i] - time_X7[j - 76]
  XX8[i] <- X8[i] - time_X8[j - 76]
}

RELATION <- lm(YY ~ XX1 + XX2 + XX3 + XX4 + XX5 + XX6 + XX7 + XX8)
summary(RELATION)
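○ (note) the same two-way fixed effects fit can be written in one line in its LSDV form with factor() dummies; the sketch again assumes the Guns.csv column names, and the reported standard errors are not the clustered (HAC) ones recommended in ⑶:
# entity and time fixed effects via dummy variables (LSDV)
twfe <- lm(log(vio) ~ shall + incarc_rate + density + avginc + pop +
             pb1064 + pw1064 + pm1029 +
             factor(stateid) + factor(year), data = data)
summary(twfe)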
⑺ time effect regression using dummy variables
① formula
② the reason why B1t is not included : to avoid perfect multicollinearity
○ formula
○ if B1t were included together with the intercept, δ1 could not be identified
○ perfect multicollinearity caused by dummy variables is also called the dummy variable trap
3. instrumental variable
⑴ definition: a method of extracting only the exogenous variation in a regressor by using a third variable
⑵ simple expression
① modeling
○ if there is one regression variable
○ if there are multiple regression variables
○ endogenous variable : a variable correlated with ui
○ exogenous variable : a variable that is not correlated with ui
○ exactly identified : m = k
○ over-identified : m > k
○ under-identified : m < k
○ modeling cannot be performed in the under-identified case : there must be at least as many instrumental variables as endogenous regressors (m ≥ k)
○ the reason why W is included : it is useful when it is difficult to find a Z that meets the exogeneity criterion on its own
② assumptions for using instrumental variables
○ assumption 1. E(ui | W1i, ···, Wri) = 0
○ assumption 2. (X1i, ···, Xki, W1i, ···, Wri, Z1i, ···, Zmi, Yi) is i.i.d.
○ assumption 3. all variables have finite 4th order moment
○ assumption 4. instrumental variable effectiveness
○ 4-1. instrument relevance
○ 4-2. instrument exogeneity
○ 4-3. no perfect collinearity
○ if the assumptions are satisfied, the TSLS estimator satisfies consistency and asymptotic normality
③ procedure
○ if there is one regression variable
○ 1st. regress Xi on the instrumental variable Zi
○ 2nd. compute the fitted values X̂i
○ 3rd. regress Yi on the fitted values X̂i
○ if there are multiple regression variables
○ 1st. regress each Xℓi on the instrumental variables : for ℓ = 1, ···, k,
○ 2nd. compute the fitted values X̂ℓi : for ℓ = 1, ···, k,
○ 3rd. regress Yi on the fitted values X̂1i, ···, X̂ki
○ running the two OLS regressions by hand miscalculates the standard errors of the second stage (see the note below)
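○ (note) in practice both stages and the corrected standard errors are computed in one call; a minimal sketch using ivreg from the AER package on simulated data with hypothetical parameters:
# TSLS in one call: regressors before "|", instruments after
library(AER)
set.seed(1)
n <- 5000
z <- rnorm(n)                      # instrument: relevant and exogenous
u <- rnorm(n)
x <- 0.8 * z + 0.5 * u + rnorm(n)  # endogenous regressor (correlated with u)
y <- 1 + 2 * x + u                 # true beta1 = 2
coef(lm(y ~ x))[2]                 # OLS: biased upward
summary(ivreg(y ~ x | z))          # TSLS: close to 2, with correct standard errors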
④ two-step least squares (TSLS) estimator
○ formula
○ proof
○ (note) if Zi := Xi, the TSLS estimator of β1 is identical to the OLS estimator of β1
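○ (supplement) in the single-regressor case the TSLS estimator reduces to a ratio of sample covariances, which makes the note above immediate (setting Zi := Xi gives the OLS formula):
$$\hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} = \frac{\sum_{i}(Z_i - \bar{Z})(Y_i - \bar{Y})}{\sum_{i}(Z_i - \bar{Z})(X_i - \bar{X})}$$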
⑤ consistency
⑥ asymptotic normality
⑶ supplement of instrumental variable effectiveness
① instrument relevance
○ formula
○ weak instrumental variable: an instrumental variable that is not sufficiently correlated with the regressor; in this case the estimator can behave very badly
○ test of instrumental variable strength
○ when computing the 1st stage F statistic, the instrumental variable is considered strong if F is bigger than 10
○ available only under homoskedasticity
○ W1i, ···, Wri have nothing to do with the strength of an instrumental variable
② instrument exogeneity
○ formula
○ u would have to be observed to check instrument exogeneity directly, but it is unobservable
○ over-identifying restrictions test
○ when the following statistic is calculated, J follows a chi-squared distribution with m − k degrees of freedom
○ null hypothesis for J, H0 : all instrumental variables are exogenous
○ the logic is similar to instrument relevance : if the F statistic is small, there is no correlation (all coefficients are zero)
○ available only under homoskedasticity : many statistical programs also offer a heteroskedasticity-robust J test
○ when the null hypothesis is rejected, it cannot be determined which instrumental variable is endogenous
○ meaning of the degrees of freedom in the J statistic
○ k instrumental variables are used to construct the residuals: they correspond to the k endogenous regressors
○ the remaining m − k instrumental variables are used to test the correlation with the residuals
○ the J test cannot be applied in the exactly identified case because no instrumental variables remain for the correlation analysis: the J statistic is always zero in this case
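○ (supplement) a sketch of the J statistic construction: regress the TSLS residuals on (Z1i, ···, Zmi, W1i, ···, Wri), let F be the statistic for the hypothesis that all m instrument coefficients are zero, and then under H0
$$J = mF \xrightarrow{\;d\;} \chi^2_{m-k}$$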
③ no perfect collinearity
⑷ matrix notation
① modeling
○ Xi and Zi may overlap
② assumptions
○ Yi = Xiᵀβ + ui
○ (Yi, Xi, Zi), i = 1, ···, N are i.i.d.
○ E(ui | Zi) = 0
○ E(ZiXiᵀ) and E(ZiZiᵀ) are invertible
○ Zi, Xi, and ui have finite 4th order moments
③ procedure
○ 1st. regress Xi on the instrumental variables Zi
○ 2nd. compute the fitted values X̂i
○ 3rd. regress Yi on the fitted values X̂i
④ estimator
⑤ consistency
⑥ asymptotic normality
⑦ estimator of the variance of the normal distribution
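○ (supplement) a LaTeX sketch of the TSLS estimator in matrix form, writing PZ = Z(ZᵀZ)⁻¹Zᵀ for the projection onto the instruments:
$$\hat{\beta}^{TSLS} = (X^{\mathsf{T}} P_Z X)^{-1} X^{\mathsf{T}} P_Z Y$$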
⑸ exploration of instrumental variables: the exploration is in the realm of art
① Joshua Angrist (MIT)
② Steven Levitt (Chicago): published “Freakonomics”
③ Daron Acemoglu (MIT): published “Why Nations Fail”
4. randomized controlled experiment
⑴ overview
① definition: randomly extracting subjects from the population and then dividing the groups randomly again to perform different treatments
② randomized controlled experiments are rare in econometrics
③ randomized controlled experiments can remove omitted variable bias: even so, 100% validity is not guaranteed
④ it provides a standard against which causality can be judged
⑵ formula
① simple model
② model including additional regression variables
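○ (supplement) a LaTeX sketch of the two models, with Xi ∈ {0, 1} the treatment indicator so that β1 is the average treatment effect:
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \cdots + \beta_{1+r} W_{ri} + u_i$$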
③ reason for adding additional regression variables
○ reason 1. randomization check
○ regardless of whether additional regression variables are present or not, β1 is consistent
○ if β1 changes significantly depending on the presence or absence of the additional regressors, the assignment was not random
○ reason 2. efficiency: if there are additional regression variables, the variance is smaller
○ reason 3. conditional randomization
○ depending on individual characteristics, the assignment may not be random even if it appears to be random
○ sampling randomly with the additional regressors held fixed can minimize such concerns
○ the following conditional independence must be satisfied for the β1 estimator to be consistent: a weaker condition than independence
○ interaction: the treatment effect can depend on W
⑶ threats to internal validity
① failure to randomize
○ not only the treatment effect but also the non-random assignment effect appears
○ hypothesis test: if all coefficients are zero when Xi is regressed on the pre-treatment characteristics W1i, ···, Wri, the experiment can be regarded as a randomized experiment
○ example: if random assignment is based on names, a specific ethnic group may be preferentially assigned to the treatment group
② failure to follow treatment protocol (partial compliance)
○ definition: even though random assignment works well, subjects may not comply with the protocol
○ due to this, Xi can be correlated with ui
○ randomized encouragement design : partial compliance can be handled by using the random assignment as an instrumental variable for the actual treatment in an instrumental variable regression
③ attrition
○ definition: subjects dropping out of the sample, for reasons related to the treatment, after random assignment
④ Hawthorne effect
○ definition : the subjects' knowledge of the experiment they are participating in can affect the results of the experiment
○ in new drug research, a double-blind test can be used to avoid this issue
○ it is difficult to perform double-blind tests in econometrics
⑤ small sample
○ because research on human subjects is expensive, sample sizes tend to be small
○ many statistical inferences are based on asymptotic normality
○ if the sample size is small, the estimator's distribution should not be approximated by the normal distribution
⑷ threats to external validity
① non-representative sample
○ typical econometric experiments recruit undergraduate volunteers
○ volunteers are more motivated, so the measured effects can be overestimated
② non-representative program or policy
○ the experimental program or policy should be similar to the actual one
○ example: the experimental program is performed over a short period, while the real-life question of interest may require a longer horizon
③ general equilibrium effect
○ definition: treatment changes the overall environment, which can amplify or suppress the effectiveness of treatment
○ small-scale experiments do not reflect changes in the overall environment, so external validity must be considered separately
5. quasi-experiment
⑴ definition
① an experiment in which an independent variable is not under the control of a researcher and is conducted in a natural situation
② also known as natural experiment
③ objective: program evaluation
⑵ method 1. differences-in-differences (DID) estimator
① the simplest model (assuming panel data)
② model with additional regression variables (assuming panel data): because conditions may change between before and after data
③ criterion for repeated cross-sectional data
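○ (supplement) a LaTeX sketch of the DID estimator in terms of the four group means (treated/control, before/after):
$$\hat{\beta}_1^{DID} = (\bar{Y}^{\,treated,\,after} - \bar{Y}^{\,treated,\,before}) - (\bar{Y}^{\,control,\,after} - \bar{Y}^{\,control,\,before})$$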
⑶ method 2. instrumental variable regression
① 1st. define Zi as a variable that is as good as randomly assigned (the quasi-experimental analogue of treatment assignment in a randomized controlled experiment)
② 2nd. Zi is a good instrumental variable for Xi: instrument relevance is satisfied
③ 3rd. Yi is the outcome of interest
④ 4th. evaluate the effect of Xi on Yi using Zi as an instrumental variable
⑷ method 3. regression discontinuity design (RDD)
① overview
○ if a threshold (cut-off) ω0 is set, observations near the threshold are similar to one another
○ when observations on either side of the threshold are treated differently, the following difference in outcomes can be attributed entirely to the treatment effect
○ it is a very popular experimental technique
○ disadvantage : it is difficult to extend the regression discontinuity estimate to observations far from the threshold
② sharp regression discontinuity design
③ fuzzy regression discontinuity design
○ the treatment may not be assigned as cleanly as the Xi defined in the sharp regression discontinuity design
○ the following instrumental variable Zi can be a good instrument for the actual Xi
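○ (supplement) a LaTeX sketch of the sharp design: with running variable Wi and Xi = 1(Wi ≥ ω0), the treatment effect β1 can be estimated from
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 (W_i - \omega_0) + u_i$$
○ in the fuzzy design, Zi = 1(Wi ≥ ω0) is instead used as an instrument for the actual treatment Xi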
⑸ threats to internal validity
① failure to randomize
○ not only the treatment effect but also the non-random assignment effect appears
○ hypothesis test : if all coefficients are zero when Xi is regressed on the pre-treatment characteristics W1i, ···, Wri, the experiment can be regarded as a randomized experiment
○ example : if random processing is performed by name, a specific ethnic group may be assigned to the processing group preferentially
② failure to follow treatment protocol (partial compliance)
○ definition : even though random assignment works well, subjects may not comply with the protocol
○ due to this, Xi can be correlated with ui
○ randomized encouragement design: partial compliance can be handled by using the random assignment as an instrumental variable for the actual treatment in an instrumental variable regression
③ attrition
○ definition: subjects dropping out of the sample, for reasons related to the treatment, after random assignment
④ Hawthorne effect
○ there is little reason to worry about the Hawthorne effect in a quasi-experiment, because subjects are in their natural situation
⑤ instrumental variable effectiveness
○ instrument relevance can be evaluated from the data
○ instrument exogeneity may not hold even if the instrumental variable appears to be randomly assigned
○ example: in a study of income as a function of the draft lottery number, Xi and ui may be correlated if people with low numbers act to avoid conscription
⑹ threats to external validity
① non-representative sample
② non-representative program or policy
③ general equilibrium effect
⑺ criticism
① much effort goes into finding good instrumental variables in quasi-experiments
② there aren’t that many really good quasi-experiments
6. heterogeneous population
⑴ definition: the case in which the coefficients of the regression line, β0i and β1i, are not constants but vary across the sample
① β1i : heterogeneous effect of Xi
② the parameter of interest is E(β1i)
③ if β1i is observable, models using interaction can be used
④ if β1i is unobservable, it is analyzed as follows
⑵ OLS
① assumption : Xi should be random → Xi and (ui, β0i, β1i) should be independent
○ a condition that is difficult to satisfy in practice
② formula
⑶ instrumental variables estimation (IV)
① assumption : Zi should be random → Zi and (ui, vi, β0i, β1i, π0i, π1i) should be independent
② formula
○ E(β1iπ1i) / E(π1i) is called local average treatment effect (LATE)
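○ (supplement) a LaTeX sketch of the standard LATE result: with first stage Xi = π0i + π1i Zi + vi, the IV estimator converges to
$$\hat{\beta}_1^{IV} \xrightarrow{\;p\;} \frac{E(\beta_{1i}\pi_{1i})}{E(\pi_{1i})}$$
○ i.e., a weighted average of the individual effects β1i, weighted by how strongly each individual responds to the instrument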
③ conditions for equalizing LATE and ATE
○ case 1. β1i = β1 = constant : no heterogeneity in the treatment effect
○ case 2. π1i = π1 : no heterogeneity in the effect of the instrumental variable
○ case 3. β1i and π1i are independent
④ connotations
○ difficult to evaluate instrument exogeneity
○ the J-test only detects differences between the LATEs identified by different instruments