Chapter 20. Analysis of Variance of Regression Analysis
Higher category: 【Statistics】 Statistics Overview
1. ANOVA of simple linear regression analysis
2. ANCOVA
1. ANOVA of simple linear regression analysis
⑴ problem situation
| Age (year) | Number of mites |
|---|---|
| 3 | 5 |
| 6 | 13 |
| 9 | 16 |
| 12 | 14 |
| 15 | 18 |
| 18 | 23 |
| 21 | 20 |
| 24 | 32 |
| 27 | 29 |
| 30 | 28 |
Table 1. example data for ANOVA of simple linear regression analysis (age and number of mites)
⑵ table of t statistic
| factor | coefficient | standard error | t | significance |
|---|---|---|---|---|
| Intercept | 5.733 | 2.265 | 2.531 | 0.035 |
| Age | 0.853 | 0.122 | 7.006 | 0.001 |
Table 2. table of t statistic
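The values in Table 2 can be reproduced from the data in Table 1. The following is a minimal sketch using only the Python standard library; the variable names are chosen here for illustration and are not from the source.

```python
import math

# Data from Table 1
age = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
mites = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(mites) / n

# Sums of squares and cross-products around the means
sxx = sum((x - mean_x) ** 2 for x in age)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, mites))

# Least-squares estimates of the regression line
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual mean square with n - 2 degrees of freedom
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(age, mites))
ms_residual = ss_residual / (n - 2)

# Standard errors of the two coefficients
se_slope = math.sqrt(ms_residual / sxx)
se_intercept = math.sqrt(ms_residual * (1 / n + mean_x ** 2 / sxx))

# t statistic = coefficient / standard error
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print(f"Intercept: {intercept:.3f}  SE {se_intercept:.3f}  t {t_intercept:.3f}")
print(f"Age:       {slope:.3f}  SE {se_slope:.3f}  t {t_slope:.3f}")
```

The significance column is not computed here, because it requires the t distribution with n - 2 = 8 degrees of freedom, which the standard library does not provide.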
⑶ table of F statistic
| factor | sum of squares | df | mean square | F | significance |
|---|---|---|---|---|---|
| Regression | 539.648 | 1 | 539.648 | 49.086 | < 0.001 |
| Residual | 87.952 | 8 | 10.994 | | |
| Total | 627.600 | 9 | | | |
Table 3. table of F statistic
① null hypothesis H0 : the slope of the regression line is equal to zero
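② idea : if the MS of regression is much larger than the MS of residual (a large F ratio), H0 is rejected and the slope of the regression line is judged to be nonzero
③ calculation : F = MS of regression / MS of residual = 539.648 / 10.994 ≈ 49.086, with degrees of freedom (1, 8)
④ the reason why the degrees of freedom of the regression is 1 : there is only one regression variable (age)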
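The decomposition in Table 3 can be checked from the data in Table 1. The sketch below is one way to do it with the standard library; the variable names are illustrative.

```python
# Data from Table 1
age = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
mites = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(mites) / n

# Least-squares regression line
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(age, mites))
         / sum((x - mean_x) ** 2 for x in age))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in age]

# Total variation = variation explained by the line + residual variation
ss_total = sum((y - mean_y) ** 2 for y in mites)
ss_regression = sum((f - mean_y) ** 2 for f in fitted)
ss_residual = sum((y - f) ** 2 for y, f in zip(mites, fitted))

# Degrees of freedom: 1 regression variable, n - 2 for the residual
df_regression = 1
df_residual = n - 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

print(f"Regression: SS {ss_regression:.3f}  df {df_regression}  MS {ms_regression:.3f}")
print(f"Residual:   SS {ss_residual:.3f}  df {df_residual}  MS {ms_residual:.3f}")
print(f"Total:      SS {ss_total:.3f}  df {n - 1}")
print(f"F = {f_ratio:.3f}")
```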
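⑷ relationship between the F statistic and the t statistic : in simple linear regression, the two tests of the slope are equivalent
① with only one regression variable, F = t² : 7.006² ≈ 49.08, which matches F = 49.086 up to rounding
② the two p values are therefore identical; the difference between 0.001 in Table 2 and < 0.001 in Table 3 comes from how the values are reported, not from a difference in power
③ the F test differs from the individual t tests only when there are several regression variables, where it tests all slopes at once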
2. ANCOVA
⑴ overview
① concept of fusing simple linear regression analysis with one-way ANOVA
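② necessity : in real problems, a second variable (the covariate) often varies along with the factor of interest and also affects the dependent variable; this confounding effect must be controlled
③ difference from two-way ANOVA : in two-way ANOVA both factors are categorical, whereas the covariate in ANCOVA is continuous. ANCOVA does not compete with any particular ANOVA design; a covariate can be added to an ANOVA, so the two can be performed at the same time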
⑵ problem situation
① independent variable: whether it is a contaminated mine area or not
② dependent variable: lead concentration in organs of the rats
③ confounding factor (covariate): age
⑶ table of results without considering age
| factor | sum of squares | degree of freedom | mean square | F ratio |
|---|---|---|---|---|
| Treatment | SS Treatment | k-1 | MS Treatment = SS Treatment / (k-1) | F = MS Treatment / MS Error |
| Error | SS Error | N-k | MS Error = SS Error / (N-k) | |
| Total | SS Total | N-1 | | |
Table 4. table of results of simple one-way ANOVA
① if the age effect is not controlled, the residuals become larger
② as the residuals increase, MS Error increases and F ratio decreases
③ as F ratio decreases, the power decreases: in other words, it is difficult to prove the significance of the treatment
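The formulas in Table 4 can be sketched as follows. The lead concentrations below are hypothetical values made up for illustration (they are not the data behind the tables in this chapter), and the variable names are likewise illustrative.

```python
# Hypothetical lead concentrations (not from the source), one list per site
groups = {
    "mine": [12.0, 15.0, 11.0, 14.0, 13.0],
    "control": [10.0, 9.0, 12.0, 11.0, 8.0],
}

def mean(values):
    return sum(values) / len(values)

k = len(groups)                                   # number of treatment levels
all_values = [v for values in groups.values() for v in values]
N = len(all_values)                               # total number of observations
grand_mean = mean(all_values)

# SS Treatment: variation of the group means around the grand mean
ss_treatment = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# SS Error: variation of the observations around their own group mean
ss_error = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
# SS Total: variation of the observations around the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

ms_treatment = ss_treatment / (k - 1)
ms_error = ss_error / (N - k)
f_ratio = ms_treatment / ms_error

print(f"Treatment: SS {ss_treatment:.2f}  df {k - 1}  MS {ms_treatment:.2f}")
print(f"Error:     SS {ss_error:.2f}  df {N - k}  MS {ms_error:.2f}")
print(f"F = {f_ratio:.2f}")
```

Any variation caused by age stays inside SS Error in this analysis, which is why MS Error is inflated when the covariate is ignored.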
⑷ assumptions
① homoscedasticity
② independence
③ normality
④ the relationship between a covariate and the dependent variable should be linear
⑤ parallelism
○ for example, the regression lines fitted separately for the contaminated mine area and the non-contaminated area should have the same slope
○ satisfying parallelism means the same thing as no interaction between the covariate and the independent variable
○ if parallelism is not satisfied, comparing the groups at one selected value (e.g., the overall mean of age) cannot represent the entire range of the covariate
![]()
Figure 1. example of lack of parallelism in ANCOVA
○ before ANCOVA, the interaction of age and region should be evaluated to confirm parallelism
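One common way to evaluate the age × site interaction is to compare a model with separate slopes per site against a model with one common slope. The sketch below does this with hypothetical data (the ages and lead concentrations are made up for illustration and are not the data of this chapter); the function and variable names are also illustrative.

```python
# Hypothetical data (not from the source): (ages, lead concentrations) per site
data = {
    "mine": ([2, 4, 6, 8, 10], [20, 24, 31, 35, 40]),
    "control": ([6, 8, 10, 12, 14], [18, 23, 26, 31, 37]),
}

def centered_sums(xs, ys):
    """Return Sxx, Sxy, Syy computed around the group means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy

sums = {site: centered_sums(xs, ys) for site, (xs, ys) in data.items()}
n_total = sum(len(xs) for xs, _ in data.values())

# Full model: each site has its own slope and intercept (4 parameters)
slopes = {site: sxy / sxx for site, (sxx, sxy, _) in sums.items()}
sse_full = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in sums.values())
df_full = n_total - 4

# Reduced model: one common slope, separate intercepts (parallel lines)
sxx_w = sum(s[0] for s in sums.values())
sxy_w = sum(s[1] for s in sums.values())
syy_w = sum(s[2] for s in sums.values())
common_slope = sxy_w / sxx_w
sse_parallel = syy_w - sxy_w ** 2 / sxx_w

# Interaction: extra error caused by forcing the two slopes to be equal (1 df)
ss_interaction = sse_parallel - sse_full
f_interaction = (ss_interaction / 1) / (sse_full / df_full)

print("slopes per site:", slopes)
print(f"Age x Site: F(1, {df_full}) = {f_interaction:.3f}")
```

The resulting F ratio is then compared with the F distribution with (1, N - 4) degrees of freedom; if it is not significant, parallelism is accepted and ANCOVA can proceed.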
⑸ procedure
① 1st. confirm the correlation between age and lead concentration
![]()
Figure 2. correlation between age and lead concentration
② 2nd. confirm that the interaction of age and region is not statistically significant
| factor | sum of squares | degree of freedom | mean square | F ratio | p value |
|---|---|---|---|---|---|
| Age | | | | | |
| Site | | | | | |
| Age × Site | | | | | NS |
| Error | | | | | |
| Total | | | | | |
Table 5. table of results including interaction term
③ 3rd. calculate the regression line of lead concentration according to age
![]()
Figure 3. regression line of lead concentration according to age
④ 4th. from the regression line obtained in the 3rd step, calculate two regression lines (one per site) that satisfy the following conditions
![]()
Figure 4. calculation of two regression lines
○ keep the slope of the regression line obtained in the 3rd step and change only the y-intercept
○ choose each y-intercept so that the sum of squared residuals is minimized within each level of the independent variable
⑤ 5th. calculate the residuals from each regression line obtained from 4th step
![]()
Figure 5. calculation of residuals
⑥ 6th. calculate the average age of the entire group; the value of each regression line at that age is designated as the standard value of that treatment group
○ the average age of the entire group is just one convenient choice: because the lines are parallel, the difference between the standard values is the same at any age
![]()
Figure 6. calculation of standard value
⑦ 7th. plot the residuals obtained from the 5th step above and below the standard value of each treatment group
![]()
Figure 7. final result
⑧ 8th. finally, SS Error is smaller than in the one-way ANOVA, so the F ratio is larger and the p value is smaller
![]()
Figure 8. comparison of results
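The 3rd to 8th steps can be sketched as below. The data are hypothetical (made up for illustration, not the data of this chapter) and the names are illustrative. The sketch follows the steps literally, keeping the slope of the single regression line from the 3rd step; statistical software usually estimates the common slope together with the intercepts, so its numbers can differ from this simplified walk-through.

```python
# Hypothetical data (not from the source): (ages, lead concentrations) per site
data = {
    "mine": ([2, 4, 6, 8, 10], [20, 24, 31, 35, 40]),
    "control": ([6, 8, 10, 12, 14], [18, 23, 26, 31, 37]),
}

def mean(values):
    return sum(values) / len(values)

all_age = [x for xs, _ in data.values() for x in xs]
all_lead = [y for _, ys in data.values() for y in ys]

# 3rd step: one regression line of lead concentration on age for all rats
mean_age, mean_lead = mean(all_age), mean(all_lead)
slope = (sum((x - mean_age) * (y - mean_lead) for x, y in zip(all_age, all_lead))
         / sum((x - mean_age) ** 2 for x in all_age))

intercepts, residuals, standard = {}, {}, {}
for site, (xs, ys) in data.items():
    # 4th step: keep the slope; the intercept that minimizes the sum of squares
    # within a site makes the line pass through that site's means
    intercepts[site] = mean(ys) - slope * mean(xs)
    # 5th step: residuals from the site's own line
    residuals[site] = [y - (intercepts[site] + slope * x) for x, y in zip(xs, ys)]
    # 6th step: standard value = height of the site's line at the overall mean age
    standard[site] = intercepts[site] + slope * mean_age

# 7th step: place the residuals above and below the standard value
adjusted = {site: [standard[site] + r for r in residuals[site]] for site in data}

# 8th step: compare SS Error without and with the age correction
ss_error_anova = sum((y - mean(ys)) ** 2 for _, ys in data.values() for y in ys)
ss_error_ancova = sum(r ** 2 for rs in residuals.values() for r in rs)

print("standard values:", standard)
print(f"SS Error without age: {ss_error_anova:.2f}")
print(f"SS Error with age:    {ss_error_ancova:.2f}")
```

With these hypothetical numbers the raw group means differ by 3, while the standard values differ by about 8.5, because the mine rats are younger on average.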
⑹ result
① correction result
![]()
Figure 9. lead concentration in contaminated mine areas
![]()
Figure 10. lead concentration in non-contaminated mine areas
② table of results before correction
| factor | sum of squares | degree of freedom | mean square | F ratio | p value |
|---|---|---|---|---|---|
| Site | 320 | 1 | 320 | 2.74 | 0.115 |
| Error | 2100.8 | 18 | 116.71 | | |
| Total | 2420.800 | 19 | | | |
Table 6. table of results before correction
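③ table of results after correction: the sum of squares for Age can be calculated from the regression line; it equals the reduction in SS Error from Table 6 (2100.8 - 324.510 = 1776.290), so the listed sums of squares are adjusted for each other and do not add up to SS Total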
| factor | sum of squares | degree of freedom | mean square | F ratio | p value |
|---|---|---|---|---|---|
| Age | 1776.290 | 1 | 1776.290 | 93.054 | < 0.001 |
| Site | 1094.335 | 1 | 1094.335 | 57.329 | < 0.001 |
| Error | 324.510 | 17 | 19.089 | | |
| Total | 2420.800 | 19 | | | |
Table 7. table of results after correction
④ report example : “A preliminary analysis for parallelism showed no significant difference between the slopes of the lines for lead concentration in relation to age (age × site: F(1,16) = 0.00, NS). The subsequent ANCOVA showed a significant effect of site (F(1,17) = 57.329, P < 0.001) as well as a significant effect of the covariate (age) (F(1,17) = 93.054, P < 0.001). Rats from the mine site had higher levels of lead than those from the control site.”
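The arithmetic of Tables 6 and 7 can be checked directly from the reported sums of squares. The sketch below uses only the values printed in the tables; the dictionary names are illustrative.

```python
# Values copied from Table 6 (before correction)
ss_site_before, ss_error_before = 320.0, 2100.8
df_site_before, df_error_before = 1, 18

# Values copied from Table 7 (after correction)
ss = {"Age": 1776.290, "Site": 1094.335, "Error": 324.510}
df = {"Age": 1, "Site": 1, "Error": 17}

# Before correction: F = MS Site / MS Error
f_site_before = (ss_site_before / df_site_before) / (ss_error_before / df_error_before)

# After correction: mean square = sum of squares / degrees of freedom
ms = {name: ss[name] / df[name] for name in ss}
f_age = ms["Age"] / ms["Error"]
f_site = ms["Site"] / ms["Error"]

# The Age sum of squares equals the drop in SS Error between the two tables
age_ss_from_reduction = ss_error_before - ss["Error"]

print(f"before: F(Site) = {f_site_before:.2f}")
print(f"after:  MS Error = {ms['Error']:.3f}, F(Age) = {f_age:.3f}, F(Site) = {f_site:.3f}")
print(f"SS Error reduction = {age_ss_from_reduction:.3f}")
```

Adding age removes most of the error variation (2100.8 to 324.510) at the cost of one error degree of freedom, which is what turns the non-significant site effect into a significant one.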
⑺ the reason for not comparing the y-intercept of the regression line in the contaminated mine area with that of the regression line in the other area
① situation : if parallelism is satisfied, it may seem simpler to compare the y-intercepts of the two regression lines directly
② comparing y-intercepts is similar to comparing sample groups with only one sample each, because each group is reduced to a single value
③ ANCOVA uses the error terms of all observations in both groups, so it is similar to comparing sample groups with twice as many elements as the size of one given sample group
④ therefore, performing ANCOVA has higher power than simply comparing y-intercepts
⑻ application 1. 2-factor ANCOVA
① applying the ANCOVA technique when analyzing 2-way ANOVA
② example: when the independent variables are gender and drug treatment, the dependent variable is blood pressure, and the confounding factor is age
⑼ application 2. if there are multiple confounding factors
① multiple linear regression analysis is used
② advanced regression analysis may also be used
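With several confounding factors, the treatment can be coded as a 0/1 variable and fitted together with the covariates by multiple linear regression. The sketch below solves the normal equations with the standard library only. The data are hypothetical and were constructed without noise (lead = 5 + 8·site + 2·age + 0.5·weight), and weight is an assumed second covariate that does not appear in the source.

```python
# Hypothetical, noise-free data (not from the source)
site = [1, 1, 1, 1, 0, 0, 0, 0]                      # 1 = mine, 0 = control
age = [2, 4, 6, 8, 6, 8, 10, 12]                     # first covariate
weight = [2.0, 2.5, 2.3, 3.0, 2.1, 2.6, 2.4, 3.1]    # assumed second covariate
lead = [18.0, 22.25, 26.15, 30.5, 18.05, 22.3, 26.2, 30.55]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

# Design matrix: intercept, site indicator, and the two covariates
rows = [[1.0, s, a, w] for s, a, w in zip(site, age, weight)]
p = len(rows[0])

# Normal equations: (X'X) b = X'y
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * y for r, y in zip(rows, lead)) for i in range(p)]
coef = solve(xtx, xty)

for name, value in zip(["intercept", "site", "age", "weight"], coef):
    print(f"{name}: {value:.3f}")
```

The coefficient of site is the difference between the two areas after adjusting for both covariates, which plays the same role as the difference between the standard values in the single-covariate case.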
Input: 2019.12.07 23:04