Summary of Statistics
Recommended Articles : 【Statistics】 Table of Contents
1. Data, Information, Knowledge
⑴ Data : raw values as given, before interpretation
⑵ Information : data that has been given a name (labeled data)
⑶ Knowledge : relationships among pieces of information
2. Ratio Scale, Interval Scale, Ordinal Scale, Nominal Scale
⑴ Ratio Scale : an absolute zero point exists, so ratios between values are meaningful. Example: absolute (Kelvin) temperature
⑵ Interval Scale : no absolute zero point, so ratios are not meaningful. Example: Celsius temperature
⑶ Ordinal Scale : values carry order, but differences between them are not meaningful. Example: rankings
⑷ Nominal Scale : categories with no inherent order. Example: gender
3. Difference between Hypothetico-Deductive Method and Data Science
⑴ The hypothetico-deductive method formulates a hypothesis first and then conducts experiments to test it
⑵ Data science conducts experiments (collects data) first and then formulates hypotheses
4. Accuracy and Precision
⑴ Accuracy : how close the sample mean is to the population mean (low bias)
⑵ Precision : how small the variance of the sample is (low spread)
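For illustration, a minimal NumPy sketch (hypothetical measurement settings) contrasting an inaccurate-but-precise estimator with an accurate-but-imprecise one:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 10.0  # true population mean

# Hypothetical settings: one biased but stable, one unbiased but noisy.
biased_stable = rng.normal(mu + 2.0, 0.1, size=1000)   # precise, inaccurate
unbiased_noisy = rng.normal(mu, 3.0, size=1000)        # accurate, imprecise

for name, x in [("biased+stable", biased_stable), ("unbiased+noisy", unbiased_noisy)]:
    print(f"{name}: bias = {x.mean() - mu:+.3f}, std = {x.std(ddof=1):.3f}")
```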
5. Confounding Effect
⑴ When a third factor affects both the manipulated variable and the dependent variable
⑵ Why correlation does not imply causation
6. Meaning of Batch Effect
⑴ Replicate experiments run under inconsistent, batch-specific conditions
⑵ When variables that should be controlled across batches are not properly controlled, statistical conclusions can be incorrect
7. Meaning of Observational Study and Experimental Study
⑴ An observational study tests hypotheses without changing conditions. Commonly applied in data science.
⑵ An experimental study tests hypotheses by deliberately changing conditions. Applied in conventional scientific methodology.
8. Difference between Replication Experiment and Replication Measurement
⑴ Replication experiments relate to accuracy. Example: testing a drug on multiple patients.
⑵ Replication measurements relate to precision. Example: measuring the same patient multiple times.
9. Comparison between Mean and Median
⑴ In a right-skewed distribution, the median lies to the left of the mean, since the mean is pulled toward the long tail (see the sketch below)
⑵ The median splits the distribution so that the area (probability) on each side is equal
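A quick NumPy sketch, using a hypothetical right-skewed exponential sample, showing the mean pulled to the right of the median:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # right-skewed sample

# The long right tail pulls the mean above the median.
print(f"median = {np.median(x):.3f}, mean = {x.mean():.3f}")  # mean > median
```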
10. Quantile-Quantile (Q-Q) Plot
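A Q-Q plot compares sample quantiles against the theoretical quantiles of a reference distribution; points near the diagonal indicate a good distributional fit. A minimal SciPy/matplotlib sketch with a hypothetical heavy-tailed sample:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.standard_t(df=3, size=500)  # heavy-tailed sample

# Points bending away from the line at both ends indicate heavy tails.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Q-Q plot against the normal distribution")
plt.show()
```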
11. Meaning of Independence
⑴ One variable provides no information about another variable
⑵ P(X = x, Y = y) = P(X = x) × P(Y = y)
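A minimal simulation sketch (two hypothetical independent dice) checking that the joint probability matches the product of the marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.integers(1, 7, size=n)  # fair die
Y = rng.integers(1, 7, size=n)  # second, independent fair die

# For independent variables, P(X=x, Y=y) should match P(X=x) * P(Y=y).
joint = np.mean((X == 3) & (Y == 5))
product = np.mean(X == 3) * np.mean(Y == 5)
print(f"P(X=3, Y=5) = {joint:.4f}, P(X=3)*P(Y=5) = {product:.4f}")
```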
12. Central Limit Theorem
Regardless of the shape of the population distribution, the distribution of sample means approaches a normal distribution as the sample size grows
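A minimal NumPy sketch illustrating this with a skewed (exponential) population; the sample size of 50 and the number of replicates are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: a strongly skewed exponential distribution (not normal).
# Means of samples of size 50 are nevertheless close to normal.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

print(f"mean of sample means = {sample_means.mean():.3f} (population mean = 1)")
print(f"std of sample means  = {sample_means.std(ddof=1):.3f} "
      f"(theory: 1/sqrt(50) = {1 / np.sqrt(50):.3f})")
```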
13. Chi-Square Goodness-of-Fit Test
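The goodness-of-fit test compares observed category counts against counts expected under a hypothesized distribution. A minimal SciPy sketch with hypothetical die-roll counts:

```python
from scipy import stats

# Hypothetical counts over 120 throws; H0: all six faces equally likely.
observed = [25, 18, 22, 15, 23, 17]
expected = [20] * 6

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # large p: no evidence against fairness
```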
14. Chi-Square Independence Test
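The independence test checks whether two categorical variables in a contingency table are independent. A minimal SciPy sketch with a hypothetical 2x2 table:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table: treatment group vs outcome.
table = np.array([[30, 10],
                  [20, 25]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# Small p: reject H0 that row and column variables are independent.
```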
15. Characteristics of the t-Distribution
⑴ The t-distribution has heavier (fatter) tails than the standard normal distribution
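A quick SciPy check of the heavier tails, comparing two-sided tail probabilities beyond |3|:

```python
from scipy import stats

# P(|Z| > 3) under the standard normal vs P(|T| > 3) under t-distributions:
# the t-distribution puts noticeably more probability in the tails.
print(f"normal : {2 * stats.norm.sf(3):.5f}")
for df in (3, 10, 30):
    print(f"t(df={df}): {2 * stats.t.sf(3, df):.5f}")
```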
16. Considerations when Formulating the Alternative Hypothesis
⑴ There should be potential for refutation
17. Parametric Testing vs. Non-Parametric Testing
⑴ Parametric Testing
① Generally used when the sample distribution follows (approximately) a normal distribution
② p-values are computed from the parameters of an assumed distribution
⑵ Non-Parametric Testing
① Generally used when the sample distribution does not follow a normal distribution
② p-values are computed without distributional parameters (e.g., from ranks)
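A minimal SciPy sketch contrasting the two approaches on the same hypothetical data: a parametric t-test versus the rank-based Mann-Whitney U test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.8, 1.0, size=30)

# Parametric: t-test assumes (approximately) normal populations.
t_stat, t_p = stats.ttest_ind(a, b)
# Non-parametric: Mann-Whitney U uses only ranks, no distribution parameters.
u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")

print(f"t-test p = {t_p:.4f}, Mann-Whitney p = {u_p:.4f}")
```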
18. Reasons for One-Tailed Testing
⑴ Situation : when one direction of the effect can be confidently ruled out in advance
⑵ Advantages : Can reduce Type II errors
① Type II Error : failing to reject the null hypothesis when the alternative hypothesis is true
② Makes it easier for the experimenter to reach the desired (significant) conclusion
⑶ Disadvantage 1. Can lead to incorrect statistical conclusions : the p-value is effectively halved (underestimated)
⑷ Disadvantage 2. The directional choice must be justified (one must convince oneself, and others, before seeing the data)
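A minimal SciPy sketch of the p-value halving, on hypothetical one-sample data (the `alternative` keyword requires SciPy >= 1.6):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=30)  # hypothetical sample, true mean > 0

# Two-tailed vs one-tailed test of H0: mean = 0.
_, p_two = stats.ttest_1samp(x, 0.0)
_, p_one = stats.ttest_1samp(x, 0.0, alternative="greater")

# When the effect lies in the hypothesized direction, the one-tailed p-value
# is half the two-tailed one, so rejection becomes easier (fewer Type II errors).
print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```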
19. Experimental Design : Shoe Experiment
⑴ The left shoe and the right shoe differ significantly in shape and usage; the issue is how to compare the two fairly
20. Experimental Design
⑴ Problem : whether to run the experiment on a genetically diverse group or a genetically uniform group
⑵ Answer : the genetically diverse group
① Real populations are genetically diverse : conclusions drawn from a genetically uniform group may not hold in practice
② Significant conclusions can still be obtained through follow-up tests : subsequent application of the hypothetico-deductive method
21. Power of the Test
⑴ Power is the probability of rejecting the null hypothesis when the alternative is true (1 - β); increasing power means using statistical techniques that yield smaller p-values while α is held constant
⑵ Example 1. Tests based on the t-distribution have higher power at higher degrees of freedom
① Higher degrees of freedom make the t-distribution more similar to the normal distribution
② Higher degrees of freedom narrow the t-distribution, lowering the critical value and raising power (see the sketch below)
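A quick SciPy sketch showing the two-tailed critical value shrinking toward the normal's as the degrees of freedom grow:

```python
from scipy import stats

# Critical value for a two-tailed test at alpha = 0.05: as df grows, the
# t-distribution narrows toward the normal, so a smaller |t| suffices to reject.
for df in (5, 10, 30, 100):
    print(f"df={df:>3}: t_crit = {stats.t.ppf(0.975, df):.3f}")
print(f"normal : z_crit = {stats.norm.ppf(0.975):.3f}")
```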
⑶ Example 2. Two-sample t-test has higher power than paired sample t-test
① Paired sample t-test : essentially a one-sample test on the differences, so the degrees of freedom are n-1
② Two-sample t-test : two independent samples, so the degrees of freedom are n+m-2
③ Increasing degrees of freedom in the t-distribution leads to higher power, so two-sample t-test has higher power
⑷ Example 3. In the two-sample t-test, assuming equal variances gives higher power than not assuming them
① With the equal-variance assumption (pooled t-test), the degrees of freedom are n+m-2
② Without it (Welch's t-test), the degrees of freedom follow the Welch-Satterthwaite approximation and are generally smaller (see the sketch below)
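A minimal SciPy sketch contrasting the pooled and Welch variants on the same hypothetical samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=20)
b = rng.normal(0.5, 1.0, size=25)

# Pooled (equal-variance) test uses df = n + m - 2 = 43;
# Welch's test estimates df via the Welch-Satterthwaite equation, usually smaller.
_, p_pooled = stats.ttest_ind(a, b, equal_var=True)
_, p_welch = stats.ttest_ind(a, b, equal_var=False)
print(f"pooled p = {p_pooled:.4f}, Welch p = {p_welch:.4f}")
```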
⑸ Example 4. In regression analysis, F-test has higher power than t-test
⑸ Example 5. On data satisfying the parallelism assumption, ANCOVA has higher power than separately comparing the y-intercepts of each group's regression line
① Comparing y-intercepts separately works at the sample-size level of a single group
② ANCOVA estimates a single pooled error term across groups, so the effective sample size is about twice that of a single group
22. Reasons Not to Conduct Pairwise t-Tests When There Are Multiple Groups
Because Type I errors accumulate across many pairwise tests, even groups with little real difference are likely to be declared different (see the sketch below)
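A minimal sketch of the accumulation, assuming independent tests at α = 0.05:

```python
from math import comb

alpha = 0.05
for k in (3, 5, 10):              # number of groups
    m = comb(k, 2)                # number of pairwise t-tests
    fwer = 1 - (1 - alpha) ** m   # chance of >= 1 false positive if all H0 true
    print(f"{k} groups -> {m} tests -> familywise error rate = {fwer:.2f}")
```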
23. Assumptions of ANOVA Analysis
⑴ Normality : All data are sampled from populations that follow normal distributions
⑵ Independence : All data are independently sampled from populations
⑶ Homoscedasticity : All data, regardless of different means, are sampled from populations with the same variance
24. Meaning of Robustness
Even under heteroscedasticity or non-normality, statistical conclusions (rejecting or failing to reject the null hypothesis) remain unchanged when the sample size is large or measurements are repeated within categories
25. Assumptions of Linear Correlation
⑴ Randomly sampled data
⑵ Each variable is sampled from populations that follow normal distributions
⑶ The relationship appears linear
26. Difference between Correlation and Regression Analysis
⑴ Correlation simply indicates the degree of the relationship between variables
⑵ Regression analysis models a directional relationship from independent variables to the dependent variable. Since its goal is prediction, it does not by itself prove an actual causal relationship
27. Assumptions of Regression Model
⑴ Y values are assumed to be measured from populations that satisfy normality and homoscedasticity
⑵ Independent variable X is assumed to be measured without error : Difficult to satisfy in practice
⑶ The dependent variable is assumed to be determined by the independent variable
⑷ The relationship between X and Y is assumed to be linear
28. Meaning of Multicollinearity
In multiple linear regression, two or more independent variables have strong correlations, causing problems by increasing the standard errors of the regression coefficients
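A minimal statsmodels sketch (hypothetical data with two nearly collinear predictors) using the variance inflation factor (VIF) to flag multicollinearity:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                  # unrelated predictor

X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.1f}")
# VIF much greater than 10 for x1 and x2 flags multicollinearity; x3 stays near 1.
```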
29. Assumptions of ANCOVA
⑴ Homoscedasticity
⑵ Independence
⑶ Normality
⑷ Linearity between covariate and dependent variable
⑸ Parallelism
30. Steps in ANCOVA
⑴ 1st. Verify that the interaction between the independent variable and the covariate is not statistically significant
⑵ 2nd. Fit the regression of the dependent variable on the covariate
⑶ 3rd. Fit a regression line for each level of the independent variable, adjusting only the y-intercept so as to minimize the sum of squared errors
⑷ 4th. Calculate residuals for the data within each level from that level's regression line
⑸ 5th. Evaluate each level's regression line at the overall mean of the covariate to obtain adjusted values
⑹ 6th. Reconstruct adjusted data for each level by adding the residuals back onto the adjusted values
⑺ 7th. Perform ANOVA on the adjusted data
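A minimal statsmodels sketch on hypothetical data; it fits ANCOVA as a single linear model (group + covariate) rather than the step-by-step adjustment above, after first checking the parallelism (interaction) assumption:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "group": np.repeat(["control", "treated"], n),
    "covariate": rng.normal(50.0, 10.0, size=2 * n),
})
# Hypothetical data: a true group effect of 5 on top of a linear covariate effect.
df["y"] = (2.0 + 0.4 * df["covariate"]
           + np.where(df["group"] == "treated", 5.0, 0.0)
           + rng.normal(0.0, 2.0, size=2 * n))

# Step 1: the group-by-covariate interaction should be non-significant (parallelism).
inter = smf.ols("y ~ C(group) * covariate", data=df).fit()
print(inter.pvalues.filter(like=":"))  # p-value of the interaction term

# Then fit the ANCOVA model: group effect adjusted for the covariate.
ancova = smf.ols("y ~ C(group) + covariate", data=df).fit()
print(ancova.summary().tables[1])
```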
31. Meaning of Parallelism
⑴ Parallelism refers to the assumption that the regression slopes are the same across groups, e.g., when fitting regression lines for contaminated and non-contaminated mining areas
⑵ If parallelism is not satisfied, a comparison made at a single chosen covariate value (e.g., the overall mean age) cannot represent the entire range of the covariate
32. Why Overfitting Should Be Avoided in Machine Learning
⑴ Overfitting incorporates noise specific to the sample, which reduces predictive accuracy on new data
⑵ For this reason, practical machine learning deliberately introduces error (e.g., regularization or noise) at each step (see the sketch below)
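A minimal NumPy sketch on hypothetical noisy sine data: a high-degree polynomial fits the training noise almost exactly but generalizes worse than a modest one:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=15)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

# The high-degree fit chases the training noise; its test error suffers.
for degree in (3, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:>2}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```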
Input : 2019-12-10 00:07