
Summary of Statistics

Recommended Articles : 【Statistics】 Table of Contents



1. Data, Information, Knowledge

⑴ Data : Raw values as given

⑵ Information : Data with a name (meaning) attached

⑶ Knowledge : Relationships among pieces of information



2. Ratio Scale, Interval Scale, Ordinal Scale, Nominal Scale

⑴ Ratio Scale : An absolute zero point exists, so ratios are meaningful. e.g., absolute temperature

⑵ Interval Scale : No absolute zero point exists, so ratios are not meaningful. e.g., Celsius temperature

⑶ Ordinal Scale : Order is meaningful, but the intervals between values are not. e.g., rankings

⑷ Nominal Scale : Categories with no inherent order. e.g., gender



3. Difference between Hypothetico-Deductive Method and Data Science

⑴ The hypothetico-deductive method sets a hypothesis first and then conducts experiments to test it

⑵ Data science collects data (conducts experiments) first and then generates hypotheses



4. Accuracy and Precision

⑴ Accuracy describes how close the sample mean is to the population mean

⑵ Precision describes how small the variance of repeated measurements is
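As a rough illustration, the sketch below simulates two hypothetical measurement procedures with numpy: one precise but inaccurate (small spread, biased), the other accurate but imprecise (unbiased, large spread). The numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

# Two hypothetical measurement procedures for the same quantity
precise_but_biased = rng.normal(loc=11.0, scale=0.1, size=1000)  # small spread, off target
accurate_but_noisy = rng.normal(loc=10.0, scale=2.0, size=1000)  # on target, large spread

for name, x in [("precise but inaccurate", precise_but_biased),
                ("accurate but imprecise", accurate_but_noisy)]:
    print(name, "| bias:", round(x.mean() - true_mean, 3),
          "| spread (sd):", round(x.std(ddof=1), 3))
```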



5. Confounding Effect

⑴ When a third factor affects both the manipulated variable and the dependent variable

⑵ Why correlation does not imply causation



6. Meaning of Batch Effect

⑴ Systematic differences between repeated experiments (batches) that are not due to the variable of interest

⑵ When the variables that differ between batches are not properly controlled, incorrect statistical conclusions follow



7. Meaning of Observational Study and Experimental Study

⑴ An observational study tests hypotheses without changing conditions. Commonly used in data science.

⑵ An experimental study tests hypotheses by deliberately changing conditions. Used in the general scientific method.



8. Difference between Replicate Experiments and Repeated Measurements

⑴ Replicate experiments relate to accuracy. e.g., testing a drug on multiple patients.

⑵ Repeated measurements relate to precision. e.g., measuring the same patient multiple times.



9. Comparison between Mean and Median


[Figure: mean vs. median in a skewed distribution]


⑴ Left is median, right is mean

⑵ The median splits the area under the distribution into two equal halves
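A minimal sketch (hypothetical right-skewed data) showing why the mean and the median separate in a skewed distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed data (income-like): the mean is pulled toward
# the long right tail, while the median stays near the bulk of the data.
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

print("mean  :", data.mean())      # larger
print("median:", np.median(data))  # smaller
```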



10. Quantile-Quantile (Q-Q) Plot


[Figure: quantile-quantile (Q-Q) plot]
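A minimal sketch of drawing a Q-Q plot in Python with scipy and matplotlib; the exponential sample is made up to show a clear departure from normality.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)   # deliberately non-normal data

# probplot compares the sample quantiles with theoretical normal quantiles;
# points far from the reference line indicate departure from normality.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Q-Q plot against the normal distribution")
plt.show()
```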



11. Meaning of Independence

⑴ One variable provides no information about another variable

⑵ P(X = x, Y = y) = P(X = x) × P(Y = y)
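A quick empirical check of the product rule with two independently generated fair coin flips (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, size=n)   # independent fair coin flips
y = rng.integers(0, 2, size=n)

p_joint = np.mean((x == 1) & (y == 1))          # P(X = 1, Y = 1)
p_product = np.mean(x == 1) * np.mean(y == 1)   # P(X = 1) * P(Y = 1)
print(p_joint, p_product)                       # approximately equal under independence
```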



12. Central Limit Theorem

Regardless of the shape of the population distribution (provided its variance is finite), the distribution of sample means approaches a normal distribution as the sample size increases
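A minimal simulation sketch of the theorem: even for a strongly skewed population (here an exponential distribution with mean 1, chosen only for illustration), the means of repeated samples are approximately normal with standard deviation σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, repeats = 50, 10_000

# Population: exponential with mean 1 (strongly skewed).
sample_means = rng.exponential(scale=1.0, size=(repeats, n)).mean(axis=1)

print(sample_means.mean())        # close to the population mean 1
print(sample_means.std(ddof=1))   # close to 1 / sqrt(50) ≈ 0.141
```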



13. Chi-Square Goodness-of-Fit Test


[Figure: chi-square goodness-of-fit test]
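A minimal sketch with scipy.stats.chisquare on hypothetical die-roll counts (H0: the die is fair):

```python
from scipy import stats

observed = [25, 17, 15, 23, 24, 16]   # hypothetical counts from 120 rolls
expected = [20] * 6                   # 120 rolls / 6 faces under H0

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)   # reject H0 (the die is fair) if p < alpha, e.g., 0.05
```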



14. Chi-Square Independence Test


[Figure: chi-square independence test]
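A minimal sketch with scipy.stats.chi2_contingency on a hypothetical 2x2 table (group x outcome):

```python
from scipy import stats

table = [[30, 10],    # hypothetical contingency table:
         [20, 25]]    # rows = groups, columns = outcomes

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)   # H0: the two categorical variables are independent
print(expected)       # expected counts under independence
```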



15. Characteristics of the t-Distribution

⑴ The t-distribution has fatter tails than the standard normal distribution, and approaches it as the degrees of freedom increase
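A quick numerical illustration: the upper-tail probability P(T > 2.5) for the t-distribution shrinks toward the normal value as the degrees of freedom grow.

```python
from scipy import stats

for df in (3, 10, 30, 100):
    print(f"t, df={df:>3}:", stats.t.sf(2.5, df))   # fatter tails for small df
print("standard normal:", stats.norm.sf(2.5))
```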



16. Considerations when Formulating the Alternative Hypothesis

⑴ There should be potential for refutation



17. Parametric Testing vs. Non-Parametric Testing

⑴ Parametric Testing

① Generally used when the sample distribution follows (approximately) a normal distribution

② The p-value is calculated using distributional parameters (e.g., mean and variance)

⑵ Non-Parametric Testing

① Generally used when the sample distribution does not follow a normal distribution

② The p-value is calculated without distributional parameters (e.g., rank-based tests)
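A minimal sketch comparing a parametric and a non-parametric test on the same hypothetical two samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=30)   # hypothetical group A
b = rng.normal(loc=11.0, scale=2.0, size=30)   # hypothetical group B

# Parametric: two-sample t-test (assumes approximately normal populations).
print(stats.ttest_ind(a, b))

# Non-parametric: Mann-Whitney U test (rank-based, no normality assumption).
print(stats.mannwhitneyu(a, b, alternative="two-sided"))
```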



18. Reasons for One-Tailed Testing

⑴ Situation : When there is prior confidence that the effect cannot lie in one of the two directions

⑵ Advantages : Can reduce Type II errors

① Type II Error : Failing to reject the null hypothesis when the alternative hypothesis is true

② From the experimenter's perspective, this increases the chance of reaching the desired conclusion

Disadvantage 1. Can lead to incorrect statistical conclusions: p-value is underestimated

Disadvantage 2. The directional assumption must be justified (one must convince oneself, and others, that the effect cannot go the other way)
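A minimal sketch (hypothetical data, assuming a scipy version that supports the `alternative` argument) showing that the one-tailed p-value is roughly half of the two-tailed one for the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=10.0, scale=2.0, size=20)   # hypothetical control group
treated = rng.normal(loc=11.0, scale=2.0, size=20)   # hypothetical treated group

# Two-tailed: H1 is "the means differ".
t2, p2 = stats.ttest_ind(treated, control)

# One-tailed: H1 is "the treated mean is greater"; for the same data the
# p-value is about half, so Type II errors are less likely, but the
# directional assumption must be justified in advance.
t1, p1 = stats.ttest_ind(treated, control, alternative="greater")

print(p2, p1)
```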



19. Experimental Design : Shoe Experiment

⑴ The left shoe and the right shoe differ considerably in shape and usage; the issue is how to compare them fairly



20. Experimental Design

⑴ Problem : Whether to experiment on a genetically diverse group or a genetically uniform group

⑵ Answer : The genetically diverse group

① Real populations are genetically diverse : conclusions drawn from a genetically uniform group are of limited practical validity

② Significant conclusions can still be obtained through follow-up tests : subsequent application of the hypothetico-deductive method



21. Power of the Test

⑴ Increasing the power means using statistical techniques that yield smaller p-values for the same data while α is held constant, i.e., a higher probability of rejecting a false null hypothesis

Example 1. T-distribution has higher power with higher degrees of freedom

① Higher degrees of freedom make the t-distribution more similar to the normal distribution

② Higher degrees of freedom result in a narrower t-distribution, leading to higher power

Example 2. Two-sample t-test has higher power than paired sample t-test

① Paired sample t-test : Essentially one variable. Degrees of freedom are n-1

② Two-sample t-test : Two variables. Degrees of freedom are n+m-2

③ Increasing degrees of freedom in the t-distribution leads to higher power, so two-sample t-test has higher power

Example 3. In the two-sample t-test, assuming equal variances gives higher power than not assuming equal variances (see the sketch at the end of this section)

① When there is the assumption of equal variances, degrees of freedom


$df = n_1 + n_2 - 2$


② When there is no assumption of equal variances, degrees of freedom


$df \approx \dfrac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$


Example 4. In regression analysis, F-test has higher power than t-test

Example 5. Performing ANCOVA on data that satisfy the parallelism assumption has higher power than separately comparing the y-intercepts of each group's regression line

① Comparing y-intercepts uses only the sample size of a single group at a time

② ANCOVA pools the error term across all groups, so the effective sample size is roughly twice that of a single group
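A minimal sketch for Example 3 (hypothetical data): the Welch-Satterthwaite degrees of freedom are never larger than the pooled degrees of freedom, which is why the equal-variance test has higher power when the assumption holds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(10.0, 2.0, size=15)   # hypothetical sample 1
y = rng.normal(12.0, 2.0, size=12)   # hypothetical sample 2

n1, n2 = len(x), len(y)
v1, v2 = x.var(ddof=1), y.var(ddof=1)

df_pooled = n1 + n2 - 2                                      # equal-variance t-test
df_welch = (v1/n1 + v2/n2) ** 2 / (                          # Welch-Satterthwaite
    (v1/n1) ** 2 / (n1 - 1) + (v2/n2) ** 2 / (n2 - 1))

print(df_pooled, df_welch)                     # df_welch <= df_pooled
print(stats.ttest_ind(x, y, equal_var=True))   # pooled (higher power if variances equal)
print(stats.ttest_ind(x, y, equal_var=False))  # Welch
```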



22. Reasons Not to Conduct Pairwise t-Tests When There Are Multiple Groups

Because Type I errors accumulate across the many pairwise comparisons, even groups with little real difference are likely to be declared significantly different
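A small worked example of the accumulation: if each of k independent tests is run at α = 0.05, the chance of at least one false positive is 1 − (1 − α)^k.

```python
alpha = 0.05
for k in (1, 3, 6, 10):                 # e.g., 4 groups -> 6 pairwise comparisons
    print(k, 1 - (1 - alpha) ** k)      # family-wise Type I error rate
```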



23. Assumptions of ANOVA Analysis

⑴ Normality : All data are sampled from populations that follow normal distributions

⑵ Independence : All data are independently sampled from populations

⑶ Homoscedasticity : All data, regardless of different means, are sampled from populations with the same variance
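A minimal sketch checking the assumptions and running a one-way ANOVA on three hypothetical groups with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(10.0, 2.0, size=25)   # hypothetical groups
g2 = rng.normal(11.0, 2.0, size=25)
g3 = rng.normal(12.0, 2.0, size=25)

print(stats.shapiro(g1))            # normality within a group
print(stats.levene(g1, g2, g3))     # homoscedasticity across groups

# One-way ANOVA: H0 is that all group means are equal.
print(stats.f_oneway(g1, g2, g3))
```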



24. Meaning of Robustness

Even under heteroscedasticity or non-normality, the statistical conclusion (rejecting or failing to reject the null hypothesis) does not change, provided the sample size is large or measurements are repeated within each category



25. Assumptions of Linear Correlation

⑴ Randomly sampled data

⑵ Each variable is sampled from populations that follow normal distributions

⑶ The relationship appears linear



26. Difference between Correlation and Regression Analysis

⑴ Correlation simply indicates the degree of the relationship between variables

⑵ Regression analysis models the dependent variable as a function of the independent variables (a directed relationship). Since the goal is prediction, it does not by itself establish an actual causal relationship
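A minimal sketch contrasting the two on the same hypothetical data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)   # hypothetical linear relationship

# Correlation: a symmetric measure of the strength of the linear relationship.
r, p = stats.pearsonr(x, y)
print(r, p)

# Regression: models y as a function of x for prediction; a good fit
# still does not by itself establish causation.
res = stats.linregress(x, y)
print(res.slope, res.intercept, res.rvalue ** 2)
```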



27. Assumptions of Regression Model

⑴ Y values are assumed to be measured from populations that satisfy normality and homoscedasticity

⑵ Independent variable X is assumed to be measured without error : Difficult to satisfy in practice

⑶ The dependent variable is assumed to be determined by the independent variable

⑷ The relationship between X and Y is assumed to be linear
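A rough sketch of checking assumptions ⑴ and ⑷ from the residuals of a fitted line (hypothetical data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=60)                          # hypothetical predictor
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=60)       # hypothetical response

res = stats.linregress(x, y)
residuals = y - (res.intercept + res.slope * x)

print(stats.shapiro(residuals))               # normality of residuals
order = np.argsort(x)
half = len(x) // 2
print(stats.levene(residuals[order[:half]],   # similar residual spread over the
                   residuals[order[half:]]))  # range of x (homoscedasticity)
```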



28. Meaning of Multicollinearity

In multiple linear regression, strong correlations among two or more independent variables inflate the standard errors of the regression coefficients, making the estimates unstable
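A minimal sketch computing variance inflation factors (VIF) with statsmodels on hypothetical predictors, where x2 is nearly a copy of x1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity.
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```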



29. Assumptions of ANCOVA

⑴ Homoscedasticity

⑵ Independence

⑶ Normality

⑷ Linearity between covariate and dependent variable

⑸ Parallelism



30. Steps in ANCOVA

⑴ 1st. Verify that the interaction between the independent variable and the covariate is not statistically significant

⑵ 2nd. Calculate the regression line of the dependent variable on the covariate

⑶ 3rd. Calculate regression lines for each level of the independent variable by adjusting the y-intercept to minimize the sum of squares

⑷ 4th. Calculate residuals for data within each level from the respective regression lines

⑸ 5th. Evaluate each level's regression line at the overall mean of the covariate to obtain an adjusted value for each level

⑹ 6th. Add each residual to its level's adjusted value to obtain the adjusted data

⑺ 7th. Perform ANOVA on the adjusted data
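In practice the steps above collapse into two model fits; a minimal sketch with statsmodels on hypothetical data (group, covariate x, response y):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 40
df = pd.DataFrame({
    "group": np.repeat(["control", "treated"], n),
    "x": rng.normal(50.0, 10.0, size=2 * n),          # covariate
})
df["y"] = (0.5 * df["x"]
           + np.where(df["group"] == "treated", 5.0, 0.0)
           + rng.normal(scale=3.0, size=2 * n))

# Step 1: parallelism check - the group x covariate interaction should not be significant.
full = smf.ols("y ~ C(group) * x", data=df).fit()
print(sm.stats.anova_lm(full, typ=2))

# Steps 2-7 in one model: with a common slope for x, the test of the group
# effect compares covariate-adjusted means (ANOVA on adjusted data).
ancova = smf.ols("y ~ C(group) + x", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))
```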



31. Meaning of Parallelism

⑴ For example, when calculating regression lines for contaminated mining areas and non-contaminated areas, it refers to the assumption that the slopes are the same

⑵ If parallelism is not satisfied, the group difference evaluated at a single selected covariate value (e.g., the overall mean age) cannot represent the difference across the entire range of the covariate



32. Why Overfitting Should Be Avoided in Machine Learning

⑴ Overfitting incorporates inaccuracies in the sample, leading to reduced predictive accuracy

⑵ In practical machine learning, error is therefore intentionally introduced at each step (e.g., regularization or added noise) to prevent overfitting
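A minimal sketch of ⑴ with numpy only (hypothetical linear data plus noise): a high-degree polynomial fits the training noise closely but tends to predict new data worse than a simple linear fit.

```python
import numpy as np

rng = np.random.default_rng(7)
x_train, x_test = rng.uniform(0, 1, 15), rng.uniform(0, 1, 100)
y_train = 2.0 * x_train + rng.normal(scale=0.3, size=15)    # hypothetical data
y_test = 2.0 * x_test + rng.normal(scale=0.3, size=100)

for degree in (1, 10):
    coef = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    # The flexible fit typically has lower training error but higher test error.
    print(degree, round(train_err, 4), round(test_err, 4))
```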



Input : 2019-12-10 00:07
