Chapter 14. Statistical Test
1. terminology
2. Neyman-Pearson lemma
3. generalized likelihood ratio test
4. p value
5. types of statistical tests
1. terminology
⑴ test
① definition: determining whether a hypothesis is statistically significant
○ application 1. randomization check (balance test): verifying that random sampling (or random assignment) was carried out properly
○ application 2. causal effect: verifying that a particular treatment produces a significant change
② test statistic: a statistic that summarizes the n-dimensional information of the sample (state space) in one dimension for use in a statistical test
○ examples: Z, T, χ², F, etc.
○ being reducible to one dimension is important when the size of the critical region is held constant
③ parametric test
○ definition: testing hypotheses about parameters by means of test statistics
○ in general, the population distribution is assumed to be normal: the central limit theorem is used to justify this assumption
○ in practice, applying a parametric test to a sample even when the above assumption does not hold exactly is usually not a serious problem
④ Non-parametric test
○ Definition: A method of testing characteristics of the population through test statistics without assuming a parametric form of the distribution.
○ Used when the population distribution cannot be specified (distribution-free method).
○ Compared to parametric methods, the calculation of statistics is simpler and more intuitive to understand.
○ Less affected by outliers.
○ However, the resulting test statistics are often less reliable (the tests tend to have lower power) than their parametric counterparts.
⑵ hypothesis
① null hypothesis (H0): a hypothesis to be tested directly
② alternative hypothesis (H1): a hypothesis to be accepted when null hypothesis is rejected
○ Also known as the research hypothesis.
③ characteristics: for parameter θ,
○ H0 : Θ0 = {θ0, θ0′, θ0′′, ···}
○ H1 : Θ1 = {θ1, θ1′, θ1′′, ···}
○ characteristic 1. p(θ ∈ Θ0 or θ ∈ Θ1) = 1
○ characteristic 2. p(θ ∈ Θ0 and θ ∈ Θ1) = 0
④ classification
○ simple hypothesis: the cases Θ0 = {θ0}, Θ1 = {θ1}, ···, i.e. the hypothesis specifies a single parameter value
○ composite hypothesis: any hypothesis that is not a simple hypothesis
○ example: H0 : θ ≤ θ0 and H1 : θ > θ0 are both composite hypotheses
⑶ introduction of critical region
① state space = critical region + acceptance region
○ critical region (rejection region): the range of test-statistic values that leads to rejection of the null hypothesis
○ sample ∈ critical region: H0 is rejected
○ sample ∉ critical region: H0 is not rejected
② power function πC(θ): the probability that the sample falls in the critical region when the critical region is C and the parameter is θ
πC(θ) = P(X ∈ C | θ), X = (X1, ···, Xn)
③ an example of power function
○ p(x) = 4x³ / θ⁴ · I{0 < x < θ}
○ C = {x | x ≤ 0.5 or x > 1}
○ θ ≤ 0.5 : πC(θ) = 1
○ 0.5 < θ ≤ 1 : πC(θ) = P(X ≤ 0.5) = (0.5 / θ)⁴ = 1 / (16θ⁴)
○ 1 < θ : πC(θ) = P(X ≤ 0.5) + P(X > 1) = 1 / (16θ⁴) + 1 − 1/θ⁴ = 1 − 15 / (16θ⁴)
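As a quick check on this example, the sketch below (assuming a single observation X, which is how C is written above) integrates the density over C numerically and compares the result with the closed-form cases listed above:

```python
# Numerical check of the power function pi_C(theta) = P(X in C | theta) for the
# example above, assuming one observation X with density 4x^3 / theta^4 on (0, theta)
# and C = {x <= 0.5 or x > 1}.
from scipy.integrate import quad

def power(theta):
    pdf = lambda x: 4 * x**3 / theta**4
    p_low = quad(pdf, 0.0, min(0.5, theta))[0]                   # P(X <= 0.5), clipped to the support
    p_high = quad(pdf, 1.0, theta)[0] if theta > 1.0 else 0.0    # P(X > 1)
    return p_low + p_high

def closed_form(theta):
    if theta <= 0.5:
        return 1.0
    if theta <= 1.0:
        return (0.5 / theta) ** 4
    return 1.0 - 15.0 / (16.0 * theta**4)

for theta in [0.3, 0.5, 0.8, 1.0, 1.5, 2.0]:
    print(f"theta={theta:4.2f}  numeric={power(theta):.4f}  closed form={closed_form(theta):.4f}")
```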
④ size of critical region (size of test): the maximum probability, under the null hypothesis, that the sample falls in the critical region
size of C = maxθ∈Θ0 πC(θ)
⑤ power: the probability that the sample falls in the critical region when the alternative hypothesis is true; equivalently, the probability that the null hypothesis is rejected when the alternative hypothesis is true
power = πC(θ1), θ1 ∈ Θ1
⑥ error : making a wrong statistical conclusion
○ ideal critical region
πC(θ) = 0 for every θ ∈ Θ0 and πC(θ) = 1 for every θ ∈ Θ1
○ type Ⅰ error
○ definition: the error of rejecting the null hypothesis when the null hypothesis is true
○ condition: defined when the null hypothesis is a simple hypothesis
○ the probability of type Ⅰ error (α) = the size of the critical region
○ significance level: 10%, 5%, 1%, etc
○ confidence level: 90%, 95%, 99%, etc
○ type Ⅱ error
○ definition: the error of accepting (failing to reject) the null hypothesis when the alternative hypothesis is true
○ condition: defined when the alternative hypothesis is a simple hypothesis
○ the probability of type Ⅱ error (β) = 1 - power
○ trade-off between α and β

○ the critical region takes the form of an interval above or below a specific cutoff (∵ Neyman-Pearson lemma)
○ α and β cannot both be reduced at the same time: moving the cutoff to lower one of them raises the other (see the illustration below)
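A small numerical illustration of this trade-off (not from the original text: it assumes a one-sided Z-test of H0: μ = 0 against H1: μ = 1 based on the sample mean, with n = 9 and σ = 1):

```python
# alpha-beta trade-off for a one-sided Z-test on the mean of N(mu, 1), n = 9:
# reject H0: mu = 0 in favor of H1: mu = 1 when Xbar >= c.
# Raising the cutoff c lowers alpha (type I error) but raises beta (type II error).
import numpy as np
from scipy.stats import norm

n, mu0, mu1, sigma = 9, 0.0, 1.0, 1.0
se = sigma / np.sqrt(n)                      # standard error of Xbar

for c in [0.3, 0.5, 0.7, 0.9]:               # cutoffs for Xbar
    alpha = norm.sf(c, loc=mu0, scale=se)    # P(Xbar >= c | H0)
    beta = norm.cdf(c, loc=mu1, scale=se)    # P(Xbar <  c | H1)
    print(f"cutoff={c:.1f}  alpha={alpha:.3f}  beta={beta:.3f}  power={1 - beta:.3f}")
```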
⑷ comparison of critical region
① criterion: among critical regions of the same size, the one with greater power is better
② more powerful test: for a specific θ1 ∈ Θ1 and two critical regions C1, C2 of the same size,
πC1(θ1) ≥ πC2(θ1) (C1 is at least as powerful as C2)
③ most powerful test: for a specific θ1 ∈ Θ1 and any critical region C of the same size,
πC*(θ1) ≥ πC(θ1)
④ uniformly most powerful test: for any θ ∈ Θ1 and any critical region C of the same size,
πC*(θ) ≥ πC(θ)
2. Neyman-Pearson lemma
⑴ idea
① premise: H0 : θ = θ0, H1 : θ = θ1 (simple hypothesis)
② question : finding a critical region that maximizes the power when the size of the critical region is constant
③ thought experiment: points x of the state space are added to the critical region C one by one
○ adding a point x to C increases both the size (by p(x, θ0)) and the power (by p(x, θ1))
○ p(x, θ0): a kind of cost; increasing it enlarges the size of the critical region
○ p(x, θ1): a kind of benefit; increasing it enlarges the power
④ conclusion
○ line-up strategy: it is advantageous to add points x to the critical region in decreasing order of the ratio p(x, θ1) / p(x, θ0) (see the sketch below)
○ the critical region produced by this strategy, C = {x | p(x, θ1) / p(x, θ0) ≥ k}, is the critical region of the most powerful test
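A minimal sketch of the line-up strategy on a small discrete state space (the two pmfs and the target size below are illustrative assumptions, not from the text):

```python
# Line-up strategy (Neyman-Pearson idea) on a discrete state space:
# add points to the critical region in decreasing order of the likelihood ratio
# p(x, theta1) / p(x, theta0) while the size (probability under H0) stays within the target.
p0 = {0: 0.50, 1: 0.25, 2: 0.15, 3: 0.07, 4: 0.03}   # p(x, theta0), null (illustrative)
p1 = {0: 0.10, 1: 0.15, 2: 0.25, 3: 0.25, 4: 0.25}   # p(x, theta1), alternative (illustrative)
target_size = 0.10                                   # desired size of the test

order = sorted(p0, key=lambda x: p1[x] / p0[x], reverse=True)   # line-up order

C, size, power = [], 0.0, 0.0
for x in order:
    if size + p0[x] > target_size:
        break                                        # adding x would exceed the target size
    C.append(x)
    size += p0[x]                                    # cost: size under H0
    power += p1[x]                                   # benefit: power under H1

print(f"critical region C = {sorted(C)}, size = {size:.2f}, power = {power:.2f}")
```

In a continuous state space the same ordering yields the region {x | p(x, θ1)/p(x, θ0) ≥ k} described above; with discrete distributions, hitting an exact target size may additionally require randomization on the boundary point, which this greedy sketch omits.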
⑵ lemma
① premise : H0, H1 are simple hypotheses
② statement: for any k ∈ ℝ, taking the following critical region gives a most powerful test
C* = {x | λ(x) ≥ k}, where λ(x) = ℒ(θ1 | x) / ℒ(θ0 | x)
○ ℒ : likelihood function
○ likelihood ratio test (LR test): a test that rejects H0 when λ(x) ≥ k
○ determination of the critical region: to know the exact form of the critical region (i.e. the value of k), the size of the critical region must be given
○ every x satisfying λ(x) ≥ k is included in the critical region C*
○ every x satisfying λ(x) < k is excluded from the critical region C*
③ application
○ as only the ordering induced by p(x, θ1) ÷ p(x, θ0) matters, the following conversion by a monotone increasing function f(·) is allowed
{x | λ(x) ≥ k} = {x | f(λ(x)) ≥ f(k) = k′}
○ constant terms involving θ0, θ1, n, etc. are easily removed in this way
○ point: such modifications of the critical region are allowed as long as the existence of k′ is ensured
○ determination of the critical region: to know the exact form of the critical region, the size of the critical region must be given
⑶ proof
① assumption: C* and C have the same size

② definition of C*

③ conclusion: C* is the critical region of the most powerful test

⑷ example 1.
① X1, ···, Xn ~ Bernoulli(θ)

② H0 : θ = θ0, H1 : θ = θ1 > θ0
③ likelihood ratio test

④ Z-test (significance level: α)
○ θ1 > θ0 : one-tailed test

○ θ1 < θ0 : one-tailed test

○ a uniformly most powerful critical region does not exist here, because the direction of the most powerful critical region depends on whether θ1 is greater than or less than θ0
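To make the reduction explicit, here is one way the likelihood-ratio computation in this example can be written out (a sketch consistent with the setup X1, ···, Xn ~ Bernoulli(θ) and θ1 > θ0; k and k′ are the generic cutoffs used throughout this section):

```latex
\lambda(\mathbf{x})
  = \frac{\prod_{i=1}^{n} \theta_1^{x_i}(1-\theta_1)^{1-x_i}}
         {\prod_{i=1}^{n} \theta_0^{x_i}(1-\theta_0)^{1-x_i}}
  = \left(\frac{\theta_1}{\theta_0}\right)^{\sum_i x_i}
    \left(\frac{1-\theta_1}{1-\theta_0}\right)^{n-\sum_i x_i} \ge k
  \;\Longleftrightarrow\;
  \sum_{i=1}^{n} x_i \ge k'
```

The equivalence holds because, for θ1 > θ0, the ratio is a monotone increasing function of Σ xi; by the central limit theorem the critical region can then be expressed through Z = (X̄ − θ0) / √(θ0(1 − θ0)/n), which gives the one-tailed Z-test. For θ1 < θ0 the inequality flips, which is why the two cases lead to critical regions in opposite directions.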
⑸ example 2.
① X1, ···, Xn ~ N(μ, 1²)

② H0 : μ = μ0, H1 : μ = μ1 > μ0
③ likelihood ratio test

④ Z-test: one-tailed test (significance level: α)

⑹ generalization 1. the form of the critical region is unchanged even when H1 is a composite hypothesis, provided the critical region does not depend on the specific value of θ1
① X1, ···, Xn ~ Bernoulli(θ)
② H0 : θ = θ0, H1 : θ > θ0
⑺ generalization 2. building on generalization 1, the form of the critical region is also unchanged when the null hypothesis is a composite hypothesis with boundary θ0, because the type I error probability α is maximized at θ = θ0
① X1, ···, Xn ~ Bernoulli(θ)
② H0 : θ < θ0, H1 : θ > θ0
3. generalized likelihood ratio test
⑴ definition
① limitation of the Neyman-Pearson lemma: in principle, both the null hypothesis and the alternative hypothesis must be simple hypotheses
② GLR test (generalized likelihood ratio test)
λ(x) = maxθ∈Θ ℒ(θ | x) ÷ maxθ∈Θ0 ℒ(θ | x), reject H0 when λ(x) ≥ k
③ the maximization max ℒ(θ | x) is carried out with the maximum likelihood method (ML)
④ this method has been shown to produce statistically reasonable critical regions
⑵ example 1. Xi ~ N(μ, σ²), σ² is known
① H0 : μ = μ0, H1 : μ ≠ μ0
② generalized likelihood ratio test

③ τ-test: one-tailed test (significance level: α)

④ Z-test: two-tailed test (significance level: α)

⑤ it can be shown that even if Xi does not follow a normal distribution, the above method still applies approximately (for large n)
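A minimal sketch of the resulting two-tailed Z-test with σ² known (the sample, μ0, σ, and α below are illustrative):

```python
# Two-tailed Z-test of H0: mu = mu0 vs H1: mu != mu0 with sigma^2 known.
# Reject H0 when |Z| = sqrt(n) * |Xbar - mu0| / sigma >= z_{alpha/2}.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma, alpha = 0.0, 1.0, 0.05
x = rng.normal(loc=0.4, scale=sigma, size=30)        # illustrative sample

z = np.sqrt(len(x)) * (x.mean() - mu0) / sigma       # test statistic
z_crit = norm.ppf(1 - alpha / 2)                     # two-tailed critical value
p_value = 2 * norm.sf(abs(z))                        # two-sided p value

print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, p = {p_value:.4f}")
print("reject H0" if abs(z) >= z_crit else "fail to reject H0")
```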
⑶ example 2. Xi ~ N(μ, σ²), σ² is unknown
① H0 : μ = μ0, H1 : μ ≠ μ0
② generalized likelihood ratio test

③ F-test: one-tailed test (significance level: α)

④ T-test: two-tailed test (significance level: α)

⑷ example 3. Xi ~ N(μ, σ²), σ² is unknown
① H0 : μ = μ0, H1 : μ > μ0
② generalized likelihood ratio test

③ key assumptions
○ samples with Xavg ≥ μ0 have a larger likelihood ratio than samples with Xavg < μ0, so the former have higher priority in the line-up strategy
○ since the significance level used in practice is at most about 0.10 (typically 0.025, 0.05, or 0.10), it is sufficient to consider only the region Xavg ≥ μ0, which covers half of the possible outcomes
④ T-test: one-tailed test (significance level: α)

⑤ the same logic applies when H1 : μ < μ0
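A short sketch of the T-tests in examples 2 and 3 using scipy (the sample and μ0 are illustrative; the one-tailed `alternative` argument requires SciPy ≥ 1.6):

```python
# One-sample T-test of H0: mu = mu0 when sigma^2 is unknown;
# T = sqrt(n) * (Xbar - mu0) / S with S the sample standard deviation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0 = 5.0
x = rng.normal(loc=5.6, scale=2.0, size=25)              # illustrative sample

# two-tailed test (example 2): H1: mu != mu0
t_stat, p_two = stats.ttest_1samp(x, popmean=mu0)
# one-tailed test (example 3): H1: mu > mu0
_, p_one = stats.ttest_1samp(x, popmean=mu0, alternative='greater')

print(f"T = {t_stat:.3f}, two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")
```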
⑸ example 4. Xi ~ N(μ, σ²), μ is unknown
① H0 : σ² = σ0², H1 : σ² ≠ σ0²
② generalized likelihood ratio test

③ setting the critical region
○ f(τ) is a convex function with its minimum at τ = n
○ condition 1. P(τ ≥ k’ | H0) + P(τ ≤ k’’ | H0) = α
○ condition 2. f(k’) = f(k’’)

④ τ-test: two-tailed test (significance level: α)
○ numerical analysis is required to set an ideal critical region
○ in practice, simpler critical regions are used
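A sketch of the simpler critical region mentioned above, using the statistic τ = Σ(Xi − X̄)² / σ0² = (n − 1)S²/σ0², which follows χ²(n − 1) under H0 (the data, σ0², and α are illustrative; the equal-tail cutoffs are the practical simplification, not the exact GLR region):

```python
# Chi-square test for a variance: H0: sigma^2 = sigma0^2 vs H1: sigma^2 != sigma0^2.
# Statistic tau = (n - 1) * S^2 / sigma0^2 ~ chi2(n - 1) under H0.
# Equal-tail cutoffs at alpha/2 in each tail (the "simpler critical region").
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
sigma0_sq, alpha = 4.0, 0.05
x = rng.normal(loc=0.0, scale=2.5, size=20)          # illustrative sample

n = len(x)
tau = (n - 1) * x.var(ddof=1) / sigma0_sq            # test statistic
lower, upper = chi2.ppf([alpha / 2, 1 - alpha / 2], df=n - 1)
p_value = 2 * min(chi2.cdf(tau, df=n - 1), chi2.sf(tau, df=n - 1))

print(f"tau = {tau:.2f}, cutoffs = ({lower:.2f}, {upper:.2f}), p = {p_value:.4f}")
print("reject H0" if (tau < lower or tau > upper) else "fail to reject H0")
```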

⑹ example 5. special likelihood ratio test
① definition
○ in the case that Xi ~ N(μ, σ²) and σ² is known, 2 ln λ ~ χ²(1)
○ if the sample size n is large enough, the following holds approximately, where k is the number of parameters constrained by the null hypothesis
2 ln λ ~ χ²(k)
② τ-test: one-tailed test (significance level: α)

③ supplements
○ some statisticians only refer to these tests as the likelihood ratio test (LR test)
○ some statisticians define the statistic as −2 ln λ = 2 ln ℒ(H1) − 2 ln ℒ(H0), i.e. with λ = ℒ(H0) / ℒ(H1)
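A generic sketch of the large-sample LR test described above: given the maximized log-likelihoods under H0 and H1 (placeholder values here), 2 ln λ is compared with the upper tail of χ²(k):

```python
# Large-sample likelihood ratio test: 2 ln(lambda) = 2 * (ll_full - ll_restricted)
# is compared with chi2(k), where k is the number of parameters fixed by H0.
# The log-likelihood values and k below are placeholders.
from scipy.stats import chi2

ll_restricted = -1312.4      # maximized log-likelihood under H0 (placeholder)
ll_full = -1308.1            # maximized log-likelihood under H1 (placeholder)
k = 2                        # number of parameters restricted by H0 (placeholder)

lr_stat = 2 * (ll_full - ll_restricted)      # 2 ln(lambda)
p_value = chi2.sf(lr_stat, df=k)             # one-tailed (upper) p value

print(f"2 ln(lambda) = {lr_stat:.2f}, p = {p_value:.4f}")
```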
4. p value
⑴ definition: the probability, computed under the null hypothesis, of obtaining a result at least as extreme as the observed sample
① caution: the p value is not the probability that the null hypothesis is true, although it is often misread that way
② rejecting H0 when the test statistic falls in the critical region and rejecting H0 when the p value is less than α are equivalent (each is a necessary and sufficient condition for the other)
③ A strict definition
⑵ calculation: θ* denotes the observed value of the test statistic
① right-sided test: p value = P(θ ≥ θ*)
② left-sided test: p value = P(θ ≤ θ*)
③ symmetric distribution about μ: p value = P(|θ - μ| ≥ |θ* - μ|)
④ chi-squared distribution: if θ* is bigger than the median, p value = P(θ ≥ θ*). if θ* is smaller than the median, p value = P(θ ≤ θ*)
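The rules above, written out for a standard normal test statistic (the observed value θ* = 1.8 is illustrative):

```python
# p-value calculation rules for an observed statistic theta_star,
# illustrated with a standard normal null distribution (theta_star is made up).
from scipy.stats import norm

theta_star, mu = 1.8, 0.0

p_right = norm.sf(theta_star)                 # right-sided: P(theta >= theta*)
p_left = norm.cdf(theta_star)                 # left-sided:  P(theta <= theta*)
p_two = 2 * norm.sf(abs(theta_star - mu))     # symmetric:   P(|theta - mu| >= |theta* - mu|)

print(f"right = {p_right:.4f}, left = {p_left:.4f}, two-sided = {p_two:.4f}")
```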
⑶ power and p value
① the main issues in classical statistics are finding distribution and increasing power
② strict meaning: with α held constant, higher power means a higher probability of rejecting the null hypothesis when the alternative hypothesis is true
③ meaning of keeping α constant
○ it means fixing a common cutoff (a "Maginot line") on each distribution obtained from the various statistical techniques
○ it also means that many outcomes other than the given sample are treated as consistent with the null hypothesis, even though they do not necessarily indicate that the null hypothesis is true
④ meaning of increasing 1 − β: making the "Maginot line" more extreme across the various statistical techniques
⑤ intuitive meaning: higher power means using statistical techniques that tend to yield a smaller p value for the same data when α is constant
⑥ example 1. for the same sample, the F statistic yields a smaller p value than the t statistic → higher power
⑦ example 2. the t-distribution becomes narrower as the degree of freedom increases → the power increases
⑧ different statistical techniques have different power: meaning that statistical conclusions may differ for the same statistical data
⑷ Example : correlation coefficient and p value.
① H0 : X and Y are not correlated
② meaning of the p value: the probability that a sample drawn from an uncorrelated population yields a correlation coefficient at least as large (in absolute value) as the observed one
③ assumptions for calculating the p value via the normal distribution
○ random sampling data
○ bivariate normal distribution: the two variables X and Y jointly follow a normal distribution
○ linear relationship: quadratic or cubic relationships are not suitable
○ if the three conditions above are not met, the p value must be calculated with a non-parametric test (see the sketch below)
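A sketch contrasting the parametric p value (Pearson, which relies on the assumptions above) with a rank-based non-parametric alternative (Spearman), on illustrative data:

```python
# p value for H0: X and Y are not correlated.
# pearsonr relies on the assumptions listed above (bivariate normality, linearity);
# spearmanr is a rank-based, non-parametric alternative.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)            # illustrative correlated data

r, p_pearson = pearsonr(x, y)
rho, p_spearman = spearmanr(x, y)
print(f"Pearson  r   = {r:.3f}, p = {p_pearson:.4f}")
print(f"Spearman rho = {rho:.3f}, p = {p_spearman:.4f}")
```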
⑸ Multiple testing problem
① Overview
○ Under the null hypothesis H0, the p-value follows a uniform distribution on [0, 1].
○ Proof: Under the null hypothesis, let the CDF of S be F0. If F0 is a non-decreasing function, then…
○ Problem definition: Suppose we test 1,000 hypotheses and reject the null hypothesis for each hypothesis with a p-value less than α = 0.05. How many null hypotheses would we expect to reject incorrectly? Approximately 50 (∵ 1000 × 0.05 = 50), so we cannot assume that all rejected hypotheses are significant (see the simulation sketch after this list).
○ Key Issue: Conducting multiple statistical tests inherently increases the likelihood of inaccurate conclusions.
○ Example: This problem is particularly relevant when identifying differentially expressed genes (DEGs) from sequencing data consisting of multiple genes.
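A simulation sketch of the problem definition above: 1,000 one-sample t-tests in which every null hypothesis is true still reject roughly 50 of them at α = 0.05 (the data-generating settings are illustrative):

```python
# Multiple testing problem: when all 1,000 null hypotheses are true,
# about 1000 * 0.05 = 50 of them are still rejected at alpha = 0.05,
# because p values are uniform on (0, 1) under H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, m, n = 0.05, 1000, 30

samples = rng.normal(loc=0.0, scale=1.0, size=(m, n))   # H0 (mu = 0) is true for every test
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

print(f"false rejections: {(p_values < alpha).sum()} out of {m} (expected about {int(m * alpha)})")
```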
② Solution 1: Controlling the Family-Wise Error Rate (FWER)
○ Definition: The probability of making at least one incorrect conclusion among all hypotheses.
○ For instance, a 5% FWER means that the probability of making even a single incorrect conclusion is less than or equal to 5%. This approach is very conservative and minimizes false positives.
○ FWER control is sometimes criticized for low power, in the sense that it leads to many type II errors.
○ Method 1. Šidák correction: Adjusts the alpha threshold instead of the p-values, using α_adjusted = 1 − (1 − α)^(1/d). Used when the p-values are independent.
○ d: Number of statistical tests
○ Method 2. Bonferroni correction: Adjusts the individual p-values directly, using p_adjusted = min(d × p, 1). Can be applied even if the p-values are not independent. Very conservative.
○ d: Number of statistical tests
○ Note: If the adjusted p-value exceeds 1, it is forcibly set to 1.
○ FWER proof at α
○ Let the number of statistical tests be m; independence is not required here, because the argument relies only on the union bound.
○ I0 denotes the fixed but unknown set of true null hypotheses, whose p-values tend to be relatively large.
○ Method 3. Holm (step-down) procedure
○ Step 1. Order the p-values, obtaining P(1) ≤ ··· ≤ P(m).
○ Step 2. Let R denote the smallest r ≥ 0 such that P(r+1) > α / (m-r).
○ Step 3. If R > 0, reject H(1), ···, H(R), where H(i) is associated with P(i).
○ FWER proof at α
○ Let the number of statistical tests be m; independence is not required here, because the argument relies only on the union bound.
○ I0 denotes the fixed but unknown set of true null hypotheses, whose p-values tend to be relatively large.
○ At the same α, Holm is more powerful than Bonferroni.
○ Method 4. Hochberg (step-up) procedure
○ Step 1. Order the p-values, obtaining P(1) ≤ ⋯ ≤ P(m).
○ Step 2. Let R denote the largest r ≥ 0 such that P(r) ≤ α / (m + 1 - r).
○ Step 3. If R > 0, reject H(1), ⋯, H(R), where H(i) is associated with P(i).
○ Intuitive understanding of FWER control at significance level α
○ Assume the number of statistical tests is m and that the tests are independent (the Hochberg procedure genuinely requires independence, or at least positive dependence, among the tests)
○ Let I0 be the fixed but unknown set of true null hypotheses, whose p-values tend to be relatively large
○ Note: The inequality m - j0 + 1 ≥ m0 does not necessarily hold, so the following derivation is for reference only
○ When using the same significance level α, Hochberg is more powerful than Holm
○ Intuition: Holm uses a “for all” condition, while Hochberg uses a “for some” condition
○ Method 5. Tukey-Kramer honestly significant difference (HSD; a range test)
○ This procedure applies for performing all pairwise comparisons in a multiple sample situation.
○ Null hypothesis: Hjk : μj = μk
○ J samples (Yij : i = 1, ···, nj), j = 1, ···, J
○ N = n1 + ··· + nJ
○ μj: Population mean of the group j
○ Statistic: At level α, Tukey-Kramer rejects Hjk if |Ȳj − Ȳk| / √((S²/2)(1/nj + 1/nk)) ≥ q(α; J, N − J), where S² is the pooled within-group variance and q(α; J, N − J) is the upper-α quantile of the studentized range distribution
○ Theory: When the samples are independent, normal and with same variance, and when the sample sizes are equal, Tukey-Kramer controls the FWER exactly at level α.
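A sketch of all pairwise comparisons with the Tukey-Kramer procedure via statsmodels' pairwise_tukeyhsd (the three groups below are illustrative data):

```python
# Tukey-Kramer HSD: all pairwise comparisons of group means with FWER control.
# The three groups below are illustrative data.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0.0, 1, 20),
                    rng.normal(0.5, 1, 20),
                    rng.normal(1.0, 1, 20)])
groups = np.repeat(["A", "B", "C"], 20)

result = pairwise_tukeyhsd(endog=y, groups=groups, alpha=0.05)
print(result.summary())          # pairwise mean differences, confidence intervals, reject flags
```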
③ Solution 2: Controlling the False Discovery Rate (FDR)
○ Overview
○ Definition: Limits the expected proportion of false discoveries among the rejected null hypotheses to a certain level.
○ FWER control implies FDR control (at the same level α).
○ By considering the p-value distributions under both H0 and H1, a less conservative statistical test can be performed.
○ Method 1. Benjamini-Hochberg (B&H): Suitable when the correlations among the tests are simple (independent or positively dependent).
○ Step 1. Order the p-values, obtaining P(1) ≤ ⋯ ≤ P(m).
○ Step 2. Let R denote the largest r such that P(r) ≤ rα / m.
○ Step 3. If R > 0, reject H(1), ⋯, H(R), where H(i) is associated with P(i).
○ Like the Hochberg procedure, this is a step-up procedure (it starts from the least significant p-value), but the thresholds are quite different.
○ Hochberg compares P(j) with α / (m - j + 1).
○ Benjamini-Hochberg compares P(j) with jα / m.
○ An Intuitive Understanding of FDR Proof at Significance Level α
○ Assuming Independence Among Statistical Tests
○ Adjusted p value: p_adjusted = p × (d / rank)
○ d: Number of statistical tests
○ rank: Sorting order of p-values
○ Note: If the adjusted p-value exceeds 1, it is forcibly set to 1.
○ The lower the rank (e.g., rank = 1), the lower the adjusted p-value should be; if this monotonicity is violated, a final step enforces it (compare the "Initial" and "Final" adjusted p-values in the table below).
○ Example: For significance level α, a total of d = 25 tests, and the i-th smallest p-value p(i):
Gene | p-val | Rank | Initial Adj p-val | Final Adj p-val |
---|---|---|---|---|
A | 0.039 | 3 | 0.039 × (25/3) = 0.325 | 0.21 |
B | 0.001 | 1 | 0.001 × (25/1) = 0.025 | 0.025 |
C | 0.041 | 4 | 0.041 × (25/4) = 0.256 | 0.21 |
D | 0.042 | 5 | 0.042 × (25/5) = 0.21 | 0.21 |
E | 0.008 | 2 | 0.008 × (25/2) = 0.1 | 0.1 |
… | … | … | … | … |
Table 1. Example of B&H Test with 25 Genes
○ Method 2. Benjamini–Yekutieli (B&Y): Suitable for cases with complex correlations among tests.
○ Whether the tests are independent or not, Benjamini-Yekutieli controls the FDR at α.
○ Adjusted p value: p_adjusted = p × (d / rank) × ∑i=1..d (1/i)
○ d: Number of statistical tests
○ rank: Sorting order of p-values
○ ∑i=1..d (1/i): Adjustment constant that controls the FDR more conservatively by accounting for possible correlations among the tests.
○ Note: If the adjusted p-value exceeds 1, it is forcibly set to 1.
④ Adjusted p-value: Introduced to apply the same significance level α across different correction methods.
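A sketch comparing adjusted p-values across the correction methods above using statsmodels.stats.multitest.multipletests (the raw p-values are illustrative; 'fdr_bh' and 'fdr_by' correspond to the B&H and B&Y adjustments, 'bonferroni', 'sidak', and 'holm' to the FWER methods):

```python
# Adjusted p values under several of the correction methods discussed above.
# The raw p values are illustrative.
from statsmodels.stats.multitest import multipletests

p_raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.930]

for method in ["bonferroni", "sidak", "holm", "fdr_bh", "fdr_by"]:
    reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method=method)
    print(f"{method:10s} rejected: {int(reject.sum())}  "
          f"adjusted: {[round(float(p), 3) for p in p_adj]}")
```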
5. Types of Statistical Tests
⑴ Comprehensive Summary of Statistical Test Examples
⑸ Run Test
⑹ Fisher Exact Test (hypergeometric test)
⑻ Cochran-Mantel-Haenszel (CMH) Test
Input: 2019.06.19 14:52