Chapter 14. Statistical Test
1. terminology
2. Neyman-Pearson lemma
3. generalized likelihood ratio test
4. p value
5. types of statistical tests
1. terminology
⑴ test
① definition: determining whether a hypothesis is statistically significant
○ application 1. randomization check (balance test): verifying that random sampling (or random assignment) was carried out properly
○ application 2. causal effect: verifying that a particular treatment produces a significant change
② test statistic: a statistic that summarizes the n-dimensional information of the sample (state space) in one dimension for use in a statistical test
○ examples: Z, T, χ², F, etc.
○ being reducible to one dimension is important when the size of the critical region is held constant
③ parametric test
○ definition: testing hypotheses about parameters by means of test statistics
○ in general, the population distribution is assumed to be normal: the central limit theorem is used to justify this assumption
○ in practice, applying a parametric test to a sample even when the above assumption does not hold exactly is usually not a serious problem
④ Non-parametric test
○ Definition: A method of testing characteristics of the population through test statistics without assuming a parametric form of the distribution.
○ Used when the population distribution cannot be specified (distribution-free method).
○ Compared to parametric methods, the calculation of statistics is simpler and more intuitive to understand.
○ Less affected by outliers.
○ However, the resulting test statistics are often less reliable (the tests tend to have lower power) than their parametric counterparts.
⑵ hypothesis
① null hypothesis (H0): a hypothesis to be tested directly
② alternative hypothesis (H1): a hypothesis to be accepted when null hypothesis is rejected
○ Also known as the research hypothesis.
③ characteristics: for parameter θ,
○ H0 : Θ0 = {θ0, θ0′, θ0′′, ···}
○ H1 : Θ1 = {θ1, θ1′, θ1′′, ···}
○ characteristic 1. p(θ ∈ Θ0 or θ ∈ Θ1) = 1
○ characteristic 2. p(θ ∈ Θ0 and θ ∈ Θ1) = 0
④ classification
○ simple hypothesis: the cases Θ0 = {θ0}, Θ1 = {θ1}, ···, i.e. the hypothesis specifies a single parameter value
○ composite hypothesis: any hypothesis that is not a simple hypothesis
○ example: H0 : θ ≤ θ0 and H1 : θ > θ0 are both composite hypotheses
⑶ introduction of critical region
① state space = critical region + acceptance region
○ critical region (rejection region): the range of test-statistic values that leads to rejection of the null hypothesis
○ sample ∈ critical region: H0 is rejected
○ sample ∉ critical region: H0 is not rejected
② power function πC(θ): the probability that the sample falls in the critical region when the critical region is C and the parameter is θ
πC(θ) = P(X ∈ C | θ), X = (X1, ···, Xn)
③ an example of power function
○ p(x) = 4x³ / θ⁴ · I{0 < x < θ}
○ C = {x | x ≤ 0.5 or x > 1}
○ θ ≤ 0.5 : πC(θ) = 1
○ 0.5 < θ ≤ 1 : πC(θ) = P(X ≤ 0.5) = (0.5 / θ)⁴ = 1 / (16θ⁴)
○ 1 < θ : πC(θ) = P(X ≤ 0.5) + P(X > 1) = 1 / (16θ⁴) + 1 − 1/θ⁴ = 1 − 15 / (16θ⁴)
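As a quick check on this example, the sketch below (assuming a single observation X, which is how C is written above) integrates the density over C numerically and compares the result with the closed-form cases listed above:

```python
# Numerical check of the power function pi_C(theta) = P(X in C | theta) for the
# example above, assuming one observation X with density 4x^3 / theta^4 on (0, theta)
# and C = {x <= 0.5 or x > 1}.
from scipy.integrate import quad

def power(theta):
    pdf = lambda x: 4 * x**3 / theta**4
    p_low = quad(pdf, 0.0, min(0.5, theta))[0]                   # P(X <= 0.5), clipped to the support
    p_high = quad(pdf, 1.0, theta)[0] if theta > 1.0 else 0.0    # P(X > 1)
    return p_low + p_high

def closed_form(theta):
    if theta <= 0.5:
        return 1.0
    if theta <= 1.0:
        return (0.5 / theta) ** 4
    return 1.0 - 15.0 / (16.0 * theta**4)

for theta in [0.3, 0.5, 0.8, 1.0, 1.5, 2.0]:
    print(f"theta={theta:4.2f}  numeric={power(theta):.4f}  closed form={closed_form(theta):.4f}")
```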
④ size of critical region (size of test): the maximum probability, under the null hypothesis, that the sample falls in the critical region
size of C = maxθ∈Θ0 πC(θ)
⑤ power: the probability that the sample falls in the critical region when the alternative hypothesis is true; equivalently, the probability that the null hypothesis is rejected when the alternative hypothesis is true
power = πC(θ1), θ1 ∈ Θ1
⑥ error : making a wrong statistical conclusion
○ ideal critical region
πC(θ) = 0 for every θ ∈ Θ0 and πC(θ) = 1 for every θ ∈ Θ1
○ type Ⅰ error
○ definition: the error of rejecting the null hypothesis when the null hypothesis is true
○ condition: defined when the null hypothesis is a simple hypothesis
○ the probability of type Ⅰ error (α) = the size of the critical region
○ significance level: 10%, 5%, 1%, etc
○ confidence level: 90%, 95%, 99%, etc
○ type Ⅱ error
○ definition: the error of accepting (failing to reject) the null hypothesis when the alternative hypothesis is true
○ condition: defined when the alternative hypothesis is a simple hypothesis
○ the probability of type Ⅱ error (β) = 1 - power
○ trade-off between α and β

○ the critical region takes the form of an interval above or below a specific cutoff (∵ Neyman-Pearson lemma)
○ α and β cannot both be reduced at the same time: moving the cutoff to lower one of them raises the other (see the illustration below)
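A small numerical illustration of this trade-off (not from the original text: it assumes a one-sided Z-test of H0: μ = 0 against H1: μ = 1 based on the sample mean, with n = 9 and σ = 1):

```python
# alpha-beta trade-off for a one-sided Z-test on the mean of N(mu, 1), n = 9:
# reject H0: mu = 0 in favor of H1: mu = 1 when Xbar >= c.
# Raising the cutoff c lowers alpha (type I error) but raises beta (type II error).
import numpy as np
from scipy.stats import norm

n, mu0, mu1, sigma = 9, 0.0, 1.0, 1.0
se = sigma / np.sqrt(n)                      # standard error of Xbar

for c in [0.3, 0.5, 0.7, 0.9]:               # cutoffs for Xbar
    alpha = norm.sf(c, loc=mu0, scale=se)    # P(Xbar >= c | H0)
    beta = norm.cdf(c, loc=mu1, scale=se)    # P(Xbar <  c | H1)
    print(f"cutoff={c:.1f}  alpha={alpha:.3f}  beta={beta:.3f}  power={1 - beta:.3f}")
```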
⑷ comparison of critical region
① criterion: among critical regions of the same size, the one with greater power is better
② more powerful test: for a specific θ1 ∈ Θ1 and two critical regions C1, C2 of the same size,
πC1(θ1) ≥ πC2(θ1) (C1 is at least as powerful as C2)
③ most powerful test: for a specific θ1 ∈ Θ1 and any critical region C of the same size,
πC*(θ1) ≥ πC(θ1)
④ uniformly most powerful test: for any θ ∈ Θ1 and any critical region C of the same size,
πC*(θ) ≥ πC(θ)
2. Neyman-Pearson lemma
⑴ idea
① premise: H0 : θ = θ0, H1 : θ = θ1 (simple hypothesis)
② question : finding a critical region that maximizes the power when the size of the critical region is constant
③ thought experiment: points x of the state space are added to the critical region C one by one
○ adding a point x to C increases both the size (by p(x, θ0)) and the power (by p(x, θ1))
○ p(x, θ0): a kind of cost; increasing it enlarges the size of the critical region
○ p(x, θ1): a kind of benefit; increasing it enlarges the power
④ conclusion
○ line-up strategy: it is advantageous to add points x to the critical region in decreasing order of the ratio p(x, θ1) / p(x, θ0) (see the sketch below)
○ the critical region produced by this strategy, C = {x | p(x, θ1) / p(x, θ0) ≥ k}, is the critical region of the most powerful test
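A minimal sketch of the line-up strategy on a small discrete state space (the two pmfs and the target size below are illustrative assumptions, not from the text):

```python
# Line-up strategy (Neyman-Pearson idea) on a discrete state space:
# add points to the critical region in decreasing order of the likelihood ratio
# p(x, theta1) / p(x, theta0) while the size (probability under H0) stays within the target.
p0 = {0: 0.50, 1: 0.25, 2: 0.15, 3: 0.07, 4: 0.03}   # p(x, theta0), null (illustrative)
p1 = {0: 0.10, 1: 0.15, 2: 0.25, 3: 0.25, 4: 0.25}   # p(x, theta1), alternative (illustrative)
target_size = 0.10                                   # desired size of the test

order = sorted(p0, key=lambda x: p1[x] / p0[x], reverse=True)   # line-up order

C, size, power = [], 0.0, 0.0
for x in order:
    if size + p0[x] > target_size:
        break                                        # adding x would exceed the target size
    C.append(x)
    size += p0[x]                                    # cost: size under H0
    power += p1[x]                                   # benefit: power under H1

print(f"critical region C = {sorted(C)}, size = {size:.2f}, power = {power:.2f}")
```

In a continuous state space the same ordering yields the region {x | p(x, θ1)/p(x, θ0) ≥ k} described above; with discrete distributions, hitting an exact target size may additionally require randomization on the boundary point, which this greedy sketch omits.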
⑵ lemma
① premise : H0, H1 are simple hypotheses
② statement: for any k ∈ ℝ, taking the following critical region gives a most powerful test
C* = {x | λ(x) ≥ k}, where λ(x) = ℒ(θ1 | x) / ℒ(θ0 | x)
○ ℒ : likelihood function
○ likelihood ratio test (LR test): a test that rejects H0 when λ(x) ≥ k
○ determination of the critical region: to know the exact form of the critical region (i.e. the value of k), the size of the critical region must be given
○ every x satisfying λ(x) ≥ k is included in the critical region C*
○ every x satisfying λ(x) < k is excluded from the critical region C*
③ application
○ as only the ordering induced by p(x, θ1) ÷ p(x, θ0) matters, the following conversion by a monotone increasing function f(·) is allowed
{x | λ(x) ≥ k} = {x | f(λ(x)) ≥ f(k) = k′}
○ constant terms involving θ0, θ1, n, etc. are easily removed in this way
○ point: such modifications of the critical region are allowed as long as the existence of k′ is ensured
○ determination of the critical region: to know the exact form of the critical region, the size of the critical region must be given
⑶ proof
① assumption: C* and C have the same size

② definition of C*

③ conclusion: C* is the critical region of the most powerful test

⑷ example 1.
① X1, ···, Xn ~ Bernoulli(θ)

② H0 : θ = θ0, H1 : θ = θ1 > θ0
③ likelihood ratio test

④ Z-test (significance level: α)
○ θ1 > θ0 : one-tailed test

○ θ1 < θ0 : one-tailed test

○ a uniformly most powerful critical region does not exist here, because the direction of the most powerful critical region depends on whether θ1 is greater than or less than θ0
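To make the reduction explicit, here is one way the likelihood-ratio computation in this example can be written out (a sketch consistent with the setup X1, ···, Xn ~ Bernoulli(θ) and θ1 > θ0; k and k′ are the generic cutoffs used throughout this section):

```latex
\lambda(\mathbf{x})
  = \frac{\prod_{i=1}^{n} \theta_1^{x_i}(1-\theta_1)^{1-x_i}}
         {\prod_{i=1}^{n} \theta_0^{x_i}(1-\theta_0)^{1-x_i}}
  = \left(\frac{\theta_1}{\theta_0}\right)^{\sum_i x_i}
    \left(\frac{1-\theta_1}{1-\theta_0}\right)^{n-\sum_i x_i} \ge k
  \;\Longleftrightarrow\;
  \sum_{i=1}^{n} x_i \ge k'
```

The equivalence holds because, for θ1 > θ0, the ratio is a monotone increasing function of Σ xi; by the central limit theorem the critical region can then be expressed through Z = (X̄ − θ0) / √(θ0(1 − θ0)/n), which gives the one-tailed Z-test. For θ1 < θ0 the inequality flips, which is why the two cases lead to critical regions in opposite directions.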
⑸ example 2.
① X1, ···, Xn ~ N(μ, 1²)

② H0 : μ = μ0, H1 : μ = μ1 > μ0
③ likelihood ratio test

④ Z-test: one-tailed test (significance level: α)

⑹ generalization 1. the form of the critical region is unchanged even when H1 is a composite hypothesis, provided the critical region does not depend on the specific value of θ1
① X1, ···, Xn ~ Bernoulli(θ)
② H0 : θ = θ0, H1 : θ > θ0
⑺ generalization 2. building on generalization 1, the form of the critical region is also unchanged when the null hypothesis is a composite hypothesis with boundary θ0, because the type I error probability α is maximized at θ = θ0
① X1, ···, Xn ~ Bernoulli(θ)
② H0 : θ < θ0, H1 : θ > θ0
3. generalized likelihood ratio test
⑴ definition
① limitation of the Neyman-Pearson lemma: in principle, both the null hypothesis and the alternative hypothesis must be simple hypotheses
② GLR test (generalized likelihood ratio test)
λ(x) = maxθ∈Θ ℒ(θ | x) ÷ maxθ∈Θ0 ℒ(θ | x), reject H0 when λ(x) ≥ k
③ the maximization max ℒ(θ | x) is carried out with the maximum likelihood method (ML)
④ this method has been shown to produce statistically reasonable critical regions
⑵ example 1. Xi ~ N(μ, σ²), σ² is known
① H0 : μ = μ0, H1 : μ ≠ μ0
② generalized likelihood ratio test

③ τ-test: one-tailed test (significance level: α)

④ Z-test: two-tailed test (significance level: α)

⑤ it can be shown that even if Xi does not follow a normal distribution, the above method still applies approximately (for large n)
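A minimal sketch of the resulting two-tailed Z-test with σ² known (the sample, μ0, σ, and α below are illustrative):

```python
# Two-tailed Z-test of H0: mu = mu0 vs H1: mu != mu0 with sigma^2 known.
# Reject H0 when |Z| = sqrt(n) * |Xbar - mu0| / sigma >= z_{alpha/2}.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma, alpha = 0.0, 1.0, 0.05
x = rng.normal(loc=0.4, scale=sigma, size=30)        # illustrative sample

z = np.sqrt(len(x)) * (x.mean() - mu0) / sigma       # test statistic
z_crit = norm.ppf(1 - alpha / 2)                     # two-tailed critical value
p_value = 2 * norm.sf(abs(z))                        # two-sided p value

print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, p = {p_value:.4f}")
print("reject H0" if abs(z) >= z_crit else "fail to reject H0")
```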
⑶ example 2. Xi ~ N(μ, σ²), σ² is unknown
① H0 : μ = μ0, H1 : μ ≠ μ0
② generalized likelihood ratio test

③ F-test: one-tailed test (significance level: α)

④ T-test: two-tailed test (significance level: α)

⑷ example 3. Xi ~ N(μ, σ²), σ² is unknown
① H0 : μ = μ0, H1 : μ > μ0
② generalized likelihood ratio test

③ key assumptions
○ samples with Xavg ≥ μ0 have a larger likelihood ratio than samples with Xavg < μ0, so the former have higher priority in the line-up strategy
○ since the significance level used in practice is at most about 0.10 (typically 0.025, 0.05, or 0.10), it is sufficient to consider only the region Xavg ≥ μ0, which covers half of the possible outcomes
④ T-test: one-tailed test (significance level: α)

⑤ the same logic applies when H1 : μ < μ0
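A short sketch of the T-tests in examples 2 and 3 using scipy (the sample and μ0 are illustrative; the one-tailed `alternative` argument requires SciPy ≥ 1.6):

```python
# One-sample T-test of H0: mu = mu0 when sigma^2 is unknown;
# T = sqrt(n) * (Xbar - mu0) / S with S the sample standard deviation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0 = 5.0
x = rng.normal(loc=5.6, scale=2.0, size=25)              # illustrative sample

# two-tailed test (example 2): H1: mu != mu0
t_stat, p_two = stats.ttest_1samp(x, popmean=mu0)
# one-tailed test (example 3): H1: mu > mu0
_, p_one = stats.ttest_1samp(x, popmean=mu0, alternative='greater')

print(f"T = {t_stat:.3f}, two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")
```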
⑸ example 4. Xi ~ N(μ, σ²), μ is unknown
① H0 : σ² = σ0², H1 : σ² ≠ σ0²
② generalized likelihood ratio test

③ setting the critical region
○ f(τ) is a convex function with its minimum at τ = n
○ condition 1. P(τ ≥ k’ | H0) + P(τ ≤ k’’ | H0) = α
○ condition 2. f(k’) = f(k’’)

④ τ-test: two-tailed test (significance level: α)
○ numerical analysis is required to set an ideal critical region
○ in practice, simpler critical regions are used
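A sketch of the simpler critical region mentioned above, using the statistic τ = Σ(Xi − X̄)² / σ0² = (n − 1)S²/σ0², which follows χ²(n − 1) under H0 (the data, σ0², and α are illustrative; the equal-tail cutoffs are the practical simplification, not the exact GLR region):

```python
# Chi-square test for a variance: H0: sigma^2 = sigma0^2 vs H1: sigma^2 != sigma0^2.
# Statistic tau = (n - 1) * S^2 / sigma0^2 ~ chi2(n - 1) under H0.
# Equal-tail cutoffs at alpha/2 in each tail (the "simpler critical region").
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
sigma0_sq, alpha = 4.0, 0.05
x = rng.normal(loc=0.0, scale=2.5, size=20)          # illustrative sample

n = len(x)
tau = (n - 1) * x.var(ddof=1) / sigma0_sq            # test statistic
lower, upper = chi2.ppf([alpha / 2, 1 - alpha / 2], df=n - 1)
p_value = 2 * min(chi2.cdf(tau, df=n - 1), chi2.sf(tau, df=n - 1))

print(f"tau = {tau:.2f}, cutoffs = ({lower:.2f}, {upper:.2f}), p = {p_value:.4f}")
print("reject H0" if (tau < lower or tau > upper) else "fail to reject H0")
```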

⑹ example 5. special likelihood ratio test
① definition
○ in the case that Xi ~ N(μ, σ²) and σ² is known, 2 ln λ ~ χ²(1)
○ if the sample size n is large enough, the following holds approximately, where k is the number of parameters constrained by the null hypothesis
2 ln λ ~ χ²(k)
② τ-test: one-tailed test (significance level: α)

③ supplements
○ some statisticians only refer to these tests as the likelihood ratio test (LR test)
○ some statisticians define the statistic as −2 ln λ = 2 ln ℒ(H1) − 2 ln ℒ(H0), i.e. with λ = ℒ(H0) / ℒ(H1)
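A generic sketch of the large-sample LR test described above: given the maximized log-likelihoods under H0 and H1 (placeholder values here), 2 ln λ is compared with the upper tail of χ²(k):

```python
# Large-sample likelihood ratio test: 2 ln(lambda) = 2 * (ll_full - ll_restricted)
# is compared with chi2(k), where k is the number of parameters fixed by H0.
# The log-likelihood values and k below are placeholders.
from scipy.stats import chi2

ll_restricted = -1312.4      # maximized log-likelihood under H0 (placeholder)
ll_full = -1308.1            # maximized log-likelihood under H1 (placeholder)
k = 2                        # number of parameters restricted by H0 (placeholder)

lr_stat = 2 * (ll_full - ll_restricted)      # 2 ln(lambda)
p_value = chi2.sf(lr_stat, df=k)             # one-tailed (upper) p value

print(f"2 ln(lambda) = {lr_stat:.2f}, p = {p_value:.4f}")
```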
4. p value
⑴ definition: the probability, computed under the null hypothesis, of obtaining a result at least as extreme as the observed sample
① caution: the p value is not the probability that the null hypothesis is true, although it is often misread that way
② rejecting H0 when the test statistic falls in the critical region and rejecting H0 when the p value is less than α are equivalent (each is a necessary and sufficient condition for the other)
③ A strict definition
⑵ calculation: θ* denotes the observed value of the test statistic
① right-sided test: p value = P(θ ≥ θ*)
② left-sided test: p value = P(θ ≤ θ*)
③ symmetric distribution about μ: p value = P(|θ - μ| ≥ |θ* - μ|)
④ chi-squared distribution: if θ* is bigger than the median, p value = P(θ ≥ θ*). if θ* is smaller than the median, p value = P(θ ≤ θ*)
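The rules above, written out for a standard normal test statistic (the observed value θ* = 1.8 is illustrative):

```python
# p-value calculation rules for an observed statistic theta_star,
# illustrated with a standard normal null distribution (theta_star is made up).
from scipy.stats import norm

theta_star, mu = 1.8, 0.0

p_right = norm.sf(theta_star)                 # right-sided: P(theta >= theta*)
p_left = norm.cdf(theta_star)                 # left-sided:  P(theta <= theta*)
p_two = 2 * norm.sf(abs(theta_star - mu))     # symmetric:   P(|theta - mu| >= |theta* - mu|)

print(f"right = {p_right:.4f}, left = {p_left:.4f}, two-sided = {p_two:.4f}")
```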
⑶ power and p value
① the main issues in classical statistics are finding distribution and increasing power
② strict meaning: with α held constant, higher power means a higher probability of rejecting the null hypothesis when the alternative hypothesis is true
③ meaning of keeping α constant
○ it means fixing a common cutoff (a "Maginot line") on each distribution obtained from the various statistical techniques
○ it also means that many outcomes other than the given sample are treated as consistent with the null hypothesis, even though they do not necessarily indicate that the null hypothesis is true
④ meaning of increasing 1 − β: making the "Maginot line" more extreme across the various statistical techniques
⑤ intuitive meaning: higher power means using statistical techniques that tend to yield a smaller p value for the same data when α is constant
⑥ example 1. for the same sample, the F statistic yields a smaller p value than the t statistic → higher power
⑦ example 2. the t-distribution becomes narrower as the degree of freedom increases → the power increases
⑧ different statistical techniques have different power: meaning that statistical conclusions may differ for the same statistical data
⑷ Example : correlation coefficient and p value.
① H0 : X and Y are not correlated
② meaning of the p value: the probability that a sample drawn from an uncorrelated population yields a correlation coefficient at least as large (in absolute value) as the observed one
③ assumptions for calculating the p value via the normal distribution
○ random sampling data
○ bivariate normal distribution: the two variables X and Y jointly follow a normal distribution
○ linear relationship: quadratic or cubic relationships are not suitable
○ if the three conditions above are not met, the p value must be calculated with a non-parametric test (see the sketch below)
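A sketch contrasting the parametric p value (Pearson, which relies on the assumptions above) with a rank-based non-parametric alternative (Spearman), on illustrative data:

```python
# p value for H0: X and Y are not correlated.
# pearsonr relies on the assumptions listed above (bivariate normality, linearity);
# spearmanr is a rank-based, non-parametric alternative.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)            # illustrative correlated data

r, p_pearson = pearsonr(x, y)
rho, p_spearman = spearmanr(x, y)
print(f"Pearson  r   = {r:.3f}, p = {p_pearson:.4f}")
print(f"Spearman rho = {rho:.3f}, p = {p_spearman:.4f}")
```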
⑸ Multiple testing problem
① Overview
○ Under the null hypothesis H0, the p-value follows a uniform distribution on [0, 1].
○ Proof: Under the null hypothesis, let the CDF of S be F0. If F0 is a non-decreasing function, then…
○ Problem definition: Suppose we test 1,000 hypotheses and reject the null hypothesis for each hypothesis with a p-value less than α = 0.05. How many null hypotheses would we expect to reject incorrectly? Approximately 50 (∵ 1000 × 0.05 = 50), so we cannot assume that all rejected hypotheses are significant (see the simulation sketch after this list).
○ Key Issue: Conducting multiple statistical tests inherently increases the likelihood of inaccurate conclusions.
○ Example: This problem is particularly relevant when identifying differentially expressed genes (DEGs) from sequencing data consisting of multiple genes.
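A simulation sketch of the problem definition above: 1,000 one-sample t-tests in which every null hypothesis is true still reject roughly 50 of them at α = 0.05 (the data-generating settings are illustrative):

```python
# Multiple testing problem: when all 1,000 null hypotheses are true,
# about 1000 * 0.05 = 50 of them are still rejected at alpha = 0.05,
# because p values are uniform on (0, 1) under H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, m, n = 0.05, 1000, 30

samples = rng.normal(loc=0.0, scale=1.0, size=(m, n))   # H0 (mu = 0) is true for every test
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

print(f"false rejections: {(p_values < alpha).sum()} out of {m} (expected about {int(m * alpha)})")
```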
② Solution 1: Controlling the Family-Wise Error Rate (FWER)
○ Definition: The probability of making at least one incorrect conclusion among all hypotheses.
○ For instance, a 5% FWER means that the probability of making even a single incorrect conclusion is less than or equal to 5%. This approach is very conservative and minimizes false positives.
○ FWER control is sometimes criticized for low power, in the sense that it leads to many type II errors.
○ Method 1. Šidák correction: Adjusts the alpha threshold instead of the p-values, using α_adjusted = 1 − (1 − α)^(1/d). Used when the p-values are independent.
○ d: Number of statistical tests
○ Method 2. Bonferroni correction: Adjusts the individual p-values directly, using p_adjusted = min(d × p, 1). Can be applied even if the p-values are not independent. Very conservative.
○ d: Number of statistical tests
○ Note: If the adjusted p-value exceeds 1, it is forcibly set to 1.
○ FWER proof at α
○ Let the number of statistical tests be m; independence is not required here, because the argument relies only on the union bound.
○ I0 denotes the fixed but unknown set of true null hypotheses, whose p-values tend to be relatively large.
○ Method 3. Holm (step-down) procedure
○ Step 1. Order the p-values, obtaining P(1) ≤ ··· ≤ P(m).
○ Step 2. Let R denote the smallest r ≥ 0 such that P(r+1) > α / (m-r).
○ Step 3. If R > 0, reject H(1), ···, H(R), where H(i) is associated with P(i).
○ FWER proof at α
○ Let the number of statistical tests be m; independence is not required here, because the argument relies only on the union bound.
○ I0 denotes the fixed but unknown set of true null hypotheses, whose p-values tend to be relatively large.
○ At the same α, Holm is more powerful than Bonferroni.
○ Method 4. Hochberg (step-up) procedure
○ Step 1. Order the p-values, obtaining P(1) ≤ ⋯ ≤ P(m).
○ Step 2. Let R denote the largest r ≥ 0 such that P(r) ≤ α / (m + 1 - r).
○ Step 3. If R > 0, reject H(1), ⋯, H(R), where H(i) is associated with P(i).
○ Intuitive understanding of FWER control at significance level α
○ Assume the number of statistical tests is m and that the tests are independent (the Hochberg procedure genuinely requires independence, or at least positive dependence, among the tests)
○ Let I0 be the fixed but unknown set of true null hypotheses, whose p-values tend to be relatively large
○ Note: The inequality m - j0 + 1 ≥ m0 does not necessarily hold, so the following derivation is for reference only
○ When using the same significance level α, Hochberg is more powerful than Holm
○ Intuition: Holm uses a “for all” condition, while Hochberg uses a “for some” condition
○ Method 5. Tukey-Kramer honestly significant difference (HSD; a range test)
○ This procedure applies for performing all pairwise comparisons in a multiple sample situation.
○ Null hypothesis: Hjk : μj = μk
○ J samples (Yij : i = 1, ···, nj), j = 1, ···, J
○ N = n1 + ··· + nJ
○ μj: Population mean of the group j
○ Statistic: At level α, Tukey-Kramer rejects Hjk if |Ȳj − Ȳk| / √((S²/2)(1/nj + 1/nk)) ≥ q(α; J, N − J), where S² is the pooled within-group variance and q(α; J, N − J) is the upper-α quantile of the studentized range distribution
○ Theory: When the samples are independent, normal and with same variance, and when the sample sizes are equal, Tukey-Kramer controls the FWER exactly at level α.
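A sketch of all pairwise comparisons with the Tukey-Kramer procedure via statsmodels' pairwise_tukeyhsd (the three groups below are illustrative data):

```python
# Tukey-Kramer HSD: all pairwise comparisons of group means with FWER control.
# The three groups below are illustrative data.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0.0, 1, 20),
                    rng.normal(0.5, 1, 20),
                    rng.normal(1.0, 1, 20)])
groups = np.repeat(["A", "B", "C"], 20)

result = pairwise_tukeyhsd(endog=y, groups=groups, alpha=0.05)
print(result.summary())          # pairwise mean differences, confidence intervals, reject flags
```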
③ Solution 2: Controlling the False Discovery Rate (FDR)
○ Overview
○ Definition: Limits the expected proportion of false discoveries among the rejected null hypotheses to a certain level.
○ FWER control implies FDR control (at the same level α).
○ By considering the p-value distributions under both H0 and H1, a less conservative statistical test can be performed.
○ Method 1. Benjamini-Hochberg (B&H): Suitable when the correlations among the tests are simple (independent or positively dependent).
○ Step 1. Order the p-values, obtaining P(1) ≤ ⋯ ≤ P(m).
○ Step 2. Let R denote the largest r such that P(r) ≤ rα / m.
○ Step 3. If R > 0, reject H(1), ⋯, H(R), where H(i) is associated with P(i).
○ Like the Hochberg procedure, this is a step-up procedure (it starts from the least significant p-value), but the thresholds are quite different.
○ Hochberg compares P(j) with α / (m - j + 1).
○ Benjamini-Hochberg compares P(j) with jα / m.
○ An Intuitive Understanding of FDR Proof at Significance Level α
○ Assuming Independence Among Statistical Tests
○ Adjusted p value: p_adjusted = p × (d / rank)
○ d: Number of statistical tests
○ rank: Sorting order of p-values
○ Note: If the adjusted p-value exceeds 1, it is forcibly set to 1.
○ The lower the rank (e.g., rank = 1), the lower the adjusted p-value should be; if this monotonicity is violated, a final step enforces it (compare the "Initial" and "Final" adjusted p-values in the table below).
○ Example: For significance level α, a total of d = 25 tests, and the i-th smallest p-value p(i):
Gene | p-val | Rank | Initial Adj p-val | Final Adj p-val |
---|---|---|---|---|
A | 0.039 | 3 | 0.039 × (25/3) = 0.325 | 0.21 |
B | 0.001 | 1 | 0.001 × (25/1) = 0.025 | 0.025 |
C | 0.041 | 4 | 0.041 × (25/4) = 0.256 | 0.21 |
D | 0.042 | 5 | 0.042 × (25/5) = 0.21 | 0.21 |
E | 0.008 | 2 | 0.008 × (25/2) = 0.1 | 0.1 |
… | … | … | … | … |
Table 1. Example of B&H Test with 25 Genes
○ Method 2. Benjamini–Yekutieli (B&Y): Suitable for cases with complex correlations among tests.
○ Whether the tests are independent or not, Benjamini-Yekutieli controls the FDR at α.
○ Adjusted p value: p_adjusted = p × (d / rank) × ∑i=1..d (1/i)
○ d: Number of statistical tests
○ rank: Sorting order of p-values
○ ∑i=1..d (1/i): Adjustment constant that controls the FDR more conservatively by accounting for possible correlations among the tests.
○ Note: If the adjusted p-value exceeds 1, it is forcibly set to 1.
④ Adjusted p-value: Introduced to apply the same significance level α across different correction methods.
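A sketch comparing adjusted p-values across the correction methods above using statsmodels.stats.multitest.multipletests (the raw p-values are illustrative; 'fdr_bh' and 'fdr_by' correspond to the B&H and B&Y adjustments, 'bonferroni', 'sidak', and 'holm' to the FWER methods):

```python
# Adjusted p values under several of the correction methods discussed above.
# The raw p values are illustrative.
from statsmodels.stats.multitest import multipletests

p_raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.930]

for method in ["bonferroni", "sidak", "holm", "fdr_bh", "fdr_by"]:
    reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method=method)
    print(f"{method:10s} rejected: {int(reject.sum())}  "
          f"adjusted: {[round(float(p), 3) for p in p_adj]}")
```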
5. Types of Statistical Tests
⑴ Comprehensive Summary of Statistical Test Examples
⑸ Run Test
⑹ Fisher Exact Test (hypergeometric test)
⑻ Cochran-Mantel-Haenszel (CMH) Test
Input: 2019.06.19 14:52