Chapter 11. Sample Group and Sample Distribution
1. term
2. characteristic of sample groups
1. term
⑴ population: the entire group of interest
⑵ survey
① complete enumeration (census): investigating the entire population; it is expensive
② sample enumeration: surveying only part of the population
⑶ sample survey
① representative sample: a sample that well reflects the characteristics of the population
② purposive sampling: sampling in which the investigator's subjective judgment is used to select units thought to represent the population
③ random sampling: sampling in which the investigator's subjectivity plays no part
○ if every unit is drawn with the same probability, the sample has the following two characteristics
○ characteristic 1. identically distributed: every sample follows the same distribution as the population
○ characteristic 2. independently distributed: the samples are independent of one another
○ together the two characteristics are called independent and identically distributed (i.i.d.), an important advantage of random sampling
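As a minimal sketch of random sampling, the snippet below draws ten units with equal probability. The population (the integers 1 to 100), the sample size, and the seed are all assumptions chosen for illustration; drawing with replacement is what makes the draws independent as well as identically distributed.

```python
import numpy as np

# Hypothetical population: the integers 1..100 (not from the original notes).
population = np.arange(1, 101)

rng = np.random.default_rng(0)  # fixed seed only for reproducibility

# Drawing with replacement gives every unit probability 1/100 on every draw,
# so the ten draws are independent and identically distributed.
sample = rng.choice(population, size=10, replace=True)

print("sample:", sample)
print("sample mean:", sample.mean(), "population mean:", population.mean())
```

With `replace=False` the draws would still be equally likely but no longer independent, because a unit that has been drawn cannot appear again.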
2. characteristic of sample groups
⑴ random sample: when n samples X1, ···, Xn are drawn at random from the population X,
① each sample is independent
② each sample has the same probability distribution
③ E(Xi) = E(X) = μ
④ VAR(Xi) = VAR(X) = σ²
⑵ the relationship between the sample group and population
① let the population mean be μ and the population variance be σ²
② sample mean: Xavg = (1 / n) ∑ Xi
③ sample variance: S² = (1 / (n - 1)) ∑ (Xi - Xavg)²
④ sample correlation : defined analogously to the Pearson correlation coefficient ρ(x, y), rXY = ∑(Xi - Xavg)(Yi - Yavg) / √(∑(Xi - Xavg)² × ∑(Yi - Yavg)²)
○ | rXY | ≤ 1
○ rXY = 1 ⇔ Yi = aXi + b, a > 0
○ rXY = -1 ⇔ Yi = aXi + b, a < 0
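The two boundary cases can be checked numerically. The sketch below uses a hypothetical helper `sample_correlation` and made-up data points; it computes the sample correlation directly from centered sums and applies it to an exactly linear relation with positive and with negative slope.

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation computed from centered sums (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

x = np.array([1.0, 2.0, 4.0, 7.0])       # made-up data

r_pos = sample_correlation(x, 3 * x + 2)     # Yi = aXi + b with a > 0
r_neg = sample_correlation(x, -0.5 * x + 1)  # Yi = aXi + b with a < 0
r_mid = sample_correlation(x, x**2)          # not linear, so |r| < 1

print(r_pos, r_neg, r_mid)
```

The same value is returned by `np.corrcoef(x, y)[0, 1]`, which can serve as a cross-check.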
⑶ introducing a new random variable : the sample mean Xavg = (X1 + ··· + Xn) / n
① the mean of the sample mean: E(Xavg) = μ
② the variance of the sample mean: VAR(Xavg) = σ² / n (by the independence of the samples)
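The standard results E(Xavg) = μ and VAR(Xavg) = σ² / n can be checked by simulation. The sketch below is only an illustration: the normal population, the parameter values, and the seed are assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only for reproducibility

# Assumed population parameters and sample size (illustrative values).
mu, sigma, n, trials = 5.0, 2.0, 25, 100_000

# Each row is one random sample of size n; each row mean is one sample mean.
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)

mean_of_xbar = xbar.mean()   # should be close to mu
var_of_xbar = xbar.var()     # should be close to sigma**2 / n

print("mean of sample mean:", mean_of_xbar, "vs", mu)
print("variance of sample mean:", var_of_xbar, "vs", sigma**2 / n)
```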
⑷ central limit theorem
① definition : normal distribution approximation of binomial distribution
② generalization : for X with any probability distribution that has a finite mean μ and variance σ², the distribution of the sample mean of X can be approximated by the normal distribution N(μ, σ² / n) if n is large enough
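The generalization can be illustrated with a clearly non-normal population. The sketch below assumes an exponential population with mean 1 and variance 1 (my choice, not the notes'), standardizes the sample means, and checks how often they fall within ± 1.96, which would be about 95% under a normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only for reproducibility

n, trials = 100, 50_000

# Exponential(1) population: mean 1, variance 1, strongly right-skewed.
xbar = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# Standardize with the population mean and the standard error 1 / sqrt(n).
z = (xbar - 1.0) / (1.0 / np.sqrt(n))

# Under the normal approximation this fraction should be near 0.95.
inside = np.mean(np.abs(z) < 1.96)
print("fraction within ±1.96:", inside)
```

Lowering `n` to a small value such as 3 makes the approximation visibly worse, which is the sense in which n has to be "large enough".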
3. chi-squared distribution
⑴ Overview
① Sampling distribution that arises when the sample statistic is the sample variance
② The distribution obtained by squaring each of n independent standard normal random variables and then summing the squares
③ Special form of gamma distribution with λ = 1/2, r = n/2
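The gamma relation can be checked with SciPy. Note that `scipy.stats.gamma` is parameterized by a shape `a` and a scale equal to 1 / rate, so λ = 1/2 corresponds to `scale=2`; the degree of freedom and the grid below are arbitrary choices for this sketch.

```python
import numpy as np
from scipy.stats import chi2, gamma

n = 5                            # degrees of freedom (arbitrary choice)
x = np.linspace(0.1, 20, 50)     # evaluation grid (arbitrary choice)

# Gamma with shape r = n/2 and rate λ = 1/2, i.e. scale = 1/λ = 2.
pdf_chi2 = chi2.pdf(x, n)
pdf_gamma = gamma.pdf(x, a=n / 2, scale=2)

same = np.allclose(pdf_chi2, pdf_gamma)
print("densities agree:", same)
```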
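⑵ meaning 1. distribution of the random variable related to the sample variance (X1, ···, Xn are i.i.d. N(μ, σ²))
① lemma 1. if Z ~ N(0, 1), then Y = Z² ~ χ²(1)
② lemma 2. since (Xi - μ) / σ ~ N(0, 1), its square follows χ²(1)
③ lemma 3. if Z1, ···, Zn are independent and Zi ~ N(0, 1), then W = ∑Zi² ~ χ²(n)
④ lemma 4. when the population mean is known, summing ② over the n samples gives B = ∑(Xi - μ)² / σ² ~ χ²(n)
⑤ lemma 5. expansion of B around the sample mean: B = A + C, where A = ∑(Xi - Xavg)² / σ² = (n - 1)S² / σ² (S²: sample variance with divisor n - 1) and C = n(Xavg - μ)² / σ² ~ χ²(1)
○ the cross term vanishes because ∑(Xi - Xavg) = 0
⑥ lemma 6. A and C are independent: Xi - Xavg and Xavg are jointly normal, so COV = 0 is a necessary and sufficient condition for their independence
○ COV(Xi - Xavg, Xavg) = 0 : intuitively, the remainder of Xi that cannot be explained by Xavg is independent of Xavg itself
⑦ lemma 7. since ψA(t) × ψC(t) = ψB(t) for the moment generating functions ψ (∵ A and C are independent), ψA(t) = (1 - 2t)^(-n/2) / (1 - 2t)^(-1/2) = (1 - 2t)^(-(n-1)/2), so A ~ χ²(n-1)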
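The chain of lemmas can be sanity-checked by simulation. The sketch below assumes the reading A = ∑(Xi - Xavg)² / σ², C = n(Xavg - μ)² / σ² and B = ∑(Xi - μ)² / σ² for the symbols used in lemmas 6 and 7; the parameter values and the seed are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only for reproducibility

# Assumed normal population and sample size (illustrative values).
mu, sigma, n, trials = 10.0, 3.0, 8, 100_000

x = rng.normal(mu, sigma, size=(trials, n))
xbar = x.mean(axis=1)

a = ((x - xbar[:, None]) ** 2).sum(axis=1) / sigma**2   # A: uses the sample mean
c = n * (xbar - mu) ** 2 / sigma**2                     # C: one squared N(0, 1)
b = ((x - mu) ** 2).sum(axis=1) / sigma**2              # B: uses the known mean

decomposition_ok = np.allclose(a + c, b)   # B = A + C holds sample by sample
corr_ac = np.corrcoef(a, c)[0, 1]          # near 0, consistent with independence

print("B = A + C:", decomposition_ok)
print("mean of A:", a.mean(), "expected", n - 1)
print("variance of A:", a.var(), "expected", 2 * (n - 1))
print("corr(A, C):", corr_ac)
```

A sample mean near n - 1 and a sample variance near 2(n - 1) are what χ²(n-1) predicts; a near-zero correlation is only consistent with independence, it does not prove it.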
⑶ meaning 2. exponential distribution and chi-squared distribution
⑷ degrees of freedom
① The concept was first used with the chi-squared distribution
② χ²(n) is the chi-squared distribution with n degrees of freedom
③ The smaller the degrees of freedom n, the more asymmetric the shape: the mass piles up on the left and a long tail extends to the right (right-skewed)
④ For n ≥ 3 the density is single-peaked (unimodal) with its peak at n - 2, and the larger n is, the closer the shape gets to a normal distribution
⑸ characteristic
① if Z ~ N(0, 1), then Z² ~ χ²(1)
② Expected value: E(X) = n (where n is the degrees of freedom)
③ Variance: V(X) = 2n (where n is the degrees of freedom)
④ χ²(n) / n converges to 1 as n → ∞ (its mean is 1 and its variance 2 / n shrinks to 0)
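These characteristics can be read off from `scipy.stats.chi2`. The sketch below prints the mean and variance for a few degrees of freedom and then shows the central 99% range of χ²(n) / n narrowing around 1; the particular values of n are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2

# Mean and variance for a few degrees of freedom: E(X) = n, V(X) = 2n.
for n in (1, 5, 50):
    mean, var = chi2.stats(n, moments="mv")
    print(n, float(mean), float(var))

# Central 99% range of chi2(n) / n for growing n: it narrows around 1.
widths = []
for n in (10, 1_000, 100_000):
    lo, hi = chi2.ppf([0.005, 0.995], n) / n
    widths.append(hi - lo)
    print(n, lo, hi)
```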
⑹ application
① chi-squared distribution table
② probability density function: for 0 < x < ∞ and degrees of freedom n, f(x) = x^(n/2 - 1) e^(-x/2) / (2^(n/2) Γ(n/2))
○ in reality, it’s hard to use the probability density function by hand
○ graph
○ Python programming : Bokeh is used for web-page visualization
```python
# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html
import numpy as np
from scipy.stats import chi2
from bokeh.plotting import figure, output_file, show

# Write the interactive plot to a standalone HTML file.
output_file("chi_squared_distribution.html")

df = 55                         # degrees of freedom
x = np.linspace(0, 100, 300)
y = chi2.pdf(x, df)             # density evaluated on the grid

p = figure(width=400, height=400,
           title="Chi-squared Distribution",
           tooltips=[("x", "$x"), ("y", "$y")])
p.line(x, y, line_width=2)
show(p)
```
③ R code (for RStudio)
```r
# 95% and 99% quantiles of the chi-squared distribution with 1 degree of freedom
qchisq(0.95, 1)
# [1] 3.841459
qchisq(0.99, 1)
# [1] 6.634897

chi_square <- seq(0, 10)
dchisq(chi_square, 1)  # density function
#  [1]          Inf 0.2419707245 0.1037768744 0.0513934433 0.0269954833
#  [6] 0.0146449826 0.0081086956 0.0045533429 0.0025833732 0.0014772828
# [11] 0.0008500367

# chi-squared test of independence on a 2 x 2 table (hair colour by eye colour)
df <- matrix(c(38, 14, 11, 51), ncol = 2,
             dimnames = list(hair = c("Fair", "Dark"),
                             eye = c("Blue", "Brown")))
df_chisq <- chisq.test(df)
df_chisq$p.value
# [1] 8.700134e-09
```
4. Student’s t-distribution
⑴ definition : when Z ~ N(0, 1) and Y ~ χ²(n) are independent, the probability distribution of the random variable T = Z / √(Y / n), written T ~ t(n)
⑵ meaning
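① inference based on the normal distribution requires knowing the variance of the population
② in reality, we don't know the variance of the population, so we use the sample variance instead
③ when the population is normal, the standardized sample mean (Xavg - μ) / (S / √n), with the sample standard deviation S in place of σ, follows exactly the t-distribution with n - 1 degrees of freedom; this is what interval estimation uses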
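The practical consequence can be seen in a small simulation. The sketch below assumes a standard normal population and samples of size 5 (both illustrative choices), builds the statistic (Xavg - μ) / (S / √n), and compares how often it falls inside the 95% limits taken from the t-distribution with n - 1 degrees of freedom versus the limits ± 1.96 from the normal distribution.

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(0)  # fixed seed only for reproducibility

# Assumed population and a deliberately small sample size (illustrative values).
mu, sigma, n, trials = 0.0, 1.0, 5, 100_000

x = rng.normal(mu, sigma, size=(trials, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)               # sample standard deviation (divisor n - 1)

stat = (xbar - mu) / (s / np.sqrt(n))   # uses S in place of the unknown sigma

cover_t = np.mean(np.abs(stat) < t.ppf(0.975, n - 1))   # t-based limits
cover_z = np.mean(np.abs(stat) < norm.ppf(0.975))       # normal-based limits

print("coverage with t limits:", cover_t)
print("coverage with normal limits:", cover_z)
```

With the t limits the coverage stays near the nominal 95%, while the normal limits are too narrow for such a small sample.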
⑶ characteristic
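① symmetric about 0
② the t-distribution has fatter tails than the standard normal distribution; the two-sided 95% critical values below shrink toward ± 1.96 as the sample size grows
sample size n (degrees of freedom n - 1) | two-sided 95% critical value |
---|---|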
4 | ± 3.182 |
60 | ± 2.001 |
200 | ± 1.972 |
1000 | ± 1.962 |
∞ | ± 1.96 |
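Critical values of this kind can be computed with `scipy.stats.t.ppf`. In the sketch below the two-sided 95% level and the chosen degrees of freedom are assumptions made to match the values quoted in the table; 3.182 is obtained at 3 degrees of freedom (a sample of size 4), and the last line gives the normal limit.

```python
from scipy.stats import norm, t

# Two-sided 95% critical value = 97.5% quantile of the t-distribution.
critical = {}
for df in (3, 4, 59, 60, 199, 999):
    critical[df] = t.ppf(0.975, df)
    print(f"df = {df:4d}: ± {critical[df]:.3f}")

z_limit = norm.ppf(0.975)   # limiting value as df grows without bound
print(f"normal   : ± {z_limit:.3f}")
```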
⑷ application
① t-distribution table
② probability density function: for -∞ < x < ∞ and degrees of freedom n, f(x) = Γ((n + 1) / 2) / (√(nπ) Γ(n / 2)) × (1 + x² / n)^(-(n + 1) / 2)
○ in reality, it is difficult to use the probability density function by hand
○ graph
○ Python programming : Bokeh is used for web-page visualization
```python
# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html
import numpy as np
from scipy.stats import t
from bokeh.plotting import figure, output_file, show

# Write the interactive plot to a standalone HTML file.
output_file("t_distribution.html")

df = 2.74                       # degrees of freedom (need not be an integer)
x = np.linspace(-7, 7, 300)
y = t.pdf(x, df)                # density evaluated on the grid

p = figure(width=400, height=400,
           title="Student's t Distribution",
           tooltips=[("x", "$x"), ("y", "$y")])
p.line(x, y, line_width=2)
show(p)
```
5. Snedecor’s F-distribution
⑴ definition: when U ~ χ²(n) and V ~ χ²(m) are independent, the probability distribution of the random variable F = (U / n) / (V / m), written F ~ F(n, m)
⑵ meaning
⑶ characteristics
① characteristic 1. if X ~ F(n, m), then 1 / X ~ F(m, n)
② characteristic 2. if X ~ F(n, m), then E(X) = m / (m - 2) (assuming m > 2)
③ characteristic 3. if X ~ F(n, m), then VAR(X) = 2m²(n + m - 2) / [n(m - 2)²(m - 4)] (assuming m > 4)
④ characteristic 4. if T ~ t(n), then T² ~ F(1, n)
⑤ characteristic 5. as m → ∞, F(n, m) converges to χ²(n) / n
○ reason: χ²(m) / m converges to 1 as m → ∞, so the denominator of F(n, m) tends to 1
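The standard properties of the F-distribution can each be checked numerically with `scipy.stats`. The sketch below is only a spot check at arbitrarily chosen degrees of freedom and evaluation points, with a large but finite denominator degree of freedom standing in for the limit.

```python
import numpy as np
from scipy.stats import chi2, f, t

n, m = 6, 10   # numerator and denominator degrees of freedom (arbitrary)
x = 1.7        # evaluation point (arbitrary)

# 1. If X ~ F(n, m) then 1/X ~ F(m, n): P(X <= x) = P(1/X >= 1/x).
recip_ok = np.isclose(f.cdf(x, n, m), 1 - f.cdf(1 / x, m, n))

# 2 and 3. Mean and variance formulas (valid for m > 2 and m > 4).
mean, var = f.stats(n, m, moments="mv")
mean_ok = np.isclose(mean, m / (m - 2))
var_ok = np.isclose(var, 2 * m**2 * (n + m - 2) / (n * (m - 2) ** 2 * (m - 4)))

# 4. T^2 with T ~ t(n) follows F(1, n): compare 95% quantiles.
t_ok = np.isclose(t.ppf(0.975, n) ** 2, f.ppf(0.95, 1, n))

# 5. For very large m, F(n, m) behaves like chi2(n) / n.
limit_ok = np.isclose(f.cdf(x, n, 10**6), chi2.cdf(n * x, n), atol=1e-4)

print(recip_ok, mean_ok, var_ok, t_ok, limit_ok)
```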
⑷ application
① F-distribution table
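② probability density function: for 0 < x < ∞ and degrees of freedom n, m (assuming F(n, m)), f(x) = Γ((n + m) / 2) / (Γ(n / 2) Γ(m / 2)) × (n / m)^(n/2) × x^(n/2 - 1) × (1 + nx / m)^(-(n + m) / 2)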
○ in reality, it’s hard to use the probability density function by hand
○ graph
○ Python programming : Bokeh is used for web-page visualization
```python
# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html
import numpy as np
from scipy.stats import f
from bokeh.plotting import figure, output_file, show

# Write the interactive plot to a standalone HTML file.
output_file("f_distribution.html")

dfn, dfd = 29, 18               # numerator and denominator degrees of freedom
x = np.linspace(0, 6, 300)
rv = f(dfn, dfd)                # frozen F(dfn, dfd) distribution
y = rv.pdf(x)                   # density evaluated on the grid

p = figure(width=400, height=400,
           title="F Distribution",
           tooltips=[("x", "$x"), ("y", "$y")])
p.line(x, y, line_width=2)
show(p)
```
Input : 2019.06.19 13:42