
Chapter 11. Sample Group and Sample Distribution



1. term

2. characteristic of sample groups

3. chi-squared distribution

4. Student’s t-distribution

5. Snedecor’s F-distribution



1. term 

⑴ population: the entire group of interest

⑵ survey

① complete enumeration (census): investigates the entire population; it is expensive

② sample survey: investigates only part of the population

⑶ sample survey

① representative sample: a sample that well reflects the characteristics of the population

② purposive sampling: the investigator subjectively selects the units judged to best represent the population

③ random sampling: units are selected by chance, without the investigator's subjectivity

○ if every unit has the same probability of being selected, the observations have two characteristics

characteristic 1. identically distributed: every observation follows the same distribution

characteristic 2. independently distributed: the observations are independent of one another

○ together, the two characteristics are called independent and identically distributed (i.i.d.) and are an important advantage of random sampling



2. characteristic of sample groups

⑴ random sample: when n observations X1, ···, Xn are drawn at random,

① each sample is independent

② each sample has the same probability distribution

③ E(Xi) = E(X) = μ

④ VAR(Xi) = VAR(X) = σ²

⑵ the relationship between the sample group and population

① let the population mean be μ and the population variance be σ²

② sample mean


X̄ = (1/n) ∑ Xi (sum over i = 1, ···, n)


③ sample covariance


drawing


④ sample correlation: defined analogously to the Pearson correlation coefficient ρ(X, Y)


rXY = ∑ (Xi − X̄)(Yi − Ȳ) / √( ∑ (Xi − X̄)² · ∑ (Yi − Ȳ)² )


| rXY | ≤ 1 

rXY = 1 ⇔ Yi = aXi + b, a > 0

rXY = -1 ⇔ Yi = aXi + b, a < 0 
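The sample statistics above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data, the seed, and the helper name `sample_corr` are illustrative choices, not part of the original text. It computes the sample mean, the sample covariance (here with the common n − 1 denominator, which is an assumption about the omitted formula), and the sample correlation, and shows that exactly linear data give rXY = ±1.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

x = rng.normal(loc=10.0, scale=2.0, size=50)  # illustrative data
y_pos = 3.0 * x + 1.0    # Y = aX + b with a > 0
y_neg = -3.0 * x + 1.0   # Y = aX + b with a < 0


def sample_corr(u, v):
    """Sample correlation from centered sums; any 1/n or 1/(n-1) factor cancels."""
    du, dv = u - u.mean(), v - v.mean()
    return float(np.sum(du * dv) / np.sqrt(np.sum(du**2) * np.sum(dv**2)))


x_bar = float(x.mean())                        # sample mean
s_xy = float(np.cov(x, y_pos, ddof=1)[0, 1])   # sample covariance, n - 1 denominator
r_pos = sample_corr(x, y_pos)
r_neg = sample_corr(x, y_neg)

print(f"sample mean = {x_bar:.3f}, sample covariance = {s_xy:.3f}")
print(f"r (a > 0) = {r_pos:.6f}, r (a < 0) = {r_neg:.6f}")
```

Because the correlation is a ratio of centered sums, the choice between n and n − 1 in the covariance does not affect it.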

⑶ the sample mean X̄ as a new random variable

① the expected value of the sample mean


E(X̄) = (1/n) ∑ E(Xi) = μ


② the variance of the sample mean (using the independence of the Xi)


VAR(X̄) = (1/n²) ∑ VAR(Xi) = σ² / n
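The two results in ⑶ are the standard identities E(X̄) = μ and VAR(X̄) = σ²/n. A short simulation can make them concrete. This is a sketch under stated assumptions: a normal population with μ = 5 and σ = 2, samples of size n = 25, and 20,000 repetitions are all arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

mu, sigma, n = 5.0, 2.0, 25     # illustrative population parameters and sample size
n_repeats = 20_000              # number of simulated samples

# Each row is one random sample of size n; take the mean of every row.
samples = rng.normal(mu, sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

mean_of_means = float(sample_means.mean())
var_of_means = float(sample_means.var(ddof=1))
theoretical_var = sigma**2 / n

print(f"mean of sample means     = {mean_of_means:.4f} (theory: {mu})")
print(f"variance of sample means = {var_of_means:.4f} (theory: {theoretical_var})")
```

The simulated values only approximate the theoretical ones; the gap shrinks as `n_repeats` grows.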


⑷ central limit theorem

① original form (de Moivre-Laplace): for large n, the binomial distribution can be approximated by a normal distribution

② generalization: for i.i.d. X1, ···, Xn from any distribution with finite mean μ and variance σ², the sample mean is approximately N(μ, σ²/n) if n is large enough
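The generalization can be illustrated by simulation. The sketch below assumes an Exponential(1) population, which is strongly right-skewed, and compares the skewness of standardized sample means for n = 2 and n = 200; the population, the sample sizes, and the function name `standardized_means` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed for reproducibility
n_repeats = 20_000


def standardized_means(n):
    """Standardized sample means of n draws from Exponential(1) (mean 1, variance 1)."""
    draws = rng.exponential(scale=1.0, size=(n_repeats, n))
    return (draws.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))


# Skewness is 0 for a normal distribution, so it serves as a rough normality gauge.
skew_small = float(stats.skew(standardized_means(2)))
skew_large = float(stats.skew(standardized_means(200)))

print(f"skewness of standardized mean, n = 2:   {skew_small:.3f}")
print(f"skewness of standardized mean, n = 200: {skew_large:.3f}")
```

The skewness moves toward 0 as n grows, which is consistent with the normal approximation improving.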



3. chi-squared distribution 

⑴ Overview

① The sampling distribution that arises when the sample statistic is the sample variance

② The distribution of the sum of the squares of n independent standard normal random variables

③ Special form of gamma distribution with λ = 1/2, r = n/2


drawing


⑵ meaning 1. the distribution of a random variable built from the sample variance


drawing


lemma 1. if Z ~ N(0, 1), then Y = Z² ~ χ²(1)


drawing


lemma 2. since (Xi − μ) / σ ~ N(0, 1), its square follows χ²(1)


drawing


lemma 3. if Z1, ···, Zn are independent with Zi ~ N(0, 1), then W = ∑ Zi² ~ χ²(n)


drawing


lemma 4. probability distribution of ⑵ when the population mean is known


drawing


lemma 5. expansion of the random variable of ⑵


drawing


lemma 6. A and C are independent: Xi − X̄ and X̄ are jointly normal, so COV(Xi − X̄, X̄) = 0 is a necessary and sufficient condition for their independence

○ COV(Xi − X̄, X̄) = 0: intuitively, the part of Xi that cannot be explained by X̄ is independent of X̄ itself


drawing


lemma 7. since A and C are independent, ψA(t) × ψC(t) = ψB(t); therefore A ~ χ²(n − 1)


drawing
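Lemma 7 concludes that A follows χ²(n − 1). Assuming A denotes (n − 1)S² / σ² for a normal population, with S² the sample variance (the formula images are not reproduced here, so this reading is an assumption), the result can be checked by simulation. The parameters below are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed for reproducibility

mu, sigma, n = 0.0, 3.0, 10     # illustrative normal population and sample size
n_repeats = 20_000

samples = rng.normal(mu, sigma, size=(n_repeats, n))
s2 = samples.var(axis=1, ddof=1)   # sample variance S^2 of every sample
a = (n - 1) * s2 / sigma**2        # candidate chi-squared(n - 1) variable

# Kolmogorov-Smirnov distance to chi-squared(n - 1) and, for contrast, chi-squared(n).
ks_right = stats.kstest(a, stats.chi2(df=n - 1).cdf)
ks_wrong = stats.kstest(a, stats.chi2(df=n).cdf)

print(f"mean = {a.mean():.3f} (theory: {n - 1}), variance = {a.var():.3f} (theory: {2 * (n - 1)})")
print(f"KS distance to chi2(n-1) = {ks_right.statistic:.4f}, to chi2(n) = {ks_wrong.statistic:.4f}")
```

The simulated variable sits much closer to χ²(n − 1) than to χ²(n), which reflects the one degree of freedom used up by estimating the mean with X̄.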


⑶ meaning 2. exponential distribution and chi-squared distribution


drawing
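One standard link between the two distributions follows from the gamma form given in the overview (λ = 1/2, r = n/2): for n = 2 the shape is r = 1, so χ²(2) is the exponential distribution with rate λ = 1/2. Whether this is exactly the relationship shown in the figure above is an assumption. A minimal numerical check with SciPy:

```python
import numpy as np
from scipy.stats import chi2, expon

x = np.linspace(0.0, 20.0, 201)

pdf_chi2 = chi2.pdf(x, df=2)            # chi-squared with 2 degrees of freedom
pdf_expon = expon.pdf(x, scale=2.0)     # exponential with rate 1/2 (scale = 1 / rate)

# The two densities should coincide up to floating-point error.
max_gap = float(np.max(np.abs(pdf_chi2 - pdf_expon)))
print(f"largest absolute difference between the densities: {max_gap:.2e}")
```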


⑷ degrees of freedom

① The concept was first used with the chi-squared distribution

② χ²(n) is the chi-squared distribution with n degrees of freedom

③ The smaller n is, the more asymmetric the density: the mass is concentrated on the left, with a long right tail

④ For n ≥ 3 the density is unimodal with a single interior peak, and the larger n is, the closer it is to a normal distribution

⑸ characteristics

① if Z ~ N(0, 1), then Z² ~ χ²(1)

② Expected value: E(X) = n (where n is the degrees of freedom)

③ Variance: VAR(X) = 2n (where n is the degrees of freedom)

④ if X ~ χ²(n), then X / n converges to 1 as n → ∞


drawing
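The moments listed in ⑸ can be read directly from SciPy. The sketch below, with arbitrarily chosen degrees of freedom, also reports VAR(X / n) = 2n / n² = 2 / n, which shrinks to 0 and is one way to see why X / n concentrates around 1.

```python
from scipy.stats import chi2

dfs = (1, 5, 55, 1000)  # illustrative degrees of freedom
summary = {}

for n in dfs:
    mean, var = chi2.stats(n, moments="mv")   # mean and variance of chi-squared(n)
    # VAR(X / n) = VAR(X) / n^2, which equals 2 / n.
    summary[n] = (float(mean), float(var), float(var) / n**2)

for n, (mean, var, var_ratio) in summary.items():
    print(f"n = {n:>4}: E(X) = {mean:g}, VAR(X) = {var:g}, VAR(X / n) = {var_ratio:g}")
```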


⑹ application

① chi-squared distribution table  


drawing


Table. 1. Chi-squared distribution
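Entries of a chi-squared table can also be computed rather than looked up. The sketch below assumes the table lists upper-tail critical values, that is, the point x with P(X > x) = α; the chosen α values and degrees of freedom are illustrative. It uses `scipy.stats.chi2.ppf`, the inverse of the cumulative distribution function.

```python
from scipy.stats import chi2

upper_tail_probs = (0.05, 0.01)   # alpha: probability to the right of the critical value
dfs = (1, 2, 5, 10)               # illustrative degrees of freedom

# table[df] holds the critical values for each alpha, in the order given above.
table = {
    df: [float(chi2.ppf(1.0 - alpha, df)) for alpha in upper_tail_probs]
    for df in dfs
}

print("df    alpha=0.05   alpha=0.01")
for df, (c05, c01) in table.items():
    print(f"{df:<5} {c05:<12.6f} {c01:.6f}")
```

`chi2.isf(alpha, df)` returns the same upper-tail value without the `1 - alpha` step.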


② probability density function: for 0 < x < ∞ and n degrees of freedom,


f(x) = x^(n/2 − 1) e^(−x/2) / ( 2^(n/2) Γ(n/2) )


○ in practice, the probability density function is hard to evaluate by hand

○ graph


drawing


Figure. 1. probability density function of the chi-squared distribution with 55 degrees of freedom


○ Python programming : Bokeh is used for web-page visualization


# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html

import numpy as np
from scipy.stats import chi2
from bokeh.plotting import figure, output_file, show

output_file("chi_squared_distribution.html")

df = 55

x = np.linspace(0, 100, 300)
y = chi2.pdf(x, df)

p = figure(width = 400, height = 400, title = "Chi-squared Distribution", 
               tooltips=[("x", "$x"), ("y", "$y")] )
p.line(x, y, line_width = 2)
show(p)


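③ R code

qchisq(0.95, 1) # 95th percentile of the chi-squared distribution with 1 degree of freedom
# [1] 3.841459
qchisq(0.99, 1) # 99th percentile
# [1] 6.634897
chi_square <- seq(0, 10)
dchisq(chi_square, 1) # density function evaluated at 0, 1, ..., 10
# [1] Inf 0.2419707245 0.1037768744 0.0513934433 0.0269954833
# [6] 0.0146449826 0.0081086956 0.0045533429 0.0025833732 0.0014772828
# [11] 0.0008500367
# 2 x 2 contingency table of hair color by eye color
df <- matrix(c(38, 14, 11, 51), ncol = 2, dimnames = list(hair = c("Fair", "Dark"), eye = c("Blue", "Brown")))
df_chisq <- chisq.test(df) # chi-squared test of independence
df_chisq$p.value # read the p-value directly instead of using attach()
# [1] 8.700134e-09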



4. Student’s t-distribution

⑴ definition: when Z ~ N(0, 1) and Y ~ χ²(n) are independent, the probability distribution of the following random variable


T = Z / √(Y / n) ~ t(n)


⑵ meaning


drawing


① inference on the mean with the normal distribution requires the population variance to be known

② in practice, the population variance is unknown, so the sample variance is used instead

③ for a normal population, when σ is replaced by the sample standard deviation S in interval estimation, (X̄ − μ) / (S / √n) follows exactly the t-distribution with n − 1 degrees of freedom
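The interval estimate described in ⑵ can be sketched as follows. The data are simulated and purely hypothetical (a sample of 12 values from a normal population); the point is only to show the t critical value replacing the normal one when the sample standard deviation stands in for σ.

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(4)  # fixed seed for reproducibility
sample = rng.normal(loc=100.0, scale=15.0, size=12)  # hypothetical small sample

n = sample.size
x_bar = float(sample.mean())
s = float(sample.std(ddof=1))             # sample standard deviation

t_crit = float(t.ppf(0.975, df=n - 1))    # two-sided 95% critical value, n - 1 degrees of freedom
z_crit = float(norm.ppf(0.975))           # normal critical value, for comparison

half_t = t_crit * s / np.sqrt(n)
half_z = z_crit * s / np.sqrt(n)
ci_t = (x_bar - half_t, x_bar + half_t)

print(f"t critical value = {t_crit:.3f}, normal critical value = {z_crit:.3f}")
print(f"95% t-interval for the mean: ({ci_t[0]:.2f}, {ci_t[1]:.2f})")
```

The t-interval is wider than the one based on the normal critical value, which accounts for the extra uncertainty from estimating σ.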

⑶ characteristic 

① symmetry

② the t-distribution has heavier tails than the standard normal distribution and approaches it as the degrees of freedom increase


sample size n (degrees of freedom n − 1)    95% critical value
4 (3)                                       ± 3.182
60 (59)                                     ± 2.001
200 (199)                                   ± 1.972
1000 (999)                                  ± 1.962
∞                                           ± 1.96
Table. 2. 95% confidence interval of the t-distribution (two-sided critical values)
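The values in Table 2 can be reproduced with `scipy.stats.t.ppf`. In this sketch the tabulated numbers are matched by using n − 1 degrees of freedom for n = 4, 60, 200, 1000 (with 4 degrees of freedom the critical value would be about 2.776, not 3.182); this reading is an interpretation of the table rather than something the source states.

```python
from scipy.stats import norm, t

sample_sizes = (4, 60, 200, 1000)

# Two-sided 95% critical value: the 97.5th percentile of t(n - 1).
crit = {n: float(t.ppf(0.975, df=n - 1)) for n in sample_sizes}
crit_normal = float(norm.ppf(0.975))   # limiting value as n grows without bound

for n, c in crit.items():
    print(f"n = {n:>4} (df = {n - 1:>3}): ± {c:.3f}")
print(f"normal limit:          ± {crit_normal:.2f}")
```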


⑷ application 

① t-distribution table


drawing


Table. 3. t-distribution table


② probability density function: for −∞ < x < ∞ and n degrees of freedom,


f(x) = Γ((n + 1)/2) / ( √(nπ) Γ(n/2) ) · (1 + x²/n)^(−(n + 1)/2)


○ in practice, the probability density function is difficult to evaluate by hand

○ graph


drawing


Figure. 2. t distribution at degree of freedom of 2.74


○ Python programming : Bokeh is used for web-page visualization


# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html

import numpy as np
from scipy.stats import t
from bokeh.plotting import figure, output_file, show

output_file("t_distribution.html")

df = 2.74

x = np.linspace(-7, 7, 300)
y = t.pdf(x, df)

p = figure(width = 400, height = 400, title = "Student's t Distribution", 
               tooltips=[("x", "$x"), ("y", "$y")] )
p.line(x, y, line_width = 2)
show(p)



5. Snedecor’s F-distribution 

⑴ definition: when U ~ χ²(n) and V ~ χ²(m) are independent, the probability distribution of the following random variable


F = (U / n) / (V / m) ~ F(n, m)


⑵ meaning


drawing


⑶ characteristics

characteristic 1. if X ~ F(n, m), then 1 / X ~ F(m, n)

characteristic 2. if X ~ F(n, m), then E(X) = m / (m − 2) (for m > 2)

characteristic 3. if X ~ F(n, m), then VAR(X) = 2m²(n + m − 2) / [ n(m − 2)²(m − 4) ] (for m > 4)

characteristic 4. if T ~ t(n), then T² ~ F(1, n)

characteristic 5. as m → ∞, F(n, m) converges to χ²(n) / n

○ reason: the denominator χ²(m) / m converges to 1 as m → ∞
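The characteristics in ⑶ can be checked numerically with SciPy. The sketch below uses n = 29 and m = 18 as illustrative degrees of freedom and compares quantiles rather than simulating; the value 10**6 standing in for m → ∞ is an arbitrary choice.

```python
from scipy.stats import chi2, f, t

n, m = 29, 18   # illustrative numerator and denominator degrees of freedom

# Characteristics 2 and 3: closed-form mean and variance versus SciPy's values.
mean_formula = m / (m - 2)
var_formula = 2 * m**2 * (n + m - 2) / (n * (m - 2) ** 2 * (m - 4))
mean_scipy = float(f.mean(n, m))
var_scipy = float(f.var(n, m))

# Characteristic 1: if X ~ F(n, m) then 1/X ~ F(m, n), so quantiles are reciprocal.
p = 0.95
q_nm = float(f.ppf(p, n, m))
q_mn = float(f.ppf(1 - p, m, n))

# Characteristic 4: the square of a two-sided t(m) critical value is an F(1, m) quantile.
t_squared = float(t.ppf(0.975, m)) ** 2
f_one_m = float(f.ppf(0.95, 1, m))

# Characteristic 5: with a very large denominator df, n * F(n, m) behaves like chi-squared(n).
big_m = 10**6
scaled_f = n * float(f.ppf(p, n, big_m))
chi2_q = float(chi2.ppf(p, n))

print(f"mean: {mean_formula:.6f} vs {mean_scipy:.6f}")
print(f"variance: {var_formula:.6f} vs {var_scipy:.6f}")
print(f"reciprocal quantiles: {q_nm:.6f} vs {1 / q_mn:.6f}")
print(f"t squared vs F(1, m): {t_squared:.6f} vs {f_one_m:.6f}")
print(f"n * F(n, large m) vs chi2(n): {scaled_f:.4f} vs {chi2_q:.4f}")
```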

⑷ application

① F-distribution table


drawing


Table. 4. F-distribution table (α: 0.01)


drawing


Table. 5. F-distribution table (α : 0.025)


drawing


Table. 6. F-distribution table (α : 0.05)


② probability density function: for 0 < x < ∞ and degrees of freedom n, m (assuming F(n, m)),


f(x) = Γ((n + m)/2) / ( Γ(n/2) Γ(m/2) ) · (n/m)^(n/2) · x^(n/2 − 1) · (1 + nx/m)^(−(n + m)/2)


○ in practice, the probability density function is hard to evaluate by hand

○ graph


drawing


Figure. 3. F distribution (degree of freedom: numerator = 29, denominator = 18)


○ Python programming : Bokeh is used for web-page visualization


# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html

import numpy as np
from scipy.stats import f
from bokeh.plotting import figure, output_file, show

output_file("f_distribution.html")

dfn, dfd = 29, 18

x = np.linspace(0, 6, 300)
rv = f(dfn, dfd)
y = rv.pdf(x)

p = figure(width = 400, height = 400, title = "F Distribution", 
               tooltips=[("x", "$x"), ("y", "$y")] )
p.line(x, y, line_width = 2)
show(p)



Input : 2019.06.19 13:42
