Chapter 11. Sample Group and Sample Distribution

Higher category : 【Statistics】 Statistics Overview

1. term

2. characteristic of sample groups

3. chi-squared distribution

4. Student’s t-distribution

5. Snedecor’s F-distribution

1. term

⑴ population: the entire group of interest

⑵ survey

① complete enumeration: surveying every member of the population; it is usually expensive

② sample survey: surveying only a part of the population

⑶ sample survey

① representative sample: a sample that well reflects the characteristics of the population

② purposive sampling: a sample chosen by the investigator's subjective judgment with the aim of representing the population

③ random sampling: a sample drawn without any subjective judgment by the investigator

○ simple random sampling: every member of the population has the same probability of being selected

○ characteristic 1. identically distributed: every observation follows the same distribution as the population

○ characteristic 2. independently distributed: the observations are independent of one another

○ together, the two characteristics are called independent and identically distributed (i.i.d.); this is the main advantage of random sampling
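
A minimal sketch of random sampling, assuming a hypothetical finite population stored in a NumPy array (the population values, seed, and sample size are illustrative and do not come from the text). Drawing with replacement is used here because it makes the draws independent and identically distributed.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, only for reproducibility

# Hypothetical population of 1,000 measurements (illustrative values).
population = rng.normal(loc=170.0, scale=8.0, size=1_000)

# Simple random sampling with replacement: on every draw each member has
# probability 1/1000, and the draws do not influence one another (i.i.d.).
sample = rng.choice(population, size=10, replace=True)

print(sample.round(1))
```

With `replace=False` every member would still be equally likely to be chosen, but the draws would no longer be exactly independent, which matters only when the sample is a large fraction of the population.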

2. characteristic of sample groups

⑴ random sample: when n observations X₁, ···, X_n are drawn at random from a population X,

① the observations are mutually independent

② each observation has the same probability distribution as X

③ E(X_i) = E(X) = μ

④ VAR(X_i) = VAR(X) = σ²

⑵ the relationship between the sample and the population

① for a population with mean μ and variance σ²,

② sample mean: X_avg = (X₁ + ··· + X_n) / n

③ sample variance: S² = ∑(X_i - X_avg)² / (n - 1)

④ sample correlation r_XY : defined analogously to the Pearson correlation coefficient ρ(X, Y), with sample moments in place of population moments

○ | r_XY | ≤ 1

○ r_XY = 1 ⇔ Y_i = aX_i + b, a ＞ 0

○ r_XY = -1 ⇔ Y_i = aX_i + b, a ＜ 0
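
The sample statistics above can be sketched with NumPy. The data below are hypothetical and chosen so that Y is an exact linear function of X; `ddof=1` is what selects the n - 1 denominator for the sample variance.

```python
import numpy as np

# Hypothetical paired data, chosen only for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = 3.0 * x + 1.0                     # exact linear relation with slope a > 0

x_bar = x.mean()                      # sample mean
s2 = x.var(ddof=1)                    # sample variance, n - 1 denominator
r_pos = np.corrcoef(x, y)[0, 1]       # sample correlation, slope a > 0
r_neg = np.corrcoef(x, -y)[0, 1]      # sample correlation, slope a < 0

print(x_bar, s2, r_pos, r_neg)
```

Up to floating-point rounding, the two correlations come out as +1 and -1, matching the two boundary cases of | r_XY | ≤ 1.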

⑶ the sample mean as a new random variable

① mean of the sample mean: E(X_avg) = μ

② variance of the sample mean: VAR(X_avg) = σ² / n
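
A small simulation can illustrate the standard results E(X_avg) = μ and VAR(X_avg) = σ² / n. This is only a sketch: the normal population, the parameter values, and the seed are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 25, 100_000   # illustrative values

# Each row is one random sample of size n; take the mean of every row.
samples = rng.normal(mu, sigma, size=(trials, n))
means = samples.mean(axis=1)

print("mean of sample means    :", means.mean())        # close to mu
print("variance of sample means:", means.var(ddof=1))   # close to sigma**2 / n
print("theoretical variance    :", sigma**2 / n)
```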

⑷ central limit theorem

① original form (de Moivre-Laplace): the binomial distribution B(n, p) is approximated by the normal distribution N(np, np(1 - p)) when n is large

② generalization : for i.i.d. X_i from any probability distribution with finite mean μ and variance σ², the sample mean approximately follows N(μ, σ² / n) if n is large enough
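
The generalization can be checked numerically. The sketch below assumes an exponential population (strongly skewed, mean 1, variance 1); the sample size and seed are arbitrary choices for illustration. If the normal approximation holds, about 95% of the standardized sample means should fall within ±1.96.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 50_000                 # illustrative values

# Exponential(1) population: mean 1, variance 1, far from normal.
samples = rng.exponential(1.0, size=(trials, n))

# Standardize each sample mean: (X_avg - mu) / (sigma / sqrt(n)).
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

coverage = np.mean(np.abs(z) < 1.96)
print("fraction within ±1.96:", coverage)   # should be near 0.95
```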

3. chi-squared distribution

⑴ Overview

① the sampling distribution that arises when the sample statistic is the sample variance

② the distribution of the sum of the squares of n independent standard normal random variables

③ special case of the gamma distribution with rate λ = 1/2 and shape r = n/2

⑵ meaning 1. distribution of a random variable related to the sample variance (assume X₁, ···, X_n are i.i.d. N(μ, σ²))

① lemma 1. if Z ~ N(0, 1), then Y = Z² ~ χ²(1)

② lemma 2. since (X_i - μ) / σ ~ N(0, 1), its square follows χ²(1)

③ lemma 3. if Z₁, ···, Z_n are independent and Z_i ~ N(0, 1), then W = ∑Z_i² ~ χ²(n)

④ lemma 4. when the population mean μ is known, B = ∑(X_i - μ)² / σ² ~ χ²(n)

⑤ lemma 5. expansion of B: B = A + C, where A = ∑(X_i - X_avg)² / σ² = (n - 1)S² / σ² and C = n(X_avg - μ)² / σ² ~ χ²(1)

⑥ lemma 6. A and C are independent: X_i - X_avg and X_avg are jointly normal, so COV(X_i - X_avg, X_avg) = 0 is a necessary and sufficient condition for their independence

○ COV(X_i - X_avg, X_avg) = 0 : intuitively, the part of X_i that cannot be explained by X_avg is independent of X_avg itself

⑦ lemma 7. since ψ_A(t) × ψ_C(t) = ψ_B(t) (∵ A and C are independent), where ψ is the moment generating function, ψ_A(t) = (1 - 2t)^(-n/2) / (1 - 2t)^(-1/2) = (1 - 2t)^(-(n-1)/2), so A ~ χ²(n-1)
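
The conclusion that (n - 1)S² / σ² follows χ²(n - 1) can be illustrated by simulation. In this sketch the names `a`, `b`, `c` stand for the sum of squares around the sample mean, the sum of squares around μ, and the squared standardized sample mean; these names, the normal population, and the parameter values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 10.0, 3.0, 8, 200_000   # illustrative values

x = rng.normal(mu, sigma, size=(trials, n))
x_bar = x.mean(axis=1)

a = (n - 1) * x.var(axis=1, ddof=1) / sigma**2    # (n - 1) S^2 / sigma^2
c = ((x_bar - mu) / (sigma / np.sqrt(n)))**2      # squared standardized mean
b = (((x - mu) / sigma)**2).sum(axis=1)           # sum of squares around mu

print("b equals a + c       :", np.allclose(a + c, b))
print("mean of a (n - 1)    :", a.mean())          # chi2(n-1) has mean n - 1
print("variance of a 2(n-1) :", a.var(ddof=1))     # and variance 2(n - 1)
print("corr(a, c)           :", np.corrcoef(a, c)[0, 1])  # near 0
```

A near-zero correlation does not prove independence by itself, but it is consistent with the independence argument in lemma 6.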

⑶ meaning 2. exponential distribution and chi-squared distribution

⑷ degree of freedom

① First used in a chi-square distribution

② χ²(n) is a chi-square distribution with n degrees of freedom

③ asymmetric shape, skewed to the right (long right tail); the smaller the degrees of freedom n, the stronger the skew

④ for n ≥ 3 the density has a single peak at n - 2, and the larger n is, the closer the shape is to the normal distribution

⑸ characteristic

① if Z ~ N(0, 1), then Z² ~ χ²(1)

② expected value: E(X) = n (where n is the degrees of freedom)

③ variance: VAR(X) = 2n (where n is the degrees of freedom)

④ χ²(n) / n converges to 1 as n → ∞
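
These characteristics can be checked with `scipy.stats`. The sketch below uses n = 5 as an arbitrary example; note that SciPy parameterizes the gamma distribution by shape `a` and `scale`, so rate λ = 1/2 corresponds to `scale=2`.

```python
import numpy as np
from scipy.stats import chi2, gamma

n = 5                                      # degrees of freedom (illustrative)

# Mean n and variance 2n.
mean, var = chi2.stats(n, moments="mv")
print(float(mean), float(var))

# Gamma distribution with shape n/2 and rate 1/2 (scale = 2) has the same density.
x = np.linspace(0.1, 20.0, 50)
same_pdf = np.allclose(chi2.pdf(x, n), gamma.pdf(x, a=n / 2, scale=2))
print(same_pdf)

# The standard deviation of chi2(k) / k is sqrt(2 / k), which shrinks to 0,
# so chi2(k) / k concentrates around 1 as k grows.
ratio_sd = [chi2.std(k) / k for k in (10, 1_000, 100_000)]
print(ratio_sd)
```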

⑹ application

① chi-squared distribution table

Table. 1. Chi-squared distribution

② probability density function: for 0 ＜ x ＜ ∞ and n degrees of freedom, f(x) = x^(n/2 - 1) e^(-x/2) / (2^(n/2) Γ(n/2))

○ in practice, the density is hard to evaluate by hand, so tables or software are used

○ graph

Figure. 1. probability density function of the chi-squared distribution with 55 degrees of freedom

○ Python programming : Bokeh is used for web-page visualization

# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html

import numpy as np
from scipy.stats import chi2
from bokeh.plotting import figure, output_file, show

output_file("chi_squared_distribution.html")

df = 55

x = np.linspace(0, 100, 300)
y = chi2.pdf(x, df)

p = figure(width = 400, height = 400, title = "Chi-squared Distribution", 
               tooltips=[("x", "$x"), ("y", "$y")] )
p.line(x, y, line_width = 2)
show(p)

③ R code (e.g. in RStudio)

qchisq(0.95, 1) # 95% quantile of the chi-squared distribution with 1 degree of freedom
# [1] 3.841459
qchisq(0.99, 1) # 99% quantile
# [1] 6.634897
chi_square <- seq(0, 10)
dchisq(chi_square, 1) # density function
# [1] Inf 0.2419707245 0.1037768744 0.0513934433 0.0269954833
# [6] 0.0146449826 0.0081086956 0.0045533429 0.0025833732 0.0014772828
# [11] 0.0008500367
df <- matrix(c(38, 14, 11, 51), ncol = 2, dimnames = list(hair = c("Fair", "Dark"), eye = c("Blue", "Brown")))
df_chisq <- chisq.test(df) # chi-squared test of independence on the 2 x 2 table
df_chisq$p.value
# [1] 8.700134e-09

4. Student’s t-distribution

⑴ definition : when Z ~ N(0, 1) and Y ~ χ²(n) are independent, the probability distribution of the random variable T = Z / √(Y / n), called the t-distribution with n degrees of freedom

⑵ meaning

① standardizing the sample mean with the normal distribution requires the population variance σ²

② in practice σ² is unknown, so the sample variance S² is used in its place

③ for a normal population, (X_avg - μ) / (S / √n), the standardized sample mean used in interval estimation, follows exactly the t-distribution with n - 1 degrees of freedom
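
The effect of replacing σ with the sample standard deviation can be seen in a simulation. This sketch assumes a standard normal population and a small sample size n = 5 (both illustrative): intervals built with the t critical value cover μ about 95% of the time, while intervals built with the normal value 1.96 cover it less often.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
mu, sigma, n, trials = 0.0, 1.0, 5, 200_000    # illustrative values

x = rng.normal(mu, sigma, size=(trials, n))

# Standardize with the sample standard deviation instead of sigma.
t_stat = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

crit_t = t.ppf(0.975, n - 1)                   # two-sided 95% value, n - 1 d.o.f.
cover_t = np.mean(np.abs(t_stat) < crit_t)     # close to 0.95
cover_z = np.mean(np.abs(t_stat) < 1.96)       # noticeably below 0.95

print(crit_t, cover_t, cover_z)
```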

⑶ characteristic

① symmetric about 0

② the t-distribution has heavier tails than the standard normal distribution and approaches it as the degrees of freedom increase

sample size n (degrees of freedom n - 1)	two-sided 95% critical value
4	± 3.182
60	± 2.001
200	± 1.972
1000	± 1.962
∞	± 1.96

Table. 2. two-sided 95% critical values of the t-distribution
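
The critical values in Table 2 can be reproduced with `scipy.stats.t.ppf`. As far as can be told from the numbers, the tabulated values ± 3.182, ± 2.001, ± 1.972, ± 1.962 correspond to 3, 59, 199 and 999 degrees of freedom, that is, to n - 1 for n = 4, 60, 200, 1000; this reading is an inference from the values, not something the table states.

```python
from scipy.stats import norm, t

# Two-sided 95% interval: 2.5% in each tail, so use the 97.5% quantile.
crit = {df: t.ppf(0.975, df) for df in (3, 59, 199, 999)}
z_crit = norm.ppf(0.975)          # limiting value as the degrees of freedom grow

for df, value in crit.items():
    print(df, round(value, 3))
print("normal", round(z_crit, 3))
```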

⑷ application

① t-distribution table

Table. 3. t-distribution table

② probability density function: for -∞ ＜ x ＜ ∞ and n degrees of freedom, f(x) = Γ((n + 1)/2) / (√(nπ) Γ(n/2)) × (1 + x²/n)^(-(n + 1)/2)

○ in practice, the density is difficult to evaluate by hand

○ graph

Figure. 2. t-distribution with 2.74 degrees of freedom

○ Python programming : Bokeh is used for web-page visualization

# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html

import numpy as np
from scipy.stats import t
from bokeh.plotting import figure, output_file, show

output_file("t_distribution.html")

df = 2.74

x = np.linspace(-7, 7, 300)
y = t.pdf(x, df)

p = figure(width = 400, height = 400, title = "Student's t Distribution", 
               tooltips=[("x", "$x"), ("y", "$y")] )
p.line(x, y, line_width = 2)
show(p)

5. Snedecor’s F-distribution

⑴ definition: when U ~ χ²(n) and V ~ χ²(m) are independent, the probability distribution of the random variable F = (U / n) / (V / m), written F ~ F(n, m)

⑵ meaning

⑶ characteristics

① characteristic 1. if X ~ F(n, m), then 1 / X ~ F(m, n)

② characteristic 2. if X ~ F(n, m), then E(X) = m / (m - 2) (for m ＞ 2)

③ characteristic 3. if X ~ F(n, m), then VAR(X) = 2m²(n + m - 2) / (n(m - 2)²(m - 4)) (for m ＞ 4)

④ characteristic 4. if T follows the t-distribution with n degrees of freedom, then T² ~ F(1, n)

⑤ characteristic 5. F(n, m) converges to χ²(n) / n as m → ∞

○ reason: the denominator χ²(m) / m converges to 1 as m → ∞
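
The characteristics of the F-distribution can be checked numerically with `scipy.stats`. The degrees of freedom n = 5, m = 12 and the evaluation grid are arbitrary choices for this sketch, and the limit m → ∞ is only approximated by a large finite m.

```python
import numpy as np
from scipy.stats import chi2, f, t

n, m = 5, 12                          # illustrative degrees of freedom
x = np.linspace(0.1, 5.0, 50)

# 1. X ~ F(n, m) implies 1/X ~ F(m, n): P(X <= x) = P(1/X >= 1/x).
recip_ok = np.allclose(f.cdf(x, n, m), f.sf(1.0 / x, m, n))

# 2-3. Mean m / (m - 2) and variance 2m^2(n + m - 2) / (n (m - 2)^2 (m - 4)).
mean_ok = np.isclose(f.mean(n, m), m / (m - 2))
var_ok = np.isclose(
    f.var(n, m), 2 * m**2 * (n + m - 2) / (n * (m - 2) ** 2 * (m - 4))
)

# 4. T ~ t(m) implies T^2 ~ F(1, m): P(T^2 <= x) = P(|T| <= sqrt(x)).
t2_ok = np.allclose(f.cdf(x, 1, m), 1 - 2 * t.sf(np.sqrt(x), m))

# 5. For a very large denominator d.o.f., F(n, m) is close to chi2(n) / n.
big_m = 1e6
limit_gap = np.max(np.abs(f.cdf(x, n, big_m) - chi2.cdf(n * x, n)))

print(recip_ok, mean_ok, var_ok, t2_ok, limit_gap)
```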

⑷ application

① F-distribution table

Table. 4. F-distribution table (α: 0.01)

Table. 5. F-distribution table (α : 0.025)

Table. 6. F-distribution table (α : 0.05)

② probability density function: for 0 ＜ x ＜ ∞ and degrees of freedom n, m (assuming F(n, m)), f(x) = Γ((n + m)/2) / (Γ(n/2) Γ(m/2)) × (n/m)^(n/2) x^(n/2 - 1) (1 + nx/m)^(-(n + m)/2)

○ in practice, the density is hard to evaluate by hand

○ graph

Figure. 3. F-distribution (degrees of freedom: numerator = 29, denominator = 18)

○ Python programming : Bokeh is used for web-page visualization

# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html

import numpy as np
from scipy.stats import f
from bokeh.plotting import figure, output_file, show

output_file("f_distribution.html")

dfn, dfd = 29, 18

x = np.linspace(0, 6, 300)
rv = f(dfn, dfd)
y = rv.pdf(x)

p = figure(width = 400, height = 400, title = "F Distribution", 
               tooltips=[("x", "$x"), ("y", "$y")] )
p.line(x, y, line_width = 2)
show(p)

Input : 2019.06.19 13:42
