Chapter 13. Statistical Estimation
1. overview
⑴ statistical estimation: estimating the characteristics of a population through samples
⑵ state space: ℝⁿ. the set of all observed samples
2. point estimation (parametric approach, location type)
⑴ definition: estimating parameters from samples (x1, ···, xn)

① parameter : values showing the characteristics of the population. μ, σ, θ, λ, etc
② μ : mean of population
③ σ : standard deviation of population
④ θ : probability of success in a Bernoulli or binomial distribution
⑤ λ : rate parameter λ of a Poisson or exponential distribution
⑵ sampling distribution (empirical distribution)
⑶ point estimator: for parameter θ,
① definition 1. a point estimator is not a single number but a function
② definition 2. a point estimator is a function of the sample X1, ···, Xn
○ θ̂ = g(X1, ···, Xn)
③ definition 3. the probability distribution of a point estimator is a function of θ

⑷ criteria for a good point estimator
① expected error or mean squared error (MSE) : also called model risk
○ bias-variance decomposition
○ MSE = E[(θ̂ − θ)²] = Bias(θ̂)² + Var(θ̂), plus the irreducible noise variance σ² when predicting a noisy observation
○ intuitively, the covariance between the bias and the chance error is 0, so the intermediate cross term can be removed
○ strategies that reduce the bias tend to increase the model variance
○ strategies that reduce the model variance tend to increase the bias (see the simulation sketch below)
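The decomposition itself can be checked numerically. A minimal sketch, assuming numpy is available; the shrunken sample mean and all constants are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 20, 50_000

# a deliberately biased estimator: a shrunken sample mean (illustrative choice)
samples = rng.normal(mu, sigma, size=(trials, n))
estimates = 0.9 * samples.mean(axis=1)

bias = estimates.mean() - mu
variance = estimates.var()
mse = ((estimates - mu) ** 2).mean()

print(f"bias^2 + variance = {bias**2 + variance:.5f}")
print(f"MSE               = {mse:.5f}")   # agrees up to simulation noise
```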
② criterion 1. bias: also called systematic error, non-random error, and model bias
○ B(θ̂) = E(θ̂) − θ
○ small bias required
○ cause: underfitting, lack of domain knowledge
○ solutions: use of more complex models, use of models suitable for domain
○ unbiased estimator: B = 0 ⇔ E(θ̂) = θ, i.e. the estimator's expectation equals the parameter. unbiasedness is one mark of a good estimator
○ example 1. sample mean: unbiased estimator of the population mean
○ E(Xavg) = E((1/n) ∑ Xi) = (1/n) ∑ E(Xi) = μ
○ example 2. sample variance : unbiased estimator of the population variance
○ E(S²) = σ², where S² = (1/(n − 1)) ∑ (Xi − Xavg)²
○ example 3. sample covariance
○ Sxy = (1/(n − 1)) ∑ (Xi − Xavg)(Yi − Yavg) is an unbiased estimator of the population covariance
○ example 4. when Xi ~ U[0, θ], an estimator may or may not be unbiased: E(max Xi) = nθ/(n + 1), so max Xi is biased, while ((n + 1)/n) · max Xi is unbiased (see the sketch below)
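A quick numerical check of example 4, assuming numpy; θ, n, and the trial count are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 3.0, 10, 200_000

samples = rng.uniform(0, theta, size=(trials, n))
mle = samples.max(axis=1)            # E(max Xi) = n*theta/(n+1) < theta -> biased
corrected = (n + 1) / n * mle        # bias-corrected estimator

print(f"E[max Xi]           ~ {mle.mean():.4f} (theory {n * theta / (n + 1):.4f})")
print(f"E[(n+1)/n * max Xi] ~ {corrected.mean():.4f} (theta = {theta})")
```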
③ criterion 2. efficiency: related to random chance error

○ on the premise of unbiasedness, a small variance is required
○ 2-1. noise variance: also called 1st chance error and observation variance. denoted σ²
○ examples : measurement error of the instrument itself, intrinsic noise of the target itself
○ there are attempts to measure this noise, but doing so is difficult in practice
○ 2-2. model variance: also called 2nd chance error
○ Var(θ̂) = E[(θ̂ − E(θ̂))²]
○ chance error arising because the sample is a randomly drawn subset of the population
○ cause : overfitting
○ solution : using a simpler model
○ bias-variance tradeoff : as model complexity increases, the bias decreases but the model variance increases, so there is a trade-off and an optimal complexity exists


○ BLUE (best linear unbiased estimator) : the estimator of the smallest variance among linear unbiased estimators

○ uniformly minimum variance unbiased estimator (UMVU)
○ definition: the unbiased estimator with the smallest variance among all unbiased estimators, including non-linear ones
○ the direct calculation of Fisher information In
○ In(θ) = E[(∂ ln L(θ; X) / ∂θ)²]
○ the indirect calculation of Fisher information In : In(θ) = −E[∂² ln L(θ; X) / ∂θ²]
○ Cramér–Rao lower bound = 1 / In(θ) : no unbiased estimator can have a smaller variance
○ if the variance of an unbiased estimator attains the Cramér–Rao lower bound, it is UMVU (see the sketch below)
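For a Bernoulli(θ) sample, In = n/(θ(1 − θ)), so the Cramér–Rao lower bound is θ(1 − θ)/n, and the sample proportion attains it. A minimal simulation sketch, assuming numpy; the values of θ and n are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, trials = 0.3, 50, 200_000

# sample proportion as the estimator of the Bernoulli parameter theta
est = rng.binomial(n, theta, trials) / n

crlb = theta * (1 - theta) / n       # 1 / I_n with I_n = n / (theta * (1 - theta))
print(f"Var(theta_hat) ~ {est.var():.6f}")
print(f"CRLB           = {crlb:.6f}")  # the variance attains the bound -> UMVU
```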
④ criterion 3. consistency and consistent estimator
○ asymptotic property: the behavior of an estimator as the sample size approaches ∞
○ asymptotic unbiasedness: unbiasedness holds in the limit n → ∞. related to the law of large numbers

○ asymptotic efficiency
○ the variance of the estimator attains the Cramér–Rao lower bound as n → ∞
○ consistency : the property that the estimator converges in probability to the parameter (a numerical illustration follows below)
○ θ̂n → θ in probability, i.e. P(|θ̂n − θ| > ε) → 0 as n → ∞ for every ε > 0
○ each Xi is a random variable, but once observed it is generally treated as a specific constant
○ example: the following is a poor estimator because it is inconsistent

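By contrast, the consistency of a well-behaved estimator such as the sample mean is easy to see numerically. A minimal sketch, assuming numpy; the exponential population and its mean are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 1.5                             # population mean of the exponential below

for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(mu, n)       # numpy parametrizes by the scale (= mean)
    print(f"n = {n:>9,d}  sample mean = {x.mean():.4f}")
# the estimates concentrate ever more tightly around mu as n grows
```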
⑤ criterion 4. least squares estimator

⑸ method 1. discrete probability distribution and maximum probability
① it uses the definition of binomial coefficients

② example
○ situation: the number of members of a population is estimated through the mark-and-recapture method
○ given the population size N, the number m of members captured and marked in the first round, the number n of members captured in the second round, and the number x of marked members among the recaptured
○ probability distribution: hypergeometric distribution
○ P(X = x) = C(m, x) · C(N − m, n − x) / C(N, n)
○ question: the most reasonable value of N
○ answer: the likelihood L(N) increases while N < mn/x and decreases afterwards, so N̂ = ⌊mn/x⌋ (see the sketch below)
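The search can also be done directly on the hypergeometric likelihood. A minimal sketch, assuming numpy and scipy; the counts m, n, x are illustrative:

```python
import numpy as np
from scipy.stats import hypergeom

m, n, x = 100, 80, 20                # marked, recaptured, marked among recaptured

# evaluate the likelihood L(N) over a grid of candidate population sizes
Ns = np.arange(m + n - x, 2001)      # N must be at least m + n - x
likelihood = hypergeom.pmf(x, Ns, m, n)   # scipy argument order: (k, M, n, N)
N_hat = Ns[likelihood.argmax()]

print(f"N_hat = {N_hat}  (closed form m*n/x = {m * n / x:.0f})")
```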
⑹ method 2. method of moment estimator (MOM): also called sample analog estimation
① definition: based on the fact that E(Xᵏ) = g(θ) ⇔ θ = g⁻¹(E(Xᵏ)), the estimator of θ is calculated as θ̂ = g⁻¹((1/n) ∑ Xiᵏ)
○ E(Xᵏ) : moment or population moment
○ (1/n) ∑ Xiᵏ : sample moment
○ the population moment is a constant, whereas the sample moment is a random variable with its own sampling distribution
○ consistency : by the law of large numbers, the sample moment converges to the population moment (see the sketch at the end of this subsection)
② sample moment
○ k-th order sample moment about the origin
○ mk = (1/n) ∑ Xiᵏ
○ k-th order sample moment about the sample mean
○ m'k = (1/n) ∑ (Xi − Xavg)ᵏ
③ example

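As a concrete illustration of MOM, here is a minimal sketch estimating the rate of an exponential population, assuming numpy; the value of λ is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 2.5
x = rng.exponential(1 / lam, 10_000)      # exponential with rate lam (scale 1/lam)

# E(X) = 1/lam  =>  lam = 1/E(X); plug in the sample moment
lam_mom = 1 / x.mean()
print(f"lambda from 1st moment: {lam_mom:.3f}  (true {lam})")

# E(X^2) = 2/lam^2  =>  lam = sqrt(2 / E(X^2)); a second-moment alternative
lam_mom2 = np.sqrt(2 / (x ** 2).mean())
print(f"lambda from 2nd moment: {lam_mom2:.3f}")
```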
⑺ method 3. maximum likelihood method (ML)
① definition
○ θ : parameter
○ θ* : the estimator of the parameter θ
○ θML : the maximum likelihood estimator of the parameter θ
② likelihood: the plausibility of the observed data under a given parameter value
③ likelihood function
○ the probability of obtaining the given sample when θ* is given
○ for i.i.d. samples, it is the product of the individual likelihoods
○ that is, p(X | θ*) = ∏ p(xi | θ*)
○ denoted ℒ
④ log likelihood function: the logarithm of the likelihood function
○ denoted ℓ = ln ℒ
⑤ maximum likelihood estimation: finding the θML that maximizes the likelihood function p(X | θ)
○ θML = argmaxθ p(X | θ)
○ assumption : the closer θ* is to the true parameter θ, the greater the likelihood function becomes
○ 1st. differentiate the log likelihood function: find the θ* that attains a local maximum on the valid interval
○ 2nd. if a local maximum exists: the θ* attaining it is taken as θML
○ 3rd. if no local maximum exists: the endpoint of the valid interval with the higher likelihood is taken as θML
○ maximum likelihood estimation and the Hessian matrix: a useful method whenever the log likelihood is differentiable
○ step 1. take the second-order Taylor approximation of ℓ around θk, and compute the update θk+1 = θk + dk that maximizes the approximation
○ dk = −H⁻¹ ∇ℓ(θk), where H is the Hessian of ℓ at θk
○ step 2. Newton-Raphson method : repeatedly updating θk converges to a maximum (the global maximum when ℓ is concave)
○ example : logistic regression (see the sketch below)
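A minimal Newton-Raphson sketch for logistic regression, assuming numpy; the data-generating coefficients and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                          # gradient of the log likelihood
    hess = -(X * (p * (1 - p))[:, None]).T @ X    # Hessian (negative definite)
    step = np.linalg.solve(hess, grad)            # H^{-1} grad
    beta = beta - step                            # theta_{k+1} = theta_k - H^{-1} grad
    if np.abs(step).max() < 1e-10:
        break

print("beta_ML =", beta.round(3), "(true:", beta_true, ")")
```

Because the logistic log likelihood is concave, the local maximum found here is also the global maximum.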
⑥ maximum likelihood estimator : given the sample X, the function G that maps the likelihood to the maximizing θML
○ θML = Gℓ(ℓ) = Gℒ(ℒ)
○ limitation of the estimator: the assumptions behind maximum likelihood estimation may not hold
○ the estimator favored by statisticians
⑦ example 1.

⑧ example 2.

⑨ example 3.

⑩ example 4. the maximum likelihood estimator may not be uniquely determined

⑪ characteristic 1. consistency
○ θML → θ in probability as n → ∞
⑫ characteristic 2. asymptotic normality
○ √n (θML − θ) → N(0, 1/I1(θ)) in distribution, where I1 is the Fisher information of a single observation
⑬ characteristic 3. invariance : if θML is the maximum likelihood estimator of θ, g(θML) is the maximum likelihood estimator of g(θ)
⑭ characteristic 4. maximum likelihood estimation is a special case of Bayesian estimation
○ MAP estimation with a uniform (flat) prior reduces to maximum likelihood: argmax p(θ | X) = argmax p(X | θ) p(θ) = argmax p(X | θ)
3. interval estimation (scaling type)
⑴ definition: estimating which interval the parameter is in through the samples
① purpose of introduction: for a continuous parameter, the probability that the point estimator exactly matches the actual parameter is zero
② confidence level (confidence coefficient)
○ P(θleft < θ < θright) = 1 - α, 0 < α < 1
○ threshold: values that constitute the boundary of the confidence interval. θleft, θright, etc
○ 1 - α : confidence level (confidence coefficient)
○ α : rejection probability or significance level
○ confidence interval: the interval [θleft, θright] that contains θ with probability (1 − α) × 100%
③ notes
○ P(Z > 1.65) = 5% ⇔ P(|Z| > 1.65) = 10%
○ P(Z > 1.96) = 2.5% ⇔ P(|Z| > 1.96) = 5%
○ P(Z > 2.58) = 0.5% ⇔ P(|Z| > 2.58) = 1%
④ 68 - 95 - 99.7 rule
○ μ ± 1 × σ : 68.27 %
○ μ ± 2 × σ : 95.45 %
○ μ ± 3 × σ : 99.73 %
⑵ case 1. when Xi ~ N(μ, σ²) and the population variance σ² is known
① overview: a normal distribution is used
② method
○ introduction: when μ is known, Xavg satisfies the following with probability 1 − α
○ P(μ − zα/2 · σ/√n ≤ Xavg ≤ μ + zα/2 · σ/√n) = 1 − α
○ change of ideas: Xavg ∈ I(μ) ⇔ μ ∈ I(Xavg) (confidence level: 1 − α)
○ meaning : it gives a probability statement about μ once Xavg is known
○ note that the interval for μ mirrors the probability distribution of Xavg when μ is known
○ draw your own picture to confirm


○ pivotal estimation : for the shortest confidence interval, it should hold that |a| = |b|, i.e. a = −zα/2, b = zα/2. the proof is omitted here
○ μ ∈ [Xavg − zα/2 · σ/√n, Xavg + zα/2 · σ/√n] (see the sketch below)
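A minimal sketch of this interval in code, assuming numpy and scipy; μ, σ, n, and α are illustrative values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
mu, sigma, n, alpha = 10.0, 3.0, 40, 0.05

x = rng.normal(mu, sigma, n)
z = norm.ppf(1 - alpha / 2)              # z(alpha/2), e.g. 1.96 for alpha = 0.05
half = z * sigma / np.sqrt(n)            # sigma is known in case 1
print(f"{1 - alpha:.0%} CI for mu: [{x.mean() - half:.3f}, {x.mean() + half:.3f}]")
```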
③ when the distribution function is known
○ example 1. F(x) = √x / θ, 0 ≤ x ≤ θ² : the 90% confidence interval is as follows

○ example 2. F(x) = (x / θ)ⁿ : the 90% confidence interval is as follows

⑶ case 2. when Xi ~ N(μ, σ²) and the population variance σ² is unknown
① overview
○ the normal-distribution interval requires the population variance to be known
○ in reality, the sample variance is used because the population variance is unknown
○ when the sample variance replaces the population variance, the statistic (Xavg − μ)/(S/√n) follows exactly a t distribution with n − 1 degrees of freedom
② example 1. sample mean
○ T = (Xavg − μ)/(S/√n) ~ t(n − 1)
○ introduction: when μ is known, Xavg satisfies the following with probability 1 − α
○ P(μ − tα/2,n−1 · S/√n ≤ Xavg ≤ μ + tα/2,n−1 · S/√n) = 1 − α
○ change of ideas : Xavg ∈ I*(μ) ⇔ μ ∈ I*(Xavg) (confidence level : 1 − α)

○ pivotal estimation: for the shortest confidence interval, it should hold that |a| = |b|, i.e. a = −tα/2, b = tα/2. the proof is omitted here
○ μ ∈ [Xavg − tα/2,n−1 · S/√n, Xavg + tα/2,n−1 · S/√n] (see the sketch below)
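The same interval with the sample standard deviation and the t distribution; a minimal sketch assuming numpy and scipy, with illustrative values:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
mu, n, alpha = 10.0, 15, 0.05

x = rng.normal(mu, 3.0, n)
tcrit = t.ppf(1 - alpha / 2, df=n - 1)     # t(alpha/2, n-1)
half = tcrit * x.std(ddof=1) / np.sqrt(n)  # S: sample standard deviation
print(f"{1 - alpha:.0%} CI for mu: [{x.mean() - half:.3f}, {x.mean() + half:.3f}]")
```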
③ example 2. (case 1) when Xi ~ (μX, σ²) (i = 1, ···, n) and Yi ~ (μY, σ²) (i = 1, ···, n) are paired
○ also called paired estimation (matched sample estimation)
○ in fact, there is only one variable: define Wi = Xi − Yi and apply example 1 to the Wi

○ an example of a paired sample

○ an example of an independent sample

④ example 3. (case 2) difference of two sample means : when Xi ~ (μX, σ²) (i = 1, ···, n) and Yj ~ (μY, σ²) (j = 1, ···, m) are independent
○ used when the variances of the two populations are equal in unpaired sample estimation (pooled sample estimation)
○ formula
○ Sp² = ((n − 1)SX² + (m − 1)SY²) / (n + m − 2), and T = ((Xavg − Yavg) − (μX − μY)) / (Sp √(1/n + 1/m)) ~ t(n + m − 2)
○ confidence interval for confidence level 1 − α
○ μX − μY ∈ [(Xavg − Yavg) − tα/2,n+m−2 · Sp √(1/n + 1/m), (Xavg − Yavg) + tα/2,n+m−2 · Sp √(1/n + 1/m)]
⑤ example 4. (case 3) difference of sample means: when Xi ~ (μX, σX²) (i = 1, ···, n) and Yj ~ (μY, σY²) (j = 1, ···, m) are independent (assuming σX ≠ σY)
○ used when the variances of the two populations differ in unpaired sample estimation (unpooled, separate-variance estimation)
○ the Welch approach is used
○ T = ((Xavg − Yavg) − (μX − μY)) / √(SX²/n + SY²/m) ~ approximately t(ν)
○ the degrees of freedom in (case 3) are lower than in (case 2) → the power of the test decreases
○ the formula for ν (Welch-Satterthwaite) is complex (a sketch computing it follows below)
○ ν = (SX²/n + SY²/m)² / [(SX²/n)²/(n − 1) + (SY²/m)²/(m − 1)]
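A minimal sketch computing ν, assuming numpy; the two samples are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, 12)
y = rng.normal(0.5, 3.0, 30)             # unequal variances and sample sizes

n, m = len(x), len(y)
vx, vy = x.var(ddof=1) / n, y.var(ddof=1) / m

# Welch-Satterthwaite degrees of freedom
nu = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
print(f"Welch nu = {nu:.2f}  (pooled df would be n + m - 2 = {n + m - 2})")
```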
⑥ example 5. confidence interval of population variance
○ note that there is no closed-form solution for minimizing the length of the confidence interval: numerical analysis should be used
○ the given model
○ (n − 1)S²/σ² ~ χ²(n − 1)
○ confidence interval for confidence level 1 − α
○ σ² ∈ [(n − 1)S²/χ²α/2,n−1, (n − 1)S²/χ²1−α/2,n−1]
⑦ example 6. ratio of population variances
○ F = (SX²/σX²) / (SY²/σY²) ~ F(n − 1, m − 1)
○ confidence interval for confidence level 1 − α
○ σX²/σY² ∈ [(SX²/SY²) / Fα/2(n−1, m−1), (SX²/SY²) / F1−α/2(n−1, m−1)]
⑷ case 3. when samples do not follow normal distribution, but there are many samples
① central limit theorem : if n is large enough, the distribution of the sample mean converges to a normal distribution
○ formula
○ (Xavg − μ)/(σ/√n) → N(0, 1) in distribution as n → ∞
○ the t distribution likewise converges to the normal distribution
② number of samples
○ approximate normality is typically achieved with only 25 ~ 30 samples
○ for a symmetric unimodal distribution (having one mode), n = 5 can be sufficient
③ example 1. population ratio
○ given model
○ p̂ = X/n with X ~ B(n, p); for large n, p̂ is approximately N(p, p(1 − p)/n)
○ confidence interval for confidence level 1 − α
○ p ∈ [p̂ − zα/2 √(p̂(1 − p̂)/n), p̂ + zα/2 √(p̂(1 − p̂)/n)] (see the sketch below)
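A minimal sketch of the proportion interval, assuming numpy and scipy; the counts are illustrative:

```python
import numpy as np
from scipy.stats import norm

successes, n, alpha = 420, 1000, 0.05
p_hat = successes / n

z = norm.ppf(1 - alpha / 2)
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"{1 - alpha:.0%} CI for p: [{p_hat - half:.4f}, {p_hat + half:.4f}]")
```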
④ example 2. correlation coefficient
○ null hypothesis H0 : correlation coefficient = 0
○ alternative hypothesis H1 : correlation coefficient ≠ 0
○ calculation of the t statistic: for the correlation coefficient r obtained from the sample,
○ t = r √(n − 2) / √(1 − r²)
○ the above statistic follows the Student t distribution with n − 2 degrees of freedom (where n is the number of samples), as the sketch below illustrates
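A minimal sketch of this test, assuming numpy and scipy; the simulated data are illustrative:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(9)
n = 50
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)           # correlated by construction

r = np.corrcoef(x, y)[0, 1]
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * t.sf(abs(t_stat), df=n - 2)  # two-sided test of rho = 0
print(f"r = {r:.3f}, t = {t_stat:.3f}, p = {p_value:.4f}")
```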
Input : 2019.06.19 14:23