Chapter 20. Information Theory
1. Information Theory
⑴ Overview
① low probability event: high information (surprising)
② high probability event: low information (unsurprising)
⑵ entropy
① Problem situation: X is Raining / Not Raining. Y is Cloudy / Not Cloudy
Figure. 1. Entropy example
② Basic concepts of statistics
③ information: what is gained when the outcome becomes known
④ uncertainty (surprisal, unexpectedness, randomness)
○ Definition 1. Absence of information about the outcome
○ Definition 2. Number of bits required to represent the given number of cases
○ When the base is 2, the unit used is bits
○ When the base is e (natural logarithm), the unit used is nats
⑤ entropy (average self-information)
○ Understanding
○ In other words, entropy is defined as the average value of uncertainty
○ Similar to entropy in physics, higher values are obtained as disorder increases: higher entropy means higher heterogeneity
○ It weights the surprisal (-log p(x)) of each event by that event's probability p(x): H(X) = -Σ p(x) log p(x)
○ Examples
○ Fair coin toss: H = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit
○ Fair die: H = 6 × (-(1/6) log2 (1/6)) = log2 6 ≈ 2.58 bits
○ If X follows a uniform distribution, H(X) = log n holds
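These calculations can be sketched in a few lines of Python (the function name `entropy` is mine, not from the text):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x), with 0 log 0 := 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1 bit
print(entropy([1/6] * 6))    # fair die: log2(6) ≈ 2.58 bits
```

For any uniform distribution over n outcomes the result matches log2 n, as stated above.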
○ cross entropy
○ Definition: H(p, q) = -Σ p(x) log q(x), the average number of bits required to encode events drawn from p using a code optimized for q
○ In machine learning, cross entropy is used as a loss function comparing the model's predicted distribution with the true label distribution
○ Feature 1. H(X) is shift-invariant
○ When Y = X + a, pY(y) = pX(x) for corresponding values, so H(Y) = H(X)
○ Feature 2. H(X) ≥ 0
○ Feature 3. Hb(X) = (logb a) Ha(X) (change of logarithm base)
○ This follows since logb p = (logb a) loga p
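The cross-entropy definition above can be sketched as follows (a minimal illustration; the distributions `p` and `q` are made-up examples):

```python
import math

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum p(x) log q(x): average code length when events
    drawn from p are encoded with a code that is optimal for q."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # hypothetical true distribution
q = [0.9, 0.1]   # hypothetical mismatched model
# H(p, q) >= H(p, p) = H(p), with equality only when q matches p
```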
⑥ joint entropy
○ Feature 1. If X and Y are independent, H(X, Y) = H(X) + H(Y)
⑦ Conditional entropy
○ Conditional entropy for a specific condition
○ Total conditional entropy
○ Information gain (IG): the reduction in entropy of the target obtained by knowing X: IG = H(Y) - H(Y | X)
○ Information bottleneck method (IB) : Describing the trade-off between complexity and accuracy using information theory
○ Application : For example, it can determine the number of clusters in given data.
○ Feature 1. If X and Y are independent, H(Y | X) = H(Y)
○ Feature 2. H(Y | Y) = 0
○ Feature 3. H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y) (Chain rule)
○ Feature 4. H(X, Y | Z) = H(X | Z) + H(Y | X, Z)
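The chain rule in Feature 3 can be checked numerically on a small joint distribution (the rain/cloud probabilities below are hypothetical, chosen only for illustration):

```python
import math
from collections import Counter

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) for X = rain, Y = cloud
joint = {("rain", "cloudy"): 0.30, ("rain", "clear"): 0.05,
         ("no rain", "cloudy"): 0.25, ("no rain", "clear"): 0.40}

px, py = Counter(), Counter()
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

H_xy, H_x = H(joint.values()), H(px.values())

# Total conditional entropy H(Y | X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(
    p_x * H([joint[x, y] / p_x for y in py if (x, y) in joint])
    for x, p_x in px.items()
)
# Chain rule: H(X, Y) = H(X) + H(Y | X)
```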
⑧ Relative entropy (Kullback Leibler distance, KL distance)
○ Overview
○ A measure of how two probability distributions diverge.
○ It indicates the divergence between an approximate distribution p(x), which aims to mimic the true distribution, and the true distribution q(x).
○ Typically, the true distribution is a complex probability distribution function, while the approximate distribution is represented using a parameterized model.
○ The approximate distribution can be made manageable, often resembling a simple distribution like the normal distribution.
○ Mutual information
○ Information gain and mutual information are equivalent.
○ Mutual information can be used for assessing image similarity.
○ Feature 1. asymmetry: DKL(p(x) || q(x)) ≠ DKL(q(x) || p(x))
○ Feature 2. D(p || q) ≥ 0 (equality condition: p(x) = q(x))
○ Feature 3. Mutual information ≥ 0 (equality condition: X and Y are independent)
○ Feature 4. H(X) ≤ log n (where n is the number of elements X can take)
○ Feature 5. H(X | Y) ≤ H(X) (equality condition: X and Y are independent)
○ Application. Determining the predictive data distribution p with the smallest KL distance to the true unknown data distribution q.
○ Maximizing log-likelihood minimizes KL distance.
⑨ Asymptotic equipartition property (AEP)
⑶ Comparison 1. Gini impurity
① Entropy is not the only such measure; alternatives such as Gini impurity have been proposed.
② Gini impurity weights the mislabeling probability (1 - P(xi)) of each class by that class's proportion P(xi): Gini = Σ P(xi)(1 - P(xi)) = 1 - Σ P(xi)².
③ Both Gini impurity and entropy increase as the given data becomes more disordered, hence the name "impurity".
④ (Conceptual Distinction) The Gini Coefficient in Economics
○ The concept was first introduced to evaluate income inequality: It can assess not only income but also the degree of inequality in the distribution of a discrete variable.
○ As the Gini coefficient moved from economics into machine learning and information theory, it kept the intuition of measuring inequality, but its mathematical definition changed.
○ Difference: If the distribution of the variable is entirely uniform,
○ In economics, the Gini coefficient becomes extremely egalitarian, so it has a value of 0.
○ In information theory, the Gini coefficient attains its maximum value due to the Cauchy-Schwarz inequality. Intuitively, it becomes most disordered, thus reaching the maximum value.
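The two impurity measures can be compared side by side (a minimal sketch; both are zero for a pure node and maximal for a uniform split):

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini_impurity(probs):
    """Gini = sum p_i (1 - p_i) = 1 - sum p_i^2."""
    return 1.0 - sum(p * p for p in probs)

pure, uniform = [1.0, 0.0], [0.5, 0.5]
# as disorder grows: entropy goes 0 -> 1 bit, Gini goes 0 -> 0.5
```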
⑷ Comparison 2. Connectivity index (CI)
① In a spatial lattice, if a specific spot and its nearest neighbor are in different clusters, the CI value increases.
② Therefore, a higher CI value implies higher heterogeneity.
⑸ Comparison 3. Simpson index
① Another measure based on information theory, along with Shannon entropy
② Based on spatial transcriptomics, the probability of two random spots belonging to the same cluster (ref)
③ The lower the Simpson index, the higher the heterogeneity implied
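Taking the with-replacement form of the definition above (the probability that two randomly drawn spots share a cluster label), a sketch:

```python
from collections import Counter

def simpson_index(labels):
    """Probability (drawing with replacement) that two randomly chosen
    spots carry the same cluster label: sum of squared proportions."""
    n = len(labels)
    return sum((c / n) ** 2 for c in Counter(labels).values())

homogeneous = ["c1"] * 8                      # one cluster: index = 1
heterogeneous = ["c1", "c2", "c3", "c4"] * 2  # four equal clusters: index = 0.25
```

A lower value means the spots are spread across more clusters, i.e. higher heterogeneity, matching the statement above.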
⑹ Comparison 4. Cluster modularity
① Based on spatial transcriptomics, a metric indicating the purity (homogeneity) of spots within a cluster (ref)
⑺ Comparison 5. Lempel-Ziv complexity
⑻ Comparison 6. Silhouette coefficient
⑼ Comparison 7. Calinski-Harabasz index
⑽ Application 1. Markov process
① two-state Markov chain
Figure. 2. two-state Markov chain
② Hidden Markov model
○ If χ = {Xi} is a Markov process and Yi = ϕ(Xi) (where ϕ is a deterministic function), then y = {Yi} is a hidden Markov model.
③ Feature 1. If Pr(xn | xn-1) is independent of n, the Markov process is stationary (time-invariant).
④ Feature 2. A chain that is both irreducible and aperiodic is ergodic.
⑤ Feature 3. Markov processes yield a version of the second law of thermodynamics (law of increasing entropy): the relative entropy to the stationary distribution never increases over time.
○ Entropy H(Xn) itself is non-decreasing only when the stationary distribution is uniform.
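For a concrete two-state chain (the transition probabilities below are hypothetical), the stationary distribution and the entropy rate H(X2 | X1) can be computed as:

```python
import math

# Hypothetical transition matrix: P[i][j] = Pr(X_{n+1} = j | X_n = i)
P = [[0.7, 0.3],
     [0.4, 0.6]]

# For a two-state chain with a = P[0][1], b = P[1][0],
# the stationary distribution is mu = (b, a) / (a + b)
a, b = P[0][1], P[1][0]
mu = [b / (a + b), a / (a + b)]

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Entropy rate of a stationary Markov chain: sum_i mu_i * H(row_i)
entropy_rate = sum(m * H(row) for m, row in zip(mu, P))
```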
2. Modern Information Theory
⑴ History
① Ludwig Boltzmann (1872) : Proposed the H-theorem to explain the entropy of gas particles.
② Harry Nyquist (1924) : Quantified the speed at which “intelligence” is propagated through a communication system.
③ John von Neumann (1927): Extended the Gibbs entropy to quantum mechanics (von Neumann entropy).
④ Ralph Hartley (1928): Expressed the number of distinguishable information by using logarithms.
⑤ Alan Turing (1940): Introduced deciban for decrypting the German Enigma cipher.
⑥ Richard W. Hamming (1947): Developed Hamming code.
⑦ Claude E. Shannon (1948): Pioneered information theory. A Mathematical Theory of Communication
⑧ Solomon Kullback & Richard Leibler (1951): Introduced Kullback-Leibler divergence
⑨ David A. Huffman (1951): Developed Huffman encoding.
⑩ Alexis Hocquenghem & Raj Chandra Bose & Dwijendra Kumar Ray-Chaudhuri (1960): Developed BCH code.
⑪ Irving S. Reed & Gustave Solomon (1960): Developed Reed-Solomon code.
⑫ Robert G. Gallager (1962): Introduced low-density parity check.
⑬ Kolmogorov (1963): Introduced Kolmogorov complexity (minimum description length).
⑭ Abraham Lempel & Jacob Ziv (1977): Developed Lempel-Ziv compression (LZ77).
⑮ Claude Berrou & Alain Glavieux & Punya Thitimajshima (1993): Developed Turbo code.
⑯ Erdal Arıkan (2008): Introduced polar code.
⑵ Type 1. Source coding
① Huffman code
② Shannon code
③ Shannon-Fano-Elias code (alphabetic code)
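As a sketch of how Huffman coding assigns codeword lengths (this computes only the lengths, not the bit strings themselves; the symbol weights are a made-up dyadic example):

```python
import heapq

def huffman_lengths(freqs):
    """Codeword lengths of a binary Huffman code for {symbol: weight}.
    Each merge of the two lightest subtrees adds one bit to every
    codeword inside them."""
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

lengths = huffman_lengths({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
# for dyadic weights, each length equals -log2(weight),
# so the average codeword length equals the entropy (1.75 bits here)
```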
⑶ Type 2. Compression
① Uniquely decodable (UD)
② Kraft inequality
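The Kraft inequality can be checked mechanically for a set of codeword lengths (a minimal sketch):

```python
def kraft_sum(lengths, D=2):
    """sum_i D^(-l_i); a uniquely decodable D-ary code with these
    codeword lengths exists iff this sum is <= 1."""
    return sum(D ** (-l) for l in lengths)

# Prefix code 0, 10, 110, 111 has lengths 1, 2, 3, 3 -> sum is exactly 1
# Lengths 1, 1, 2 violate the inequality: no UD binary code exists
```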
⑷ Type 3. Network information theory
⑸ Type 4. Quantum information theory
*Input: 2021-11-10 22:28*
*Modified: 2023-03-21 16:22*