Chapter 21. Information Theory
1. Information Theory
⑴ Overview
① low probability event: high information (surprising)
② high probability event: low information (unsurprising)
⑵ entropy
① Problem situation: X is Raining / Not Raining. Y is Cloudy / Not Cloudy
Figure 1. Entropy example
② Basic concepts of statistics
③ information: what is gained when the outcome of an event becomes known
④ uncertainty (surprisal, unexpectedness, randomness): I(x) = -log p(x), so rarer events carry more surprisal
○ When the base is 2, the unit used is bits
○ When the base is e (natural logarithm), the unit used is nats
⑤ entropy (self information, Shannon entropy)
○ Understanding: H(X) = -∑x p(x) log p(x) = 𝔼[-log p(X)]
○ In other words, entropy is defined as the average value of the uncertainty
○ As with entropy in physics, the value grows as disorder increases: higher entropy means higher heterogeneity
○ Each event's surprisal, -log p(x), is weighted by that event's probability p(x)
○ Examples
○ Fair coin toss : H = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
○ Fair die: H = 6 × (-(1/6) log2 (1/6)) = log2 6 ≈ 2.58
○ If X follows a uniform distribution over n outcomes, H(X) = log n holds
○ Source Coding Theorem
○ Definition: for a sufficiently long data sequence, the best achievable average number of bits per symbol after compression approaches the entropy.
○ In other words, entropy gives the number of bits required to represent a given set of possibilities.
○ With logarithm base 2, this is measured in bits; with base k, it gives the storage required in a k-ary system.
○ Kraft's inequality: a constraint on codeword lengths, ∑i 2^(-Li) ≤ 1, which guarantees that every symbol can be represented by a uniquely decodable (prefix-free) bit sequence.
○ Proof sketch of the Source Coding Theorem: minimize the average number of bits p1L1 + ··· + pnLn subject to Kraft's inequality; the optimal average length approaches H(X) (see the numerical check below).
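A minimal numerical check (a sketch; the four-symbol source and the prefix code 0, 10, 110, 111 below are hypothetical choices for illustration): it verifies Kraft's inequality and compares the average code length with the entropy.
import numpy as np

# Hypothetical source: four symbols with the given probabilities
p = np.array([0.5, 0.25, 0.125, 0.125])
# Codeword lengths of a prefix code for the four symbols: 0, 10, 110, 111
code_lengths = np.array([1, 2, 3, 3])

# Kraft's inequality: sum of 2^(-L_i) must not exceed 1 for a uniquely decodable code
kraft_sum = np.sum(2.0 ** -code_lengths)

# Source coding theorem: the average code length cannot fall below the entropy H(X)
avg_length = np.sum(p * code_lengths)
entropy = -np.sum(p * np.log2(p))

print(f"Kraft sum           : {kraft_sum}")   # 1.0 -> inequality satisfied
print(f"Average code length : {avg_length}")  # 1.75 bits
print(f"Entropy H(X)        : {entropy}")     # 1.75 bits (this code happens to be optimal)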
○ Feature 1. H(X) is shift-invariant
○ When Y = X + a, H(Y) = H(X) holds because pY(y) = pX(x)
○ Feature 2. H(X) ≥ 0
○ Feature 3. Hb(X) = (logb a) Ha(X) (where a and b are the logarithm bases)
○ ∵ logb p = (logb a) loga p
import numpy as np

def shannon_index(data):
    # Count the unique values and their occurrences in the data
    values, counts = np.unique(data, return_counts=True)
    # Convert counts to proportions
    proportions = counts / counts.sum()
    # Shannon index (entropy in nats); counts from np.unique are always positive,
    # so no zero proportions need to be filtered out
    return -np.sum(proportions * np.log(proportions))

# Example data: '1' appears 3 times, '2' appears 2 times, and '3' appears 1 time
data = [1, 1, 1, 2, 2, 3]
index = shannon_index(data)
print(f"Shannon index: {index}")
⑥ joint entropy: H(X, Y) = -∑x,y p(x, y) log p(x, y)
○ Feature 1. H(X, Y) ≤ H(X) + H(Y): If X and Y are independent, H(X, Y) = H(X) + H(Y)
⑦ Conditional entropy
Figure 2. Conditional entropy visualization
○ Conditional entropy for a specific condition: H(Y | X = x) = -∑y p(y | x) log p(y | x)
○ Total conditional entropy: H(Y | X) = ∑x p(x) H(Y | X = x)
○ Information gain (IG): the reduction in entropy (increase in order) obtained by knowing X, IG = H(Y) - H(Y | X) (see the sketch after this list)
○ Information bottleneck method (IB): describes the trade-off between complexity and accuracy using information theory
○ Application: for example, it can be used to determine the number of clusters in given data.
○ Feature 1. If X and Y are independent, H(Y | X) = H(Y) (cf. H(Y | X) ≤ H(Y))
○ Feature 2. H(Y | Y) = 0
○ Feature 3. H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y) (Chain rule)
○ Feature 4. H(X, Y | Z) = H(X | Z) + H(Y | X, Z)
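A minimal sketch of the conditional-entropy formulas above, assuming a hypothetical 2×2 joint distribution p(x, y) (the numbers are made up): it computes H(Y | X) as a probability-weighted average of per-condition entropies and checks the chain rule.
import numpy as np

# Hypothetical joint distribution p(x, y); rows: X = rain / no rain, columns: Y = cloudy / not cloudy
p_xy = np.array([[0.24, 0.01],
                 [0.25, 0.50]])

def H(p):
    # Shannon entropy in bits, ignoring zero-probability entries
    p = np.asarray(p, float).flatten()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# Total conditional entropy: H(Y|X) = sum_x p(x) * H(Y | X = x)
H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(f"H(X,Y)             = {H(p_xy):.4f}")
print(f"H(X) + H(Y|X)      = {H(p_x) + H_Y_given_X:.4f}")  # chain rule: identical
print(f"IG = H(Y) - H(Y|X) = {H(p_y) - H_Y_given_X:.4f}")  # information gain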
⑧ Mutual information
Figure 3. Mutual information visualization
○ Information gain and mutual information are equivalent.
○ Feature 1. I(X; Y) = H(Y) - H(Y | X) = H(X) - H(X | Y)
○ Feature 2. Mutual information ≥ 0 (equality condition: X and Y are independent)
○ Feature 3. H(X | Y) ≤ H(X) (equality condition: X and Y are independent)
○ Feature 4. I(X; Y) ≤ H(X, Y): H(X, Y) - I(X; Y) satisfies the definition of mathematical distance.
○ Application 1. Mutual information can be used for assessing image similarity.
○ Application 2. mRMR (maximum relevance minimum redundancy): A technique used to select features that have high mutual information with the output while having low mutual information among themselves. It is commonly used in machine learning.
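A minimal sketch (reusing the same style of hypothetical joint distribution) that computes mutual information in two equivalent ways, matching Feature 1, and illustrates that it is non-negative.
import numpy as np

p_xy = np.array([[0.24, 0.01],
                 [0.25, 0.50]])   # hypothetical joint distribution p(x, y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):
    p = np.asarray(p, float).flatten()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_from_entropies = H(p_x) + H(p_y) - H(p_xy)

# I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
outer = np.outer(p_x, p_y)
mask = p_xy > 0
I_direct = np.sum(p_xy[mask] * np.log2(p_xy[mask] / outer[mask]))

print(f"I(X;Y) via entropies      : {I_from_entropies:.4f}")
print(f"I(X;Y) via the definition : {I_direct:.4f}")  # >= 0, equal to 0 iff X and Y are independent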
⑨ Relative entropy (Kullback Leibler divergence, KL divergence, KLD)
○ Overview
○ A measure of how two probability distributions diverge.
○ It indicates the divergence between an approximate distribution p(x), which aims to mimic the true distribution, and the true distribution q(x).
○ Typically, the true distribution is a complex probability distribution function, while the approximate distribution is represented using a parameterized model.
○ The approximate distribution can be made manageable, often resembling a simple distribution like the normal distribution.
○ Strictly speaking, it does not satisfy the definition of mathematical distance.
○ Feature 1. asymmetry : DKL(p(x) || q(x)) ≠ DKL(q(x) || p(x)) (∴ violates the definition of mathematical distance)
○ Feature 2. D(p || q) ≥ 0 (equality condition: p(x) = q(x))
○ Feature 3. H(X) ≤ log n, where n is the number of values X can take (follows from D(p || uniform) ≥ 0)
○ Application. Determining the predictive data distribution p with the smallest KL distance to the true unknown data distribution q.
○ Maximizing log-likelihood minimizes KL distance.
○ Application. Jensen-Shannon divergence (JSD): a symmetrized, smoothed version of KL divergence, JSD(p, q) = (1/2) D(p || m) + (1/2) D(q || m) with m = (1/2)(p + q) (see the sketch below)
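A minimal sketch with two hypothetical discrete distributions p and q, illustrating non-negativity (Feature 2), asymmetry (Feature 1), and the symmetrized Jensen-Shannon divergence.
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats; assumes q_i > 0 wherever p_i > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])   # hypothetical approximate distribution
q = np.array([0.5, 0.3, 0.2])   # hypothetical true distribution

m = 0.5 * (p + q)
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)   # Jensen-Shannon divergence (symmetric in p and q)

print(f"D_KL(p||q) = {kl(p, q):.4f}")   # not equal to the reverse direction
print(f"D_KL(q||p) = {kl(q, p):.4f}")   # asymmetry
print(f"JSD(p, q)  = {jsd:.4f}")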
⑩ cross entropy
○ Definition: the average number of bits required to encode data drawn from p when using a code optimized for q: H(p, q) = -∑ p(x) log q(x)
○ In machine learning, p is the true (label) distribution and q is the model prediction, and the cross entropy between them is used as a loss.
○ Feature 1. -∑ pi log qi = H(p) + DKL(p || q)
○ Feature 2. Minimizing DKL(p || q) is the same as minimizing cross-entropy because H(p) is constant.
○ Feature 3. When pi is the empirical distribution (uniform over the observed samples), minimizing -∑ pi log qi is equivalent to maximum likelihood estimation (see the check below).
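A minimal sketch verifying Feature 1, -∑ pi log qi = H(p) + DKL(p || q), for a hypothetical one-hot label p and model prediction q.
import numpy as np

p = np.array([1.0, 0.0, 0.0])   # hypothetical one-hot true label distribution
q = np.array([0.7, 0.2, 0.1])   # hypothetical model prediction

def entropy(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Cross entropy decomposes into entropy plus KL divergence
print(f"H(p, q)           = {cross_entropy(p, q):.4f}")
print(f"H(p) + D_KL(p||q) = {entropy(p) + kl(p, q):.4f}")  # same value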
⑪ Asymptotic equipartition property (AEP)
⑶ Comparison 1. Rényi Entropy
① Hartley (max) entropy: for a source represented by M symbols (e.g., M = 26 for the alphabet), H0 = log M
② Min entropy: H∞ = -log maxi pi; related to signal processing
③ Rényi entropy: Hα = (1/(1-α)) log ∑i pi^α; encompasses Shannon entropy (α → 1, by L'Hôpital's rule), max entropy (α → 0), and min entropy (α → ∞) (see the sketch below)
④ Rényi divergence: as α → 1, Rényi divergence converges to KL divergence (∵ L'Hôpital's rule)
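A minimal sketch of the Rényi entropy for a hypothetical distribution, showing that α near 1 recovers the Shannon entropy while α → 0 and large α approach the Hartley and min entropies.
import numpy as np

def renyi_entropy(p, alpha):
    # H_alpha(X) = (1 / (1 - alpha)) * log( sum_i p_i^alpha ), for alpha != 1
    p = np.asarray(p, float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.3, 0.2])            # hypothetical distribution
hartley = np.log(len(p))                 # limit as alpha -> 0 (max entropy)
shannon = -np.sum(p * np.log(p))         # limit as alpha -> 1
min_entropy = -np.log(p.max())           # limit as alpha -> infinity

for alpha in [0.001, 0.5, 0.999, 2, 50]:
    print(f"alpha = {alpha:>6}: H_alpha = {renyi_entropy(p, alpha):.4f}")
print(f"Hartley (max) entropy: {hartley:.4f}")
print(f"Shannon entropy      : {shannon:.4f}")
print(f"Min entropy          : {min_entropy:.4f}")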
⑷ Comparison 2. Sample Entropy
① A dynamic measure that reflects time
② Important in signal processing
⑸ Comparison 3. Gini impurity
① Entropy is not the only such measure; alternative measures such as Gini impurity have been proposed.
② Gini impurity weights each class's misclassification probability (1 - P(xi)) by that class's probability P(xi): G = ∑i P(xi)(1 - P(xi)) = 1 - ∑i P(xi)^2 (see the sketch after this list).
③ Both Gini impurity and entropy increase as the given data become more disordered, which is the origin of the term "impurity".
④ (Conceptual Distinction) The Gini Coefficient in Economics
○ The concept was first introduced to evaluate income inequality: It can assess not only income but also the degree of inequality in the distribution of a discrete variable.
○ As the Gini coefficient moved from economics into machine learning and information theory, the notion of inequality it conveys stayed similar, but its mathematical definition changed.
○ Difference: if the distribution of the variable is completely uniform,
○ in economics, the Gini coefficient indicates perfect equality, so its value is 0;
○ in information theory, the Gini impurity attains its maximum value, 1 - 1/n, by the Cauchy-Schwarz inequality (∑ pi^2 ≥ 1/n); intuitively, the uniform case is the most disordered.
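A minimal sketch comparing Gini impurity, G = 1 - ∑i P(xi)^2, with Shannon entropy on a few hypothetical two-class distributions: both are 0 for a pure split and maximal for the uniform split.
import numpy as np

def gini_impurity(p):
    # G = sum_i p_i * (1 - p_i) = 1 - sum_i p_i^2
    p = np.asarray(p, float)
    return 1.0 - np.sum(p ** 2)

def shannon_entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical class distributions, from perfectly pure to perfectly uniform
for p in [[1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [0.5, 0.5]]:
    print(f"p = {p}: Gini = {gini_impurity(p):.3f}, entropy = {shannon_entropy(p):.3f}")
# The uniform distribution gives the maxima: Gini = 1 - 1/n, entropy = log2(n)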
⑹ Comparison 4. Connectivity index (CI)
① In a spatial lattice, if a specific spot and its nearest neighbor are in different clusters, the CI value increases.
② Therefore, a higher CI value implies higher heterogeneity.
⑺ Comparison 5. Simpson index
① Another measure based on information theory, along with Shannon entropy
② Based on spatial transcriptomics, the probability of two random spots belonging to the same cluster (ref)
③ The lower the Simpson index, the higher the heterogeneity implied
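A minimal sketch of the Simpson index for hypothetical cluster labels, using the with-replacement form ∑i pi^2 (the probability that two randomly drawn spots share a cluster): lower values mean higher heterogeneity.
import numpy as np

def simpson_index(labels):
    # Probability that two randomly drawn spots (with replacement) belong to the same cluster
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

# Hypothetical cluster assignments of spots
homogeneous   = [0, 0, 0, 0, 0, 0]    # one cluster         -> Simpson index = 1.0
heterogeneous = [0, 1, 2, 3, 4, 5]    # all clusters differ -> Simpson index = 1/6

print(f"Simpson index (homogeneous)  : {simpson_index(homogeneous):.3f}")
print(f"Simpson index (heterogeneous): {simpson_index(heterogeneous):.3f}")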
⑻ Comparison 6. Cluster modularity
① Based on spatial transcriptomics, a metric indicating the purity (homogeneity) of spots within a cluster (ref)
⑼ Comparison 7. Lempel-Ziv complexity
⑽ Comparison 8. Silhouette coefficient
⑾ Comparison 9. Calinski-Harabasz index
2. Variational Inference
⑴ Overview
① The probability that B was the cause when A was observed, denoted as P(B | A), is called the posterior probability (cf. Bayes’ theorem).
② Variational Inference: A method to infer a complex posterior distribution by introducing a simpler variational distribution and conducting inference in a way that reduces the difference between the two.
⑵ ELBO (Evidence Lower Bound): The standard method for variational inference
① Jensen’s inequality: for a concave function f, such as the logarithm, f(𝔼[X]) ≥ 𝔼[f(X)]
② Problem situation: It is difficult to directly calculate the log evidence of a model, log p(x), for the given data x and latent variables z.
③ Solution: maximize the ELBO, a tractable lower bound on log p(x)
○ ELBO = 𝔼q(z) [log p(x, z)] - 𝔼q(z) [log q(z)]
○ 𝔼q(z) [log p(x, z)]: The expected value of the joint probability of the model calculated with respect to the variational distribution. It approximates the joint probability of the data x and the latent variables z using the variational distribution.
○ ELBO provides a lower bound for log p(x), the quantity we want to compute.
○ In variational inference, we approximate the complex posterior distribution p(z | x) by maximizing the ELBO (see the numerical check below).
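A minimal numerical check of the bound (a sketch; the two-state latent variable z, prior, and likelihood below are hypothetical): for any variational q(z), ELBO ≤ log p(x), with equality when q(z) equals the true posterior p(z | x).
import numpy as np

# Hypothetical discrete model for a single observation x: prior p(z) and likelihood p(x | z)
p_z = np.array([0.6, 0.4])           # prior over two latent states
p_x_given_z = np.array([0.2, 0.9])   # likelihood of the observed x under each state

p_xz = p_z * p_x_given_z             # joint p(x, z)
log_evidence = np.log(p_xz.sum())    # log p(x); usually intractable in realistic models
posterior = p_xz / p_xz.sum()        # true posterior p(z | x)

def elbo(q):
    # ELBO = E_q[log p(x, z)] - E_q[log q(z)]
    q = np.asarray(q, float)
    return np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

print(f"log p(x)             = {log_evidence:.4f}")
print(f"ELBO with a rough q  = {elbo([0.5, 0.5]):.4f}")   # strictly below log p(x)
print(f"ELBO with q = p(z|x) = {elbo(posterior):.4f}")    # equals log p(x)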
④ Example
3. Modern Information Theory
⑴ History
① Ludwig Boltzmann (1872) : Proposed the H-theorem to explain the entropy of gas particles.
② Harry Nyquist (1924) : Quantified the speed at which “intelligence” is propagated through a communication system.
③ John von Neumann (1927): Extended the Gibbs entropy of statistical mechanics to quantum mechanics (von Neumann entropy).
④ Ralph Hartley (1928): Expressed the number of distinguishable information by using logarithms.
⑤ Alan Turing (1940): Introduced deciban for decrypting the German Enigma cipher.
⑥ Richard W. Hamming (1947): Developed Hamming code.
⑦ Claude E. Shannon (1948): Pioneered information theory. A Mathematical Theory of Communication
⑧ Solomon Kullback & Richard Leibler (1951): Introduced Kullback-Leibler divergence
⑨ David A. Huffman (1951): Developed Huffman encoding.
⑩ Alexis Hocquenghem & Raj Chandra Bose & Dwijendra Kumar Ray-Chaudhuri (1960): Developed BCH code.
⑪ Irving S. Reed & Gustave Solomon (1960): Developed Reed-Solomon code.
⑫ Robert G. Gallager (1962): Introduced low-density parity check.
⑬ Andrey Kolmogorov (1963): Introduced Kolmogorov complexity (the length of the shortest program that produces a given object).
⑭ Abraham Lempel & Jacob Ziv (1977): Developed Lempel-Ziv compression (LZ77).
⑮ Claude Berrou & Alain Glavieux & Punya Thitimajshima (1993): Developed Turbo code.
⑯ Erdal Arıkan (2008): Introduced polar code.
⑵ Type 1. Coding (encoding)
① Huffman code
② Shannon code
③ Shannon-Fano-Elias code (alphabetic code)
⑶ Type 2. Compression
① Uniquely decodable (UD)
② Kraft inequality
⑷ Type 3. Network information theory
⑸ Type 4. Quantum information theory