Chapter 21. Information Theory
1. Information Theory
⑴ Overview
① low probability event: high information (surprising)
② high probability event: low information (unsurprising)
⑵ entropy
① Problem situation: X is Raining / Not Raining. Y is Cloudy / Not Cloudy
Figure 1. Entropy example
② Basic concepts of statistics
③ information: what is gained when the outcome of an event becomes known
④ uncertainty (surprisal, unexpectedness, randomness): I(x) = -log p(x), so rarer events carry more surprisal
○ When the base is 2, the unit used is bits
○ When the base is e (natural logarithm), the unit used is nats
⑤ entropy (self information, Shannon entropy)
○ Understanding: H(X) = -∑x p(x) log p(x) = 𝔼[-log p(X)]
○ In other words, entropy is defined as the average value of the uncertainty
○ As with entropy in physics, the value grows as disorder increases: higher entropy means higher heterogeneity
○ Each event's surprisal, -log p(x), is weighted by that event's probability p(x)
○ Examples
○ Fair coin toss : H = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
○ Fair die: H = 6 × (-(1/6) log2 (1/6)) = log2 6 ≈ 2.58
○ If X follows a uniform distribution over n outcomes, H(X) = log n holds
○ Source Coding Theorem
○ Definition: for a sufficiently long data sequence, the best achievable average number of bits per symbol after compression approaches the entropy.
○ In other words, entropy gives the number of bits required to represent a given set of possibilities.
○ With logarithm base 2, this is measured in bits; with base k, it gives the storage required in a k-ary system.
○ Kraft's inequality: a constraint on codeword lengths, ∑i 2^(-Li) ≤ 1, which guarantees that every symbol can be represented by a uniquely decodable (prefix-free) bit sequence.
○ Proof sketch of the Source Coding Theorem: minimize the average number of bits p1L1 + ··· + pnLn subject to Kraft's inequality; the optimal average length approaches H(X) (see the numerical check below).
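A minimal numerical check (a sketch; the four-symbol source and the prefix code 0, 10, 110, 111 below are hypothetical choices for illustration): it verifies Kraft's inequality and compares the average code length with the entropy.
import numpy as np

# Hypothetical source: four symbols with the given probabilities
p = np.array([0.5, 0.25, 0.125, 0.125])
# Codeword lengths of a prefix code for the four symbols: 0, 10, 110, 111
code_lengths = np.array([1, 2, 3, 3])

# Kraft's inequality: sum of 2^(-L_i) must not exceed 1 for a uniquely decodable code
kraft_sum = np.sum(2.0 ** -code_lengths)

# Source coding theorem: the average code length cannot fall below the entropy H(X)
avg_length = np.sum(p * code_lengths)
entropy = -np.sum(p * np.log2(p))

print(f"Kraft sum           : {kraft_sum}")   # 1.0 -> inequality satisfied
print(f"Average code length : {avg_length}")  # 1.75 bits
print(f"Entropy H(X)        : {entropy}")     # 1.75 bits (this code happens to be optimal)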
○ Feature 1. H(X) is shift-invariant
○ When Y = X + a, H(Y) = H(X) holds because pY(y) = pX(x)
○ Feature 2. H(X) ≥ 0
○ Feature 3. Hb(X) = (logb a) Ha(X) (where a and b are the logarithm bases)
○ ∵ logb p = (logb a) loga p
import numpy as np

def shannon_index(data):
    # Count the unique values and their occurrences in the data
    values, counts = np.unique(data, return_counts=True)
    # Convert counts to proportions
    proportions = counts / counts.sum()
    # Shannon index (entropy in nats); counts from np.unique are always positive,
    # so no zero proportions need to be filtered out
    return -np.sum(proportions * np.log(proportions))

# Example data: '1' appears 3 times, '2' appears 2 times, and '3' appears 1 time
data = [1, 1, 1, 2, 2, 3]
index = shannon_index(data)
print(f"Shannon index: {index}")
⑥ joint entropy: H(X, Y) = -∑x,y p(x, y) log p(x, y)
○ Feature 1. H(X, Y) ≤ H(X) + H(Y): If X and Y are independent, H(X, Y) = H(X) + H(Y)
⑦ Conditional entropy
Figure 2. Conditional entropy visualization
○ Conditional entropy for a specific condition: H(Y | X = x) = -∑y p(y | x) log p(y | x)
○ Total conditional entropy: H(Y | X) = ∑x p(x) H(Y | X = x)
○ Information gain (IG): the reduction in entropy (increase in order) obtained by knowing X, IG = H(Y) - H(Y | X) (see the sketch after this list)
○ Information bottleneck method (IB): describes the trade-off between complexity and accuracy using information theory
○ Application: for example, it can be used to determine the number of clusters in given data.
○ Feature 1. If X and Y are independent, H(Y | X) = H(Y) (cf. H(Y | X) ≤ H(Y))
○ Feature 2. H(Y | Y) = 0
○ Feature 3. H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y) (Chain rule)
○ Feature 4. H(X, Y | Z) = H(X | Z) + H(Y | X, Z)
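A minimal sketch of the conditional-entropy formulas above, assuming a hypothetical 2×2 joint distribution p(x, y) (the numbers are made up): it computes H(Y | X) as a probability-weighted average of per-condition entropies and checks the chain rule.
import numpy as np

# Hypothetical joint distribution p(x, y); rows: X = rain / no rain, columns: Y = cloudy / not cloudy
p_xy = np.array([[0.24, 0.01],
                 [0.25, 0.50]])

def H(p):
    # Shannon entropy in bits, ignoring zero-probability entries
    p = np.asarray(p, float).flatten()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# Total conditional entropy: H(Y|X) = sum_x p(x) * H(Y | X = x)
H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(f"H(X,Y)             = {H(p_xy):.4f}")
print(f"H(X) + H(Y|X)      = {H(p_x) + H_Y_given_X:.4f}")  # chain rule: identical
print(f"IG = H(Y) - H(Y|X) = {H(p_y) - H_Y_given_X:.4f}")  # information gain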
⑧ Mutual information
Figure 3. Mutual information visualization
○ Information gain and mutual information are equivalent.
○ Feature 1. I(X; Y) = H(Y) - H(Y | X) = H(X) - H(X | Y)
○ Feature 2. Mutual information ≥ 0 (equality condition: X and Y are independent)
○ Feature 3. H(X | Y) ≤ H(X) (equality condition: X and Y are independent)
○ Feature 4. I(X; Y) ≤ H(X, Y): H(X, Y) - I(X; Y) satisfies the definition of mathematical distance.
○ Application 1. Mutual information can be used for assessing image similarity.
○ Application 2. mRMR (maximum relevance minimum redundancy): A technique used to select features that have high mutual information with the output while having low mutual information among themselves. It is commonly used in machine learning.
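A minimal sketch (reusing the same style of hypothetical joint distribution) that computes mutual information in two equivalent ways, matching Feature 1, and illustrates that it is non-negative.
import numpy as np

p_xy = np.array([[0.24, 0.01],
                 [0.25, 0.50]])   # hypothetical joint distribution p(x, y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):
    p = np.asarray(p, float).flatten()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_from_entropies = H(p_x) + H(p_y) - H(p_xy)

# I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
outer = np.outer(p_x, p_y)
mask = p_xy > 0
I_direct = np.sum(p_xy[mask] * np.log2(p_xy[mask] / outer[mask]))

print(f"I(X;Y) via entropies      : {I_from_entropies:.4f}")
print(f"I(X;Y) via the definition : {I_direct:.4f}")  # >= 0, equal to 0 iff X and Y are independent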
⑨ Relative entropy (Kullback Leibler divergence, KL divergence, KLD)
○ Overview
○ A measure of how two probability distributions diverge.
○ It indicates the divergence between an approximate distribution p(x), which aims to mimic the true distribution, and the true distribution q(x).
○ Typically, the true distribution is a complex probability distribution function, while the approximate distribution is represented using a parameterized model.
○ The approximate distribution can be made manageable, often resembling a simple distribution like the normal distribution.
○ Strictly speaking, it does not satisfy the definition of mathematical distance.
○ Feature 1. asymmetry : DKL(p(x) || q(x)) ≠ DKL(q(x) || p(x)) (∴ violates the definition of mathematical distance)
○ Feature 2. D(p || q) ≥ 0 (equality condition: p(x) = q(x))
○ Feature 3. H(X) ≤ log n, where n is the number of values X can take (follows from D(p || uniform) ≥ 0)
○ Application. Determining the predictive data distribution p with the smallest KL distance to the true unknown data distribution q.
○ Maximizing log-likelihood minimizes KL distance.
○ Application. Jensen-Shannon divergence (JSD): a symmetrized, smoothed version of KL divergence, JSD(p, q) = (1/2) D(p || m) + (1/2) D(q || m) with m = (1/2)(p + q) (see the sketch below)
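A minimal sketch with two hypothetical discrete distributions p and q, illustrating non-negativity (Feature 2), asymmetry (Feature 1), and the symmetrized Jensen-Shannon divergence.
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats; assumes q_i > 0 wherever p_i > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])   # hypothetical approximate distribution
q = np.array([0.5, 0.3, 0.2])   # hypothetical true distribution

m = 0.5 * (p + q)
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)   # Jensen-Shannon divergence (symmetric in p and q)

print(f"D_KL(p||q) = {kl(p, q):.4f}")   # not equal to the reverse direction
print(f"D_KL(q||p) = {kl(q, p):.4f}")   # asymmetry
print(f"JSD(p, q)  = {jsd:.4f}")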
⑩ cross entropy
○ Definition: the average number of bits required to encode data drawn from p when using a code optimized for q: H(p, q) = -∑ p(x) log q(x)
○ In machine learning, p is the true (label) distribution and q is the model prediction, and the cross entropy between them is used as a loss.
○ Feature 1. -∑ pi log qi = H(p) + DKL(p || q)
○ Feature 2. Minimizing DKL(p || q) is the same as minimizing cross-entropy because H(p) is constant.
○ Feature 3. When pi is the empirical distribution (uniform over the observed samples), minimizing -∑ pi log qi is equivalent to maximum likelihood estimation (see the check below).
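A minimal sketch verifying Feature 1, -∑ pi log qi = H(p) + DKL(p || q), for a hypothetical one-hot label p and model prediction q.
import numpy as np

p = np.array([1.0, 0.0, 0.0])   # hypothetical one-hot true label distribution
q = np.array([0.7, 0.2, 0.1])   # hypothetical model prediction

def entropy(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Cross entropy decomposes into entropy plus KL divergence
print(f"H(p, q)           = {cross_entropy(p, q):.4f}")
print(f"H(p) + D_KL(p||q) = {entropy(p) + kl(p, q):.4f}")  # same value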
⑪ Asymptotic equipartition property (AEP)
⑶ Comparison 1. Rényi Entropy
① Hartley (max) entropy: for a source represented by M symbols (e.g., M = 26 for the alphabet), H0 = log M
② Min entropy: H∞ = -log maxi pi; related to signal processing
③ Rényi entropy: Hα = (1/(1-α)) log ∑i pi^α; encompasses Shannon entropy (α → 1, by L'Hôpital's rule), max entropy (α → 0), and min entropy (α → ∞) (see the sketch below)
④ Rényi divergence: as α → 1, Rényi divergence converges to KL divergence (∵ L'Hôpital's rule)
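A minimal sketch of the Rényi entropy for a hypothetical distribution, showing that α near 1 recovers the Shannon entropy while α → 0 and large α approach the Hartley and min entropies.
import numpy as np

def renyi_entropy(p, alpha):
    # H_alpha(X) = (1 / (1 - alpha)) * log( sum_i p_i^alpha ), for alpha != 1
    p = np.asarray(p, float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.3, 0.2])            # hypothetical distribution
hartley = np.log(len(p))                 # limit as alpha -> 0 (max entropy)
shannon = -np.sum(p * np.log(p))         # limit as alpha -> 1
min_entropy = -np.log(p.max())           # limit as alpha -> infinity

for alpha in [0.001, 0.5, 0.999, 2, 50]:
    print(f"alpha = {alpha:>6}: H_alpha = {renyi_entropy(p, alpha):.4f}")
print(f"Hartley (max) entropy: {hartley:.4f}")
print(f"Shannon entropy      : {shannon:.4f}")
print(f"Min entropy          : {min_entropy:.4f}")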
⑷ Comparison 2. Sample Entropy
① A dynamic measure that reflects time
② Important in signal processing
⑸ Comparison 3. Gini impurity
① Entropy is not the only such measure; alternative measures such as Gini impurity have been proposed.
② Gini impurity weights each class's misclassification probability (1 - P(xi)) by that class's probability P(xi): G = ∑i P(xi)(1 - P(xi)) = 1 - ∑i P(xi)^2 (see the sketch after this list).
③ Both Gini impurity and entropy increase as the given data become more disordered, which is the origin of the term "impurity".
④ (Conceptual Distinction) The Gini Coefficient in Economics
○ The concept was first introduced to evaluate income inequality: It can assess not only income but also the degree of inequality in the distribution of a discrete variable.
○ As the Gini coefficient moved from economics into machine learning and information theory, the notion of inequality it conveys stayed similar, but its mathematical definition changed.
○ Difference: if the distribution of the variable is completely uniform,
○ in economics, the Gini coefficient indicates perfect equality, so its value is 0;
○ in information theory, the Gini impurity attains its maximum value, 1 - 1/n, by the Cauchy-Schwarz inequality (∑ pi^2 ≥ 1/n); intuitively, the uniform case is the most disordered.
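A minimal sketch comparing Gini impurity, G = 1 - ∑i P(xi)^2, with Shannon entropy on a few hypothetical two-class distributions: both are 0 for a pure split and maximal for the uniform split.
import numpy as np

def gini_impurity(p):
    # G = sum_i p_i * (1 - p_i) = 1 - sum_i p_i^2
    p = np.asarray(p, float)
    return 1.0 - np.sum(p ** 2)

def shannon_entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical class distributions, from perfectly pure to perfectly uniform
for p in [[1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [0.5, 0.5]]:
    print(f"p = {p}: Gini = {gini_impurity(p):.3f}, entropy = {shannon_entropy(p):.3f}")
# The uniform distribution gives the maxima: Gini = 1 - 1/n, entropy = log2(n)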
⑹ Comparison 4. Connectivity index (CI)
① In a spatial lattice, if a specific spot and its nearest neighbor are in different clusters, the CI value increases.
② Therefore, a higher CI value implies higher heterogeneity.
⑺ Comparison 5. Simpson index
① Another measure based on information theory, along with Shannon entropy
② Based on spatial transcriptomics, the probability of two random spots belonging to the same cluster (ref)
③ The lower the Simpson index, the higher the heterogeneity implied
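A minimal sketch of the Simpson index for hypothetical cluster labels, using the with-replacement form ∑i pi^2 (the probability that two randomly drawn spots share a cluster): lower values mean higher heterogeneity.
import numpy as np

def simpson_index(labels):
    # Probability that two randomly drawn spots (with replacement) belong to the same cluster
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

# Hypothetical cluster assignments of spots
homogeneous   = [0, 0, 0, 0, 0, 0]    # one cluster         -> Simpson index = 1.0
heterogeneous = [0, 1, 2, 3, 4, 5]    # all clusters differ -> Simpson index = 1/6

print(f"Simpson index (homogeneous)  : {simpson_index(homogeneous):.3f}")
print(f"Simpson index (heterogeneous): {simpson_index(heterogeneous):.3f}")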
⑻ Comparison 6. Cluster modularity
① Based on spatial transcriptomics, a metric indicating the purity (homogeneity) of spots within a cluster (ref)
⑼ Comparison 7. Lempel-Ziv complexity
⑽ Comparison 8. Silhouette coefficient
⑾ Comparison 9. Calinski-Harabasz index
2. Variational Inference
⑴ Overview
① The probability that B was the cause when A was observed, denoted as P(B | A), is called the posterior probability (cf. Bayes’ theorem).
② Variational Inference: A method to infer a complex posterior distribution by introducing a simpler variational distribution and conducting inference in a way that reduces the difference between the two.
⑵ ELBO (Evidence Lower Bound): The standard method for variational inference
① Jensen’s inequality: for a concave function f, such as the logarithm, f(𝔼[X]) ≥ 𝔼[f(X)]
② Problem situation: It is difficult to directly calculate the log evidence of a model, log p(x), for the given data x and latent variables z.
③ Solution: maximize the ELBO, a tractable lower bound on log p(x)
○ ELBO = 𝔼q(z) [log p(x, z)] - 𝔼q(z) [log q(z)]
○ 𝔼q(z) [log p(x, z)]: The expected value of the joint probability of the model calculated with respect to the variational distribution. It approximates the joint probability of the data x and the latent variables z using the variational distribution.
○ ELBO provides a lower bound for log p(x), the quantity we want to compute.
○ In variational inference, we approximate the complex posterior distribution p(z | x) by maximizing the ELBO (see the numerical check below).
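A minimal numerical check of the bound (a sketch; the two-state latent variable z, prior, and likelihood below are hypothetical): for any variational q(z), ELBO ≤ log p(x), with equality when q(z) equals the true posterior p(z | x).
import numpy as np

# Hypothetical discrete model for a single observation x: prior p(z) and likelihood p(x | z)
p_z = np.array([0.6, 0.4])           # prior over two latent states
p_x_given_z = np.array([0.2, 0.9])   # likelihood of the observed x under each state

p_xz = p_z * p_x_given_z             # joint p(x, z)
log_evidence = np.log(p_xz.sum())    # log p(x); usually intractable in realistic models
posterior = p_xz / p_xz.sum()        # true posterior p(z | x)

def elbo(q):
    # ELBO = E_q[log p(x, z)] - E_q[log q(z)]
    q = np.asarray(q, float)
    return np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

print(f"log p(x)             = {log_evidence:.4f}")
print(f"ELBO with a rough q  = {elbo([0.5, 0.5]):.4f}")   # strictly below log p(x)
print(f"ELBO with q = p(z|x) = {elbo(posterior):.4f}")    # equals log p(x)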
④ Example
3. Modern Information Theory
⑴ History
① Ludwig Boltzmann (1872) : Proposed the H-theorem to explain the entropy of gas particles.
② Harry Nyquist (1924) : Quantified the speed at which “intelligence” is propagated through a communication system.
③ John von Neumann (1927): Extended the Gibbs entropy of statistical mechanics to quantum mechanics (von Neumann entropy).
④ Ralph Hartley (1928): Expressed the number of distinguishable information by using logarithms.
⑤ Alan Turing (1940): Introduced deciban for decrypting the German Enigma cipher.
⑥ Richard W. Hamming (1947): Developed Hamming code.
⑦ Claude E. Shannon (1948): Pioneered information theory. A Mathematical Theory of Communication
⑧ Solomon Kullback & Richard Leibler (1951): Introduced Kullback-Leibler divergence
⑨ David A. Huffman (1951): Developed Huffman encoding.
⑩ Alexis Hocquenghem & Raj Chandra Bose & Dwijendra Kumar Ray-Chaudhuri (1960): Developed BCH code.
⑪ Irving S. Reed & Gustave Solomon (1960): Developed Reed-Solomon code.
⑫ Robert G. Gallager (1962): Introduced low-density parity check.
⑬ Andrey Kolmogorov (1963): Introduced Kolmogorov complexity (the length of the shortest program that produces a given object).
⑭ Abraham Lempel & Jacob Ziv (1977): Developed Lempel-Ziv compression (LZ77).
⑮ Claude Berrou & Alain Glavieux & Punya Thitimajshima (1993): Developed Turbo code.
⑯ Erdal Arıkan (2008): Introduced polar code.
⑵ Type 1. Coding (encoding)
① Huffman code
② Shannon code
③ Shannon-Fano-Elias code (alphabetic code)
⑶ Type 2. Compression
① Uniquely decodable (UD)
② Kraft inequality
⑷ Type 3. Network information theory
⑸ Type 4. Quantum information theory