
Chapter 21. Information Theory



1. Information Theory

2. Variational Inference

3. Modern Information Theory



1. Information Theory

⑴ Overview

① low probability event: high information (surprising)

② high probability event: low information (unsurprising)

⑵ Entropy

① Problem situation: X is Raining / Not Raining. Y is Cloudy / Not Cloudy


image

Figure 1. Entropy example


② Basic concepts of statistics


image


○ Information: everything obtained when the result is known

○ Uncertainty (surprisal, unexpectedness, randomness)


image


○ When the base is 2, the unit used is bits

○ When the base is e (natural logarithm), the unit used is nats
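
Putting the two items above together, a sketch of the surprisal and its units (standard definitions assumed):

I(x) = -\log_2 p(x) \ \text{[bits]} = -\ln p(x) \ \text{[nats]}, \qquad 1 \ \text{nat} = \frac{1}{\ln 2} \approx 1.443 \ \text{bits}

For example, an event with p(x) = 1/8 carries -log2(1/8) = 3 bits of information, while an event with p(x) = 0.99 carries only about 0.014 bits.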

③ Entropy (self-information, Shannon entropy)


image


○ Understanding

○ In other words, entropy is defined as the average value of uncertainty

○ Similar to entropy in physics, higher values are obtained as disorder increases: higher entropy means higher heterogeneity

○ It weights the surprisal (-log p(x)) of each event by that event's probability p(x)

○ Examples

○ Fair coin toss: H = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit

○ Fair die roll: H = 6 × (-(1/6) log2 (1/6)) = log2 6 ≈ 2.58 bits

○ If X follows a uniform distribution over n outcomes, H(X) = log n holds (see the derivation below)
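
As a quick check of the last item, a one-line derivation of the uniform case using the standard definition of entropy:

H(X) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n

With log base 2 and n = 6 this recovers the fair-die value log2 6 ≈ 2.58 bits.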

④ Source Coding Theorem

○ Definition: For a very long data sequence, when compression is optimal, the average number of bits per symbol of the compressed data approaches the entropy.

○ In other words, it refers to the number of bits required to represent a given set of possibilities.

○ With logarithm base 2, it gives the number of bits; with base k, it gives the required storage capacity in a k-ary system.

○ Kraft’s Inequality: A constraint on codeword lengths that must hold for every symbol to be representable by a uniquely decodable bit sequence (see the formula below).


image
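
A minimal statement of Kraft's inequality for a binary code alphabet (the form assumed here), where Li denotes the codeword length assigned to symbol i:

\sum_{i=1}^{n} 2^{-L_i} \le 1

Codeword lengths satisfy this inequality exactly when a prefix-free (and hence uniquely decodable) code with those lengths exists.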


○ Proof of the Source Coding Theorem: For the average number of bits p1L1 + ··· + pnLn,


image


Feature 1. H(X) is shift-invariant

○ When Y = X + a, H(Y) = H(X) holds because pY(x + a) = pX(x)

Feature 2. H(X) ≥ 0

Feature 3. Hb(X) = (logb a) Ha(X) (where a and b denote the logarithm bases)

○ logbp = (logba) logap

⑤ Python code


import numpy as np

def shannon_index(data):
    # Count unique values and their occurrences in the data
    values, counts = np.unique(data, return_counts=True)
    # Calculate proportions using the counts
    proportions = counts / counts.sum()
    # Shannon entropy in nats (natural logarithm); counts from np.unique are always positive, so no zero proportions occur
    return -np.sum(proportions * np.log(proportions))

# Example data, where '1' appears 3 times, '2' appears 2 times, and '3' appears 1 time
data = [1, 1, 1, 2, 2, 3]

# Calculate the Shannon index
index = shannon_index(data)
print(f"Shannon index: {index}")
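
For the example data above, the proportions are (1/2, 1/3, 1/6), so the function returns about 1.011 nats (≈ 1.46 bits; divide by ln 2 to convert), matching the entropy formula applied by hand.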


⑥ Joint entropy


image
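
For reference, the standard definition of joint entropy for discrete random variables X and Y:

H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y)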


Feature 1. H(X, Y) ≤ H(X) + H(Y): If X and Y are independent, H(X, Y) = H(X) + H(Y)


image


⑦ Conditional entropy


image

Figure 2. Conditional entropy visualization


○ Conditional entropy for a specific condition


image


○ Total conditional entropy


image
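
A sketch of the two quantities above, written with the standard definitions:

H(Y \mid X = x) = -\sum_{y} p(y \mid x) \log p(y \mid x), \qquad H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x, y} p(x, y) \log p(y \mid x)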


○ Information gain (IG): The reduction in entropy (increase in order) obtained by knowing X, i.e., IG(Y, X) = H(Y) - H(Y | X)


image


○ Information bottleneck method (IB): Describing the trade-off between complexity and accuracy using information theory

○ Application: For example, it can determine the number of clusters in given data.

Feature 1. If X and Y are independent, H(Y | X) = H(Y) (cf. H(Y | X) ≤ H(Y))

Feature 2. H(Y | Y) = 0

Feature 3. H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y) (Chain rule)


image


Feature 4. H(X, Y | Z) = H(X | Z) + H(Y | X, Z)


image


⑧ Mutual information


image

Figure 3. Mutual information visualization


image
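
For reference, the standard definition of mutual information for discrete X and Y:

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = H(X) + H(Y) - H(X, Y)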


○ Information gain and mutual information are equivalent.


image


Feature 1. I(X; Y) = H(Y) - H(Y | X) = H(X) - H(X | Y)

Feature 2. Mutual information ≥ 0 (equality condition: X and Y are independent)


image


Feature 3. H(X | Y) ≤ H(X) (equality condition: X and Y are independent)


image


Feature 4. I(X; Y) ≤ H(X, Y): H(X, Y) - I(X; Y) satisfies the definition of mathematical distance.

Application 1. Mutual information can be used for assessing image similarity.

Application 2. mRMR (maximum relevance minimum redundancy): A technique used to select features that have high mutual information with the output while having low mutual information among themselves. It is commonly used in machine learning. Matlab code is as follows:


% Rank features using mRMR
selected_features_idx = fscmrmr(features, labels);


⑨ Relative entropy (Kullback Leibler divergence, KL divergence, KLD)


image


○ Overview

○ A measure of how two probability distributions diverge.

○ It indicates the divergence of an approximate distribution q(x), which aims to mimic the true distribution, from the true distribution p(x).

○ Typically, the true distribution is a complex probability distribution function, while the approximate distribution is represented using a parameterized model.

○ The approximate distribution can be made manageable, often resembling a simple distribution like the normal distribution.

○ Strictly speaking, it does not satisfy the definition of mathematical distance.
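
For reference, the standard definition for discrete distributions, writing p for the true distribution and q for the approximation:

D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}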

Feature 1. asymmetry: DKL(p(x) || q(x)) ≠ DKL(q(x) || p(x)) (∴ violates the definition of mathematical distance)

Feature 2. D(p || q) ≥ 0 (equality condition: p(x) = q(x))


image


Feature 3. H(X) ≤ log n (where n is the number of elements X can take)


image


Application. Choosing the predictive distribution q with the smallest KL divergence from the true but unknown data distribution p.

○ Maximizing the log-likelihood minimizes the KL divergence DKL(p || q).


image


Application. Jensen-Shannon divergence (JSD)


image
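
A sketch of the usual form of JSD, assuming the standard mixture m = (p + q)/2:

JSD(p \,\|\, q) = \tfrac{1}{2} D_{KL}(p \,\|\, m) + \tfrac{1}{2} D_{KL}(q \,\|\, m)

Unlike KL divergence, JSD is symmetric and bounded.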


⑩ Cross entropy

○ Definition: The average number of bits required to encode events drawn from the true distribution p when using a code optimized for the distribution q.

○ In machine learning, p is the empirical (label) distribution and q is the model prediction, so cross-entropy serves as a loss function.


image


Feature 1. -∑ pi log qi = H(p) + DKL(p || q)

Feature 2. Minimizing DKL(p || q) is the same as minimizing cross-entropy because H(p) is constant.

Feature 3. When pi is taken to be the empirical distribution of the data (uniform weight 1/N on each observation), minimizing -∑ pi log qi is equivalent to maximum likelihood estimation (see the numerical check below).
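
A minimal numerical check of Feature 1, using two hypothetical discrete distributions p (true) and q (predicted); the numbers are illustrative only:


import numpy as np

# Hypothetical discrete distributions over three outcomes
p = np.array([0.5, 0.3, 0.2])   # true distribution
q = np.array([0.4, 0.4, 0.2])   # predicted distribution

cross_entropy = -np.sum(p * np.log2(q))      # -sum p_i log2 q_i
entropy_p = -np.sum(p * np.log2(p))          # H(p)
kl_divergence = np.sum(p * np.log2(p / q))   # DKL(p || q)

# Feature 1: cross-entropy = H(p) + DKL(p || q); both values are about 1.52 bits
print(cross_entropy, entropy_p + kl_divergence)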

⑪ Asymptotic equipartition property (AEP)


image
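
The standard statement of the AEP for an i.i.d. source X1, X2, ..., Xn with distribution p:

-\frac{1}{n} \log_2 p(X_1, X_2, \dots, X_n) \;\to\; H(X) \quad \text{in probability as } n \to \infty

In other words, long typical sequences all have probability close to 2^{-nH(X)}, which is what makes compression down to about H(X) bits per symbol possible.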


Comparison 1. Rényi Entropy

① Hartley (max) entropy: For a source represented by M symbols (e.g., if expressed with the alphabet, M = 26)


image


② Min entropy: Related to signal processing


image


③ Rényi entropy: Encompasses Shannon entropy (α → 1; by L’Hôpital’s rule), max entropy (α → 0), and min entropy (α → ∞)


image
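
A sketch of the Rényi entropy of order α (standard form assumed), from which the three limiting cases above follow:

H_{\alpha}(X) = \frac{1}{1 - \alpha} \log \left( \sum_{i=1}^{n} p_i^{\alpha} \right)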


④ Rényi divergence: As α → 1, Rényi divergence converges to KL divergence (∵ L’Hôpital’s rule)


image


Comparison 2. Sample Entropy

① A dynamic measure that reflects time

② Important in signal processing

Comparison 3. Gini impurity


image


① Entropy is not the only such measure; alternative measures such as Gini impurity have been proposed.

② Gini impurity weights the probability of mislabeling a sample of class xi, (1 - P(xi)), by the probability P(xi) of that class (see the formula after this list).

③ Both Gini impurity and entropy increase as the given data becomes more disordered: this is the origin of the term "impurity".

④ (Conceptual Distinction) The Gini Coefficient in Economics

○ The concept was first introduced to evaluate income inequality: It can assess not only income but also the degree of inequality in the distribution of a discrete variable.

○ As the term moved from economics into machine learning and information theory, it kept the connotation of inequality, but its mathematical definition changed.

○ Difference: If the distribution of the variable is entirely uniform,

○ In economics, the Gini coefficient indicates perfect equality, so it takes the value 0.

○ In information theory and machine learning, the Gini impurity attains its maximum value (by the Cauchy-Schwarz inequality). Intuitively, the distribution is most disordered, so the value is maximal.
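
As referenced in ② above, a sketch of the Gini impurity in its standard form:

G = \sum_{i} P(x_i)\,(1 - P(x_i)) = 1 - \sum_{i} P(x_i)^2

For a uniform distribution over n classes this gives 1 - 1/n, the maximum value mentioned above; for a pure node (a single class with probability 1) it gives 0.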

Comparison 4. Connectivity index (CI)

① In a spatial lattice, if a specific spot and its nearest neighbor are in different clusters, the CI value increases.

② Therefore, a higher CI value implies higher heterogeneity.

Comparison 5. Simpson index

① Another measure based on information theory, along with Shannon entropy

② Based on spatial transcriptomics, the probability of two random spots belonging to the same cluster (ref)

③ The lower the Simpson index, the higher the heterogeneity implied

Comparison 6. Cluster modularity

① Based on spatial transcriptomics, a metric indicating the purity (homogeneity) of spots within a cluster (ref)

Comparison 7. Lempel-Ziv complexity

Comparison 8. Silhouette coefficient

Comparison 9. Calinski-Harabasz index



2. Variational Inference

⑴ Overview

① The probability that B was the cause when A was observed, denoted P(B | A), is called the posterior probability (cf. Bayes’ theorem).

② Variational Inference: A method to infer a complex posterior distribution by introducing a simpler variational distribution and conducting inference in a way that reduces the difference between the two.

⑵ ELBO (Evidence Lower Bound): The standard method for variational inference

① Jensen’s inequality: For functions like the logarithmic function, which are concave, the following holds:


image
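
A sketch of Jensen's inequality for a concave function f such as the logarithm (standard form assumed):

f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)], \quad \text{in particular} \quad \log \mathbb{E}[X] \ge \mathbb{E}[\log X]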


② Problem situation: It is difficult to directly calculate the log evidence of a model, log p(x), for the given data x and latent variables z.


image


③ Mathematical formulation:


image


○ ELBO = 𝔼q(z) [log p(x, z)] - 𝔼q(z) [log q(z)]

○ 𝔼q(z) [log p(x, z)]: The expected value of the log joint probability of the data x and the latent variables z, taken with respect to the variational distribution q(z).

○ ELBO provides a lower bound for log p(x), the quantity we want to compute.

○ In variational inference, we approximate the complex posterior distribution p(z | x) with the variational distribution by maximizing the ELBO (see the derivation below).
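
A standard derivation using Jensen's inequality from ①, with q(z) denoting the variational distribution:

\log p(x) = \log \int p(x, z)\, dz = \log \mathbb{E}_{q(z)}\left[ \frac{p(x, z)}{q(z)} \right] \ge \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] = \text{ELBO}

Equivalently, \log p(x) = \text{ELBO} + D_{KL}(q(z) \,\|\, p(z \mid x)), so maximizing the ELBO with respect to q(z) minimizes the KL divergence between q(z) and the true posterior.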

④ Example


image



3. Modern Information Theory

⑴ History

① Ludwig Boltzmann (1872): Proposed the H-theorem to explain the entropy of gas particles.

② Harry Nyquist (1924): Quantified the speed at which “intelligence” is propagated through a communication system.

③ John von Neumann (1927): Extended the Gibbs entropy to quantum mechanics (von Neumann entropy).

④ Ralph Hartley (1928): Quantified the amount of distinguishable information using logarithms.

⑤ Alan Turing (1940): Introduced deciban for decrypting the German Enigma cipher.

⑥ Richard W. Hamming (1947): Developed Hamming code.

⑦ Claude E. Shannon (1948): Pioneered information theory. A Mathematical Theory of Communication

⑧ Solomon Kullback & Richard Leibler (1951): Introduced Kullback-Leibler divergence

⑨ David A. Huffman (1951): Developed Huffman encoding.

⑩ Alexis Hocquenghem & Raj Chandra Bose & Dwijendra Kumar Ray-Chaudhuri (1960): Developed BCH code.

⑪ Irving S. Reed & Gustave Solomon (1960): Developed Reed-Solomon code.

⑫ Robert G. Gallager (1962): Introduced low-density parity-check (LDPC) codes.

⑬ Andrey Kolmogorov (1963): Introduced Kolmogorov complexity (related to minimum description length).

⑭ Abraham Lempel & Jacob Ziv (1977): Developed Lempel-Ziv compression (LZ77).

⑮ Claude Berrou & Alain Glavieux & Punya Thitimajshima (1993): Developed Turbo code.

⑯ Erdal Arıkan (2008): Introduced polar code.

Type 1. Encoding (source coding)

① Huffman code

② Shannon code

③ Shannon-Fano-Elias code (alphabetic code)

Type 2. Compression

① Uniquely decodable (UD)

② Kraft inequality

Type 3. Network information theory

Type 4. Quantum information theory



Input: 2021-11-10 22:28

Modified: 2023-03-21 16:22
