
Chapter 21. Information Theory



1. Information Theory

2. Variational Inference

3. Modern Information Theory



1. Information Theory

⑴ Overview

① low probability event: high information (surprising)

② high probability event: low information (unsurprising)

⑵ Entropy

① Problem situation: X is Raining / Not Raining. Y is Cloudy / Not Cloudy


image

Figure 1. Entropy example


② Basic concepts of statistics


image


○ Information: everything obtained when the result is known

○ Uncertainty (surprisal, unexpectedness, randomness)


image


○ When the base is 2, the unit used is bits

○ When the base is e (natural logarithm), the unit used is nats

③ Entropy (self-information, Shannon entropy)


image


○ Understanding

○ In other words, entropy is defined as the average value of uncertainty

○ Similar to entropy in physics, higher values are obtained as disorder increases: higher entropy means higher heterogeneity

○ It weights the surprisal (-log p(x)) of each event by that event's probability (p(x)) and averages over all events

○ Examples (verified numerically in the code sketch below)

○ Fair coin toss : H = -0.5 log2 0.5 - 0.5 log2 0.5 = 1

○ Fair die : H = 6 × (-(1/6) log2 (1/6)) = log2 6 ≈ 2.58

○ If X follows a uniform distribution over n outcomes, H(X) = log n holds
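A quick numerical check of these examples (a minimal sketch; the helper name entropy_bits and the use of base-2 logarithms are choices made here for illustration):

import numpy as np

def entropy_bits(p):
    # Shannon entropy, in bits, of a discrete distribution given as a probability vector
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # treat 0 log 0 as 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))         # fair coin -> 1.0
print(entropy_bits([1/6] * 6))          # fair die  -> log2(6) ≈ 2.585
print(entropy_bits([1/4] * 4))          # uniform over n = 4 outcomes -> log2(4) = 2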

④ Source coding theorem

○ Definition: For a very long data sequence, if the compression rate is maximized, the average number of bits of the compressed data approaches the entropy.

○ In other words, it refers to the number of bits required to represent a given set of possibilities.

○ With base-2 logarithms the value is a number of bits; with base-k logarithms it gives the required storage capacity in a k-ary (k-symbol) system.

○ Kraft’s inequality: a constraint on codeword lengths that guarantees every symbol can be assigned a uniquely decodable bit sequence (checked numerically in the sketch after the proof).


image


○ Proof of the Source Coding Theorem: For the average number of bits p1L1 + ··· + pnLn,


image
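As a rough illustration of Kraft's inequality and the entropy bound above, the sketch below checks a small hand-picked prefix code (the probabilities and codeword lengths are assumptions for illustration, not the output of a real compressor):

import numpy as np

# Hypothetical source probabilities p_i and prefix-code lengths L_i
# (e.g., codewords 0, 10, 110, 111)
p = np.array([0.5, 0.25, 0.125, 0.125])
L = np.array([1, 2, 3, 3])

kraft_sum = np.sum(2.0 ** -L)         # Kraft's inequality requires this to be <= 1
avg_bits  = np.sum(p * L)             # average number of bits p1L1 + ... + pnLn
H         = -np.sum(p * np.log2(p))   # entropy H(X), the lower bound on avg_bits

print(kraft_sum)                      # 1.0  -> these lengths admit a prefix code
print(avg_bits, H)                    # 1.75 1.75 -> this code meets the entropy bound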


Feature 1. H(X) is shift-invariant

○ When Y = X + a, H(Y) = H(X) holds because pY(y) = pX(x)

Feature 2. H(X) ≥ 0

Feature 3. Hb(X) = (logb a) Ha(X) (where a and b are the bases of the logarithms)

○ logb p = (logb a) loga p

⑤ Python code


import numpy as np

def shannon_index(data):
    # Count unique values and their occurrences in the data
    values, counts = np.unique(data, return_counts=True)
    # Calculate proportions using the counts
    proportions = counts / sum(counts)
    # Shannon index in nats (natural log; use np.log2 for bits);
    # np.unique only returns observed values, so every proportion is strictly positive
    return -np.sum(proportions * np.log(proportions))

# Example data, where '1' appears 3 times, '2' appears 2 times, and '3' appears 1 time
data = [1, 1, 1, 2, 2, 3]

# Calculate the Shannon index
index = shannon_index(data)
print(f"Shannon index: {index}")


⑥ Joint entropy


image


Feature 1. H(X, Y) ≤ H(X) + H(Y): If X and Y are independent, H(X, Y) = H(X) + H(Y) (checked numerically in the sketch below)


image
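A small numerical check of Feature 1, using a made-up 2 × 2 joint distribution for the rain/cloud example (the probabilities are assumptions for illustration; the same table is reused in the conditional-entropy sketch further below):

import numpy as np

def H(p):
    # Shannon entropy in bits; zero entries are ignored
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Assumed joint p(x, y): rows X = rain / no rain, columns Y = cloudy / not cloudy
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginal distributions

print(H(pxy))            # H(X, Y) ≈ 1.72
print(H(px) + H(py))     # H(X) + H(Y) = 2.0 >= H(X, Y); equal only if X and Y are independent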


⑦ Conditional entropy


image

Figure 2. Conditional entropy visualization


○ Conditional entropy for a specific condition


image


○ Total conditional entropy


image


○ Information gain (IG): the reduction in uncertainty (i.e., the increase in order) obtained by knowing the information X


image


○ Information bottleneck method (IB): describes the trade-off between complexity and accuracy using information theory

○ Application : For example, it can determine the number of clusters in given data.

Feature 1. If X and Y are independent, H(Y | X) = H(Y) (cf. H(Y | X) ≤ H(Y))

Feature 2. H(Y | Y) = 0

Feature 3. H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y) (Chain rule; verified numerically in the sketch at the end of this subsection)


image


Feature 4. H(X, Y | Z) = H(X | Z) + H(Y | X, Z)


image
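The chain rule (Feature 3) and the information-gain decomposition can be checked numerically with the same made-up joint table used above (the values are illustrative assumptions):

import numpy as np

def H(p):
    # Shannon entropy in bits; zero entries are ignored
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.4, 0.1],     # assumed joint p(x, y)
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# Total conditional entropy: H(Y | X) = sum_x p(x) * H(Y | X = x)
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))

print(H(pxy))                    # H(X, Y)
print(H(px) + H_Y_given_X)       # H(X) + H(Y | X): equals H(X, Y) (chain rule)
print(H(py) - H_Y_given_X)       # information gain IG = H(Y) - H(Y | X) >= 0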


⑧ Mutual information


image

Figure 3. Mutual information visualization


image


○ Information gain and mutual information are equivalent.


image


Feature 1. I(X; Y) = H(Y) - H(Y | X) = H(X) - H(X | Y)

Feature 2. Mutual information ≥ 0 (equality condition: X and Y are independent)


image


Feature 3. H(X | Y) ≤ H(X) (equality condition: X and Y are independent)


image


Feature 4. I(X; Y) ≤ H(X, Y): H(X, Y) - I(X; Y) satisfies the definition of mathematical distance.

Application 1. Mutual information can be used for assessing image similarity.

Application 2. mRMR (maximum relevance minimum redundancy): A technique used to select features that have high mutual information with the output while having low mutual information among themselves. It is commonly used in machine learning.
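A minimal check of Features 1-2 with the same assumed joint table (X and Y are dependent there, so the mutual information is strictly positive):

import numpy as np

def H(p):
    # Shannon entropy in bits; zero entries are ignored
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.4, 0.1],     # assumed joint p(x, y)
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I = H(px) + H(py) - H(pxy)      # I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y | X)
print(I)                        # > 0 because X and Y are dependent

indep = np.outer(px, py)        # an independent joint with the same marginals
print(H(px) + H(py) - H(indep)) # ≈ 0: mutual information vanishes under independence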

⑨ Relative entropy (Kullback-Leibler divergence, KL divergence, KLD)


image


○ Overview

○ A measure of how two probability distributions diverge.

○ It indicates the divergence between an approximate distribution p(x), which aims to mimic the true distribution, and the true distribution q(x).

○ Typically, the true distribution is a complex probability distribution function, while the approximate distribution is represented using a parameterized model.

○ The approximate distribution can be made manageable, often resembling a simple distribution like the normal distribution.

○ Strictly speaking, it does not satisfy the definition of mathematical distance.

Feature 1. Asymmetry: DKL(p(x) || q(x)) ≠ DKL(q(x) || p(x)) (∴ violates the definition of a mathematical distance; illustrated numerically in the sketch below)

Feature 2. D(p || q) ≥ 0 (equality condition: p(x) = q(x))


image


Feature 3. H(X) ≤ log n (where n is the number of elements X can take)


image


Application. Determining the predictive data distribution p with the smallest KL distance to the true unknown data distribution q.

○ Maximizing log-likelihood minimizes KL distance.


image


Application. Jensen-Shannon divergence (JSD)


image
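A sketch of Features 1-2 and of the Jensen-Shannon divergence (the two distributions p and q below are arbitrary illustrative choices):

import numpy as np

def kl(p, q):
    # D_KL(p || q) in bits; assumes q > 0 wherever p > 0
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), kl(q, p))   # asymmetric: the two directions give different values
print(kl(p, p))             # 0 when the two distributions coincide

m = 0.5 * (p + q)           # JSD: a symmetrized, smoothed version of KL divergence
print(0.5 * kl(p, m) + 0.5 * kl(q, m))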


⑩ Cross entropy

○ Definition: the average number of bits required to encode events drawn from the distribution p when using a code optimized for the distribution q.

○ In machine learning, p is typically the true (empirical) label distribution and q is the model prediction.


image


Feature 1. -∑ pi log qi = H(p) + DKL(p || q)

Feature 2. Minimizing DKL(p || q) is the same as minimizing cross-entropy because H(p) is constant.

Feature 3. When pi is the empirical distribution (uniform over the observed samples), minimizing -∑ pi log qi is equivalent to maximum likelihood estimation.
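Feature 1 can be verified directly: the cross-entropy equals H(p) plus the KL divergence (p and q below are arbitrary illustrative distributions):

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # assumed "true" distribution
q = np.array([0.4, 0.4, 0.2])   # assumed model prediction

cross_entropy = -np.sum(p * np.log2(q))
H_p           = -np.sum(p * np.log2(p))
kl_pq         =  np.sum(p * np.log2(p / q))

print(cross_entropy)            # -sum p_i log q_i
print(H_p + kl_pq)              # same value: H(p) + D_KL(p || q)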

⑪ Asymptotic equipartition property (AEP)


image


Comparison 1. Rényi Entropy

① Hartley (max) entropy: For a source represented by M symbols (e.g., if expressed with the alphabet, M = 26)


image


② Min entropy: Related to signal processing


image


③ Rényi entropy: Encompasses Shannon entropy (α → 1; by L’Hôpital’s rule), max entropy (α → 0), and min entropy (α → ∞); these limits are checked numerically in the sketch below


image


④ Rényi divergence: As α → 1, Rényi divergence converges to KL divergence (∵ L’Hôpital’s rule)


image
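The limiting behaviour of the Rényi entropy can be sanity-checked numerically (the distribution and the specific α values are chosen here for illustration; α values close to the singular point 1 are used instead of taking the limit symbolically):

import numpy as np

def renyi_entropy(p, alpha):
    # Rényi entropy in bits for alpha != 1
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])

print(renyi_entropy(p, 1e-6))      # alpha -> 0:   ~ log2(4) = 2 (Hartley / max entropy)
print(renyi_entropy(p, 1 + 1e-6))  # alpha -> 1:   ~ 1.75 (Shannon entropy)
print(renyi_entropy(p, 100.0))     # alpha -> inf: ~ -log2(max p) = 1 (min entropy)
print(-np.sum(p * np.log2(p)))     # Shannon entropy, for comparison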


Comparison 2. Sample Entropy

① A dynamic measure that reflects time

② Important in signal processing

Comparison 3. Gini impurity


image


① Entropy is not the only such measure; alternative measures such as Gini impurity have been proposed.

② Gini impurity weights the probability of mislabeling each class, 1 - P(xi), by that class’s probability P(xi): ∑ P(xi)(1 - P(xi)).

③ Both Gini impurity and entropy increase as the given data becomes more disordered, which is the origin of the term “impurity” (compared numerically in the sketch after this list).

④ (Conceptual Distinction) The Gini Coefficient in Economics

○ The concept was first introduced to evaluate income inequality: It can assess not only income but also the degree of inequality in the distribution of a discrete variable.

○ As the Gini coefficient moved from economics into machine learning and information theory, the underlying idea of inequality was retained, but the mathematical definition changed.

○ Difference: If the distribution of the variable is entirely uniform,

○ In economics, a perfectly uniform distribution means perfect equality, so the Gini coefficient takes its minimum value of 0.

○ In information theory, the Gini impurity attains its maximum value for a uniform distribution (by the Cauchy-Schwarz inequality, ∑ P(xi)² is minimized there); intuitively, the data is most disordered, so the impurity is maximal.
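A minimal side-by-side comparison of Gini impurity and Shannon entropy on the same label sets (the labels are made up; both measures are zero for pure data and maximal for a uniform class mix):

import numpy as np

def gini_impurity(labels):
    # 1 - sum_i P(x_i)^2  =  sum_i P(x_i) * (1 - P(x_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy_bits(labels):
    # Shannon entropy of the empirical label distribution, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure  = [1, 1, 1, 1, 1, 1]      # perfectly ordered: one class only
mixed = [1, 1, 2, 2, 3, 3]      # maximally disordered over three classes

print(gini_impurity(pure),  entropy_bits(pure))    # both zero
print(gini_impurity(mixed), entropy_bits(mixed))   # 2/3 and log2(3) ≈ 1.585: both maximal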

Comparison 4. Connectivity index (CI)

① In a spatial lattice, if a specific spot and its nearest neighbor are in different clusters, the CI value increases.

② Therefore, a higher CI value implies higher heterogeneity.

Comparison 5. Simpson index

① Another measure based on information theory, along with Shannon entropy

② Based on spatial transcriptomics, the probability of two random spots belonging to the same cluster (ref)

③ The lower the Simpson index, the higher the heterogeneity implied
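A rough sketch of the Simpson index computed from cluster labels (here of hypothetical spots; the label vectors are made up for illustration):

import numpy as np

def simpson_index(labels):
    # Probability that two randomly chosen observations belong to the same cluster
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

homogeneous   = [0, 0, 0, 0, 0, 0]      # all spots in one cluster
heterogeneous = [0, 1, 2, 3, 4, 5]      # every spot in its own cluster

print(simpson_index(homogeneous))       # 1.0   -> low heterogeneity
print(simpson_index(heterogeneous))     # ≈ 0.17 -> high heterogeneity (lower index)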

Comparison 6. Cluster modularity

① Based on spatial transcriptomics, a metric indicating the purity (homogeneity) of spots within a cluster (ref)

Comparison 7. Lempel-Ziv complexity

Comparison 8. Silhouette coefficient

Comparison 9. Calinski-Harabasz index



2. Variational Inference

⑴ Overview

① The probability that B was the cause when A was observed, denoted as P(B | A), is called the posterior probability (cf. Bayes’ theorem).

② Variational Inference: A method to infer a complex posterior distribution by introducing a simpler variational distribution and conducting inference in a way that reduces the difference between the two.

⑵ ELBO (Evidence Lower Bound): The standard method for variational inference

① Jensen’s inequality: For functions like the logarithmic function, which are concave, the following holds:


image


② Problem situation: It is difficult to directly calculate the log evidence of a model, log p(x), for the given data x and latent variables z.


image


③ Mathematical formulation:


image


○ ELBO = 𝔼q(z) [log p(x, z)] - 𝔼q(z) [log q(z)]

○ 𝔼q(z) [log p(x, z)]: The expected value of the joint probability of the model calculated with respect to the variational distribution. It approximates the joint probability of the data x and the latent variables z using the variational distribution.

○ ELBO provides a lower bound for log p(x), the quantity we want to compute.

○ In variational inference, we approximate the complex posterior distribution p(z | x) with q(z) by maximizing the ELBO.

④ Example


image
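As a toy numerical illustration of the bound (separate from the example above): for a model with a single observation x and one binary latent variable z, the joint p(x, z) is fixed by hand, and the ELBO is evaluated for an arbitrary q(z) and for the exact posterior (all numbers are illustrative assumptions):

import numpy as np

p_xz = np.array([0.12, 0.28])          # assumed joint: p(x, z=0), p(x, z=1)
log_evidence = np.log(p_xz.sum())      # log p(x), the quantity that is hard to compute in general

def elbo(q):
    # ELBO = E_q[log p(x, z)] - E_q[log q(z)] for a discrete distribution q over z
    q = np.asarray(q, dtype=float)
    return np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

print(log_evidence)                    # log p(x)
print(elbo([0.5, 0.5]))                # arbitrary q(z): ELBO < log p(x)
print(elbo(p_xz / p_xz.sum()))         # q(z) = exact posterior p(z | x): ELBO = log p(x)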



3. Modern Information Theory

⑴ History

① Ludwig Boltzmann (1872) : Proposed the H-theorem to explain the entropy of gas particles.

② Harry Nyquist (1924) : Quantified the speed at which “intelligence” is propagated through a communication system.

③ John von Neumann (1927): Extended the Gibbs entropy to quantum mechanics (von Neumann entropy).

④ Ralph Hartley (1928): Expressed the number of distinguishable information by using logarithms.

⑤ Alan Turing (1940): Introduced deciban for decrypting the German Enigma cipher.

⑥ Richard W. Hamming (1947): Developed Hamming code.

⑦ Claude E. Shannon (1948): Pioneered information theory with “A Mathematical Theory of Communication”.

⑧ Solomon Kullback & Richard Leibler (1951): Introduced Kullback-Leibler divergence

⑨ David A. Huffman (1951): Developed Huffman encoding.

⑩ Alexis Hocquenghem & Raj Chandra Bose & Dwijendra Kumar Ray-Chaudhuri (1960): Developed BCH code.

⑪ Irving S. Reed & Gustave Solomon (1960): Developed Reed-Solomon code.

⑫ Robert G. Gallager (1962): Introduced low-density parity check.

⑬ Kolmogorov (1963): Introduced Kolmogorov complexity (minimum description length).

⑭ Abraham Lempel & Jacob Ziv (1977): Developed Lempel-Ziv compression (LZ77).

⑮ Claude Berrou & Alain Glavieux & Punya Thitimajshima (1993): Developed Turbo code.

⑯ Erdal Arıkan (2008): Introduced polar code.

Type 1. Encoding

① Huffman code

② Shannon code

③ Shannon-Fano-Elias code (alphabetic code)

Type 2. Compression

① Uniquely decodable (UD)

② Kraft inequality

Type 3. Network information theory

Type 4. Quantum information theory



Input: 2021-11-10 22:28

Modified: 2023-03-21 16:22
