Chapter 20. Information Theory


1. Information Theory

2. Modern Information Theory



1. Information Theory

⑴ Overview

① low probability event: high information (surprising)

② high probability event: low information (unsurprising)

⑵ Entropy

① Problem situation: X is Raining / Not Raining. Y is Cloudy / Not Cloudy


image

Figure. 1. Entropy example


② Basic concepts of statistics


image


○ Information: everything obtained once the outcome is known

③ Uncertainty (surprisal, unexpectedness, randomness)


I(x) = -log p(x)


Definition 1. Absence of information about the outcome

Definition 2. Number of bits required to represent the given number of cases

○ When the base is 2, the unit used is bits

○ When the base is e (natural logarithm), the unit used is nats

④ Entropy (average self-information)


H(X) = E[-log p(X)] = -Σx p(x) log p(x)


○ Understanding

○ In other words, entropy is defined as the average value of uncertainty

○ Similar to entropy in physics, higher values are obtained as disorder increases: higher entropy means higher heterogeneity

○ It weights the surprisal of each event, -log p(x), by that event's probability p(x)

○ Examples

○ Fair coin toss : H = -0.5 log2 0.5 - 0.5 log2 0.5 = 1

○ Fair die : H = 6 × (-(1/6) log2 (1/6)) = log2 6 ≈ 2.58

○ If X follows a uniform distribution, H(X) = log n holds
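
The worked examples above can be verified numerically; a minimal sketch (the helper name `entropy` is ours, not from the text) computing H in bits:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum p * log(p), skipping zero-probability events."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Fair coin toss: H = 1 bit
print(entropy([0.5, 0.5]))        # ≈ 1.0

# Fair die: H = log2(6) ≈ 2.58 bits
print(entropy([1/6] * 6))         # ≈ 2.585

# Uniform distribution over n outcomes attains the maximum H = log2(n)
print(entropy([1/8] * 8))         # ≈ 3.0
```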

⑤ Cross entropy

○ Definition: the average number of bits needed to encode outcomes drawn from the distribution p when using a code optimized for the distribution q

○ In machine learning, the cross entropy between the true label distribution and the model's predicted distribution is widely used as a loss function


H(p, q) = -Σx p(x) log q(x)
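
As a minimal sketch of cross entropy used as a loss (the label and prediction vectors below are illustrative):

```python
import math

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum p(x) * log q(x): average bits needed to encode
    outcomes from p using a code optimized for q."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0]         # true one-hot label distribution
q_good = [0.9, 0.05, 0.05]  # confident, correct model prediction
q_bad = [0.1, 0.45, 0.45]   # incorrect model prediction

print(cross_entropy(p, q_good))  # ≈ 0.152 (low loss)
print(cross_entropy(p, q_bad))   # ≈ 3.322 (high loss)
```

Note that H(p, q) ≥ H(p), with equality when q = p, which is why minimizing this loss pushes the prediction toward the true distribution.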


Feature 1. H(X) is shift-invariant

○ When Y = X + a, H(Y) = H(X) holds because pY(x + a) = pX(x)

Feature 2. H(X) ≥ 0

Feature 3. Hb(X) = (logb a) Ha(X) (where a and b are the bases of the logarithm)

○ This follows from logb p = (logb a) loga p

⑥ Joint entropy


H(X, Y) = -Σx Σy p(x, y) log p(x, y)


Feature 1. If X and Y are independent, H(X, Y) = H(X) + H(Y)


image


⑦ Conditional entropy

○ Conditional entropy for a specific condition


H(Y | X = x) = -Σy p(y | x) log p(y | x)


○ Total conditional entropy


H(Y | X) = Σx p(x) H(Y | X = x) = -Σx Σy p(x, y) log p(y | x)


○ Information gain (IG): the reduction in the entropy of Y (i.e., the increase in order) obtained by knowing X


IG(Y, X) = H(Y) - H(Y | X)


○ Information bottleneck (IB) method: describes the trade-off between complexity and accuracy in information-theoretic terms

○ Application : For example, it can determine the number of clusters in given data.

Feature 1. If X and Y are independent, H(Y | X) = H(Y)

Feature 2. H(Y | Y) = 0

Feature 3. H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y) (Chain rule)


image


Feature 4. H(X, Y | Z) = H(X | Z) + H(Y | X, Z)


image
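
The conditional-entropy identities above can be checked on a small joint distribution in the spirit of the rain/cloud example; the probabilities below are illustrative, not taken from Figure 1:

```python
import math
from collections import defaultdict

# Hypothetical joint distribution for X = rain / no rain, Y = cloudy / clear
joint = {('rain', 'cloudy'): 0.24, ('rain', 'clear'): 0.01,
         ('no rain', 'cloudy'): 0.25, ('no rain', 'clear'): 0.50}

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distributions p(x) and p(y)
px = defaultdict(float)
py = defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())

# Total conditional entropy H(Y | X) = sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(px[x] * H([joint[x, y] / px[x] for y in py]) for x in px)

# Chain rule: H(X, Y) = H(X) + H(Y | X)
print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-12)   # True

# Information gain IG = H(Y) - H(Y | X) is non-negative
print(H_Y - H_Y_given_X)
```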


⑧ Relative entropy (Kullback Leibler distance, KL distance)


D(p || q) = Σx p(x) log ( p(x) / q(x) )


○ Overview

○ A measure of how two probability distributions diverge.

○ It measures how far an approximate distribution p(x), constructed to mimic the true distribution, diverges from the true distribution q(x).

○ Typically, the true distribution is a complex probability distribution function, while the approximate distribution is represented using a parameterized model.

○ The approximate distribution can be made manageable, often resembling a simple distribution like the normal distribution.
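
A minimal sketch of the KL distance between two discrete distributions (the distributions themselves are illustrative):

```python
import math

def kl(p, q, base=2):
    """D(p || q) = sum p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [1/3, 1/3, 1/3]   # approximate (uniform) distribution

print(kl(p, q))   # ≥ 0; 0 only when p = q
print(kl(q, p))   # generally different: KL distance is asymmetric
```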

○ Mutual information


I(X; Y) = Σx Σy p(x, y) log [ p(x, y) / ( p(x) p(y) ) ]


○ Information gain and mutual information are equivalent.


image


○ Mutual information can be used for assessing image similarity.

Feature 1. Asymmetry : D(p || q) ≠ D(q || p)

Feature 2. D(p || q) ≥ 0 (equality condition: p(x) = q(x))


image


Feature 3. Mutual information ≥ 0 (equality condition: X and Y are independent)


image
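
Feature 3 and its equality condition can be checked numerically; the joint distributions below are illustrative:

```python
import math
from collections import defaultdict

def mutual_information(joint, base=2):
    """I(X; Y) = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x) p(y)) ]."""
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y]), base)
               for (x, y), p in joint.items() if p > 0)

# Dependent variables -> strictly positive mutual information
dep = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Independent variables -> zero mutual information
ind = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(mutual_information(dep))  # > 0
print(mutual_information(ind))  # ≈ 0
```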


Feature 4. H(X) ≤ log n (where n is the number of elements X can take)


image


Feature 5. H(X | Y) ≤ H(X) (equality condition: X and Y are independent)


image


Application. Determining the predictive data distribution p with the smallest KL distance to the true unknown data distribution q.

○ Maximizing log-likelihood minimizes KL distance.


image
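
As a sketch of the argument, writing q for the true data distribution and p_θ for the parameterized predictive distribution:

```latex
D_{KL}(q \,\|\, p_\theta)
  = \sum_x q(x) \log \frac{q(x)}{p_\theta(x)}
  = -H(q) - \mathbb{E}_{x \sim q}\!\left[\log p_\theta(x)\right]
```

Since H(q) does not depend on θ, minimizing the KL distance over θ is equivalent to maximizing the expected log-likelihood.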


⑨ Asymptotic equipartition property (AEP)


-(1/n) log p(X1, ..., Xn) → H(X) in probability as n → ∞, for X1, ..., Xn i.i.d. ~ p(x)
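
The AEP can be illustrated by simulation: the per-symbol log-probability of a long i.i.d. sequence concentrates around H(X). A sketch with an illustrative biased coin:

```python
import math
import random

random.seed(0)

p_heads = 0.8  # illustrative source parameter
H = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))

def per_symbol_log_prob(n):
    """-(1/n) log2 p(X1, ..., Xn) for one sampled i.i.d. sequence."""
    total = 0.0
    for _ in range(n):
        heads = random.random() < p_heads
        total += math.log2(p_heads if heads else 1 - p_heads)
    return -total / n

print(H)                             # ≈ 0.722 bits
print(per_symbol_log_prob(100))      # noisy estimate
print(per_symbol_log_prob(100_000))  # close to H
```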


Comparison 1. Gini impurity


image


① Entropy is not the only such measure; alternatives such as Gini impurity have been proposed.

② Gini impurity weights the probability of mislabeling each class, 1 - P(xi), by that class's probability P(xi): Gini = Σ P(xi) (1 - P(xi)).

③ Both Gini impurity and entropy increase as the given data become more disordered, hence the term impurity.
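
A sketch comparing the two impurity measures on class distributions (helper names are ours):

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini_impurity(probs):
    """Gini = sum P(xi) * (1 - P(xi)) = 1 - sum P(xi)^2."""
    return 1.0 - sum(p * p for p in probs)

pure = [1.0, 0.0]    # perfectly ordered: both measures vanish
mixed = [0.5, 0.5]   # maximally disordered for two classes

print(gini_impurity(pure), entropy(pure))    # both ≈ 0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 and 1.0
```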

④ (Conceptual Distinction) The Gini Coefficient in Economics

○ The concept was first introduced to evaluate income inequality; it can assess inequality in the distribution of any discrete variable, not just income.

○ As the Gini coefficient moved from economics into machine learning and information theory, the intuition of inequality carried over, but the mathematical definition changed.

○ Difference: if the distribution of the variable is entirely uniform,

○ in economics, the distribution is perfectly egalitarian, so the Gini coefficient is 0;

○ in information theory, the Gini impurity attains its maximum value (by the Cauchy-Schwarz inequality, Σ P(xi)² is smallest for a uniform distribution); intuitively, the data are most disordered, so the measure is largest.

Comparison 2. Connectivity index (CI)

① In a spatial lattice, if a specific spot and its nearest neighbor are in different clusters, the CI value increases.

② Therefore, a higher CI value implies higher heterogeneity.

Comparison 3. Simpson index

① Another measure based on information theory, along with Shannon entropy

② Based on spatial transcriptomics, the probability of two random spots belonging to the same cluster (ref)

③ The lower the Simpson index, the higher the heterogeneity implied

Comparison 4. Cluster modularity

① Based on spatial transcriptomics, a metric indicating the purity (homogeneity) of spots within a cluster (ref)

Comparison 5. Lempel-Ziv complexity

Comparison 6. Silhouette coefficient

Comparison 7. Calinski-Harabasz index

Application 1. Markov process


image


① two-state Markov chain


image

image


Figure. 2. Two-state Markov chain


② Hidden Markov model

○ If χ = {Xi} is a Markov process and Yi = ϕ(Xi) (where ϕ is a deterministic function), then y = {Yi} is a hidden Markov model.

Feature 1. If Pr(xn | xn-1) does not depend on n, the Markov process is time-invariant (stationary transition probabilities).

Feature 2. Aperiodic ⊂ irreducible.

Feature 3. Markov process can prove the second law of thermodynamics (law of increasing entropy).

○ Assumes uniform stationary distribution.
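
A sketch of a two-state Markov chain with illustrative transition probabilities (not the values in Figure 2), computing its stationary distribution and entropy rate:

```python
import math

# Transition probabilities: a = P(state 0 -> 1), b = P(state 1 -> 0)
a, b = 0.3, 0.6

# Stationary distribution solves pi P = pi; for two states: pi = (b, a) / (a + b)
pi0, pi1 = b / (a + b), a / (a + b)

def H2(p):
    """Binary entropy function H(p) = -p log2 p - (1 - p) log2 (1 - p)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy rate of a stationary Markov chain: H(X2 | X1) = sum_i pi_i * H(row_i)
entropy_rate = pi0 * H2(a) + pi1 * H2(b)
print((pi0, pi1))     # ≈ (2/3, 1/3)
print(entropy_rate)   # ≈ 0.911 bits per step
```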



2. Modern Information Theory

⑴ History

① Ludwig Boltzmann (1872) : Proposed the H-theorem to explain the entropy of gas particles.

② Harry Nyquist (1924) : Quantified the speed at which “intelligence” is propagated through a communication system.

③ John von Neumann (1927): Extended the Gibbs entropy of statistical mechanics to quantum mechanics (von Neumann entropy).

④ Ralph Hartley (1928): Expressed the number of distinguishable information by using logarithms.

⑤ Alan Turing (1940): Introduced deciban for decrypting the German Enigma cipher.

⑥ Richard W. Hamming (1947): Developed Hamming code.

⑦ Claude E. Shannon (1948): Founded information theory with the paper "A Mathematical Theory of Communication".

⑧ Solomon Kullback & Richard Leibler (1951): Introduced Kullback-Leibler divergence

⑨ David A. Huffman (1951): Developed Huffman encoding.

⑩ Alexis Hocquenghem & Raj Chandra Bose & Dwijendra Kumar Ray-Chaudhuri (1960): Developed BCH code.

⑪ Irving S. Reed & Gustave Solomon (1960): Developed Reed-Solomon code.

⑫ Robert G. Gallager (1962): Introduced low-density parity check.

⑬ Kolmogorov (1963): Introduced Kolmogorov complexity (minimum description length).

⑭ Abraham Lempel & Jacob Ziv (1977): Developed Lempel-Ziv compression (LZ77).

⑮ Claude Berrou & Alain Glavieux & Punya Thitimajshima (1993): Developed Turbo code.

⑯ Erdal Arıkan (2008): Introduced polar code.

Type 1. Source coding (encoding)

① Huffman code

② Shannon code

③ Shannon-Fano-Elias code (alphabetic code)
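
As a sketch of the idea behind Huffman coding (a standard textbook construction, not code from the original), repeatedly merging the two least probable symbols:

```python
import heapq

def huffman_code(freqs):
    """Build a binary prefix code from symbol weights by repeatedly
    merging the two lightest subtrees."""
    # Heap entries: (weight, tie-break counter, {symbol: partial codeword})
    heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + code for s, code in c1.items()}
        merged.update({s: '1' + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_code({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125})
print(codes)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

For these dyadic probabilities the codeword lengths equal -log2 p, so the average length attains the entropy, 1.75 bits.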

Type 2. Compression

① Uniquely decodable (UD)

② Kraft inequality
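
The Kraft inequality states that a binary prefix (hence uniquely decodable) code with codeword lengths l1, ..., lm exists iff Σ 2^(-li) ≤ 1; a one-line check:

```python
def kraft_sum(lengths, D=2):
    """Kraft sum for codeword lengths over a D-ary alphabet:
    a prefix code with these lengths exists iff the sum is <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0 -> a complete prefix code exists
print(kraft_sum([1, 1, 2]))     # 1.25 > 1 -> no prefix code with these lengths
```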

Type 3. Network information theory

Type 4. Quantum information theory



*Input: 2021-11-10 22:28*

*Modified: 2023-03-21 16:22*
