Entropy

General Definition

Entropy is a scientific concept associated with randomness, disorder, and uncertainty. Originating in thermodynamics, where it measures the degree of microscopic randomness in a physical system, entropy has applications in fields such as statistical physics and information theory. In these fields, it quantifies uncertainty or disorder in systems, providing insights for disciplines like chemistry, biology, and information systems.

Introduction to Entropy in Information Theory

The core idea of information theory is that the "informational value" of a communicated message depends on the degree to which the content is surprising. A highly likely event conveys little information, while a highly unlikely event is much more informative. For example, knowing that a specific number will not win a lottery has low informational value, as most numbers will not win. However, knowing a particular number wins the lottery has high informational value due to the rarity of the event.

Surprisal and Self-Information

The information content, or surprisal (also known as self-information), of an event $E$ is a function that increases as the probability $p(E)$ decreases. If $p(E) \approx 1$, the surprisal is low, but if $p(E) \approx 0$, the surprisal is high. This relationship is described by:

\[I(E) = -\log_2(p(E)) \quad \text{or equivalently,} \quad I(E) = \log_2\left(\frac{1}{p(E)}\right),\]

where $\log_2$ denotes the base-2 logarithm, so that information is measured in bits. The formula yields zero surprisal when the probability of the event is 1, meaning the outcome is fully predictable and carries no information.
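
A minimal Python sketch of this relationship (the helper name surprisal is hypothetical) evaluates $-\log_2 p$ for a few probabilities, showing that near-certain events carry almost no information while rare events carry many bits:

    import math

    def surprisal(p):
        """Information content I(E) = -log2 p(E), in bits."""
        return -math.log2(p)

    print(surprisal(1.0))   # 0.0 -- a certain event carries no information
    print(surprisal(0.5))   # 1.0 bit -- one fair coin flip's worth of information
    print(surprisal(1e-6))  # ~19.93 bits -- a very unlikely event is highly informative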

Entropy as Expected Information

Entropy represents the expected (average) amount of information gained by observing the outcome of a random variable.

For instance, rolling a fair six-sided die has higher entropy than flipping a fair coin because each outcome of the die roll has a smaller probability ($p = \frac{1}{6}$) than each outcome of the coin toss ($p = \frac{1}{2}$). For a fixed number of outcomes, entropy is maximized when all outcomes are equally likely: a fair coin toss attains the two-outcome maximum of 1 bit, while a fair die roll attains $\log_2 6 \approx 2.585$ bits.
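
The comparison can be checked numerically. The sketch below (the entropy helper is a hypothetical name) assumes uniform distributions for both the coin and the die and returns 1 bit and $\log_2 6 \approx 2.585$ bits respectively:

    import math

    def entropy(probs):
        """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    coin = [1/2] * 2   # fair coin: two equally likely outcomes
    die = [1/6] * 6    # fair six-sided die: six equally likely outcomes

    print(f"coin: {entropy(coin):.4f} bits")  # 1.0000
    print(f"die:  {entropy(die):.4f} bits")   # 2.5850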

Consider a process with two outcomes: an event with probability $p$ and its complement with probability $1 - p$. The maximum entropy occurs when $p = 0.5$, as neither outcome is more expected than the other. Conversely, entropy is zero when $p = 0$ or $p = 1$, as the outcome is predetermined and thus holds no uncertainty or surprise.
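
A short sketch of this two-outcome (binary) entropy function illustrates the behavior: it peaks at $p = 0.5$ with 1 bit and vanishes at $p = 0$ and $p = 1$ (the endpoints follow the $0 \log 0 = 0$ convention discussed below):

    import math

    def binary_entropy(p):
        """H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    for p in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(f"p = {p:.1f} -> H = {binary_entropy(p):.4f} bits")
    # Maximum of 1 bit at p = 0.5; zero at p = 0 and p = 1.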

In English text, the entropy per character is relatively low due to its predictability. Common letters like ‘e’ and frequent patterns (e.g., ‘th’ or ‘qu’) reduce uncertainty, making the language more compressible. English text entropy typically ranges between 0.6 and 1.3 bits per character, meaning that significant portions of the text can be predicted based on prior knowledge.
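
As a rough illustration, per-character entropy can be estimated from single-character frequencies, as in the sketch below (the sample string is an arbitrary toy example, not a representative corpus). Such a unigram estimate ignores dependencies between characters and therefore lands near 4 bits per character; the much lower 0.6 to 1.3 bits per character figure accounts for longer-range structure such as common digrams, words, and grammar:

    import math
    from collections import Counter

    def unigram_entropy(text):
        """Per-character entropy estimate, in bits, from single-character frequencies."""
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    sample = "the quick brown fox jumps over the lazy dog "
    print(f"{unigram_entropy(sample):.2f} bits/char")  # roughly 4.3 for this toy sample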

Shannon Entropy

Shannon defined the entropy $\mathrm{H}(X)$ of a discrete random variable $X$, which takes values in the set $\mathcal{X}$ with a probability distribution $p : \mathcal{X} \rightarrow [0, 1]$. (The symbol $\mathrm{H}$ is the Greek capital letter eta, a nod to Boltzmann's H-theorem in statistical mechanics.) The entropy $\mathrm{H}(X)$ is defined as the expected value of the information content $\mathrm{I}(X)$:

\[\mathrm{H}(X) = \mathbb{E}[\mathrm{I}(X)] = \mathbb{E}[-\log p(X)]\]

where $\mathbb{E}$ denotes the expected value operator and $\mathrm{I}(X)$ represents the information content of $X$.

The entropy formula can be explicitly written as:

\[\mathrm{H}(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x),\]

where \(b\) is the base of the logarithm. Common choices of $b$ are 2 (giving entropy in bits), Euler’s number $e$ (for nats), and 10 (for bans). The entropy value is non-negative, as the negative sign compensates for the fact that logarithms of probabilities (values between 0 and 1) are negative.

In cases where $p(x) = 0$ for some $x \in \mathcal{X}$, the summand $0 \log_b(0)$ is defined as zero, following the limit: \(\lim_{p \to 0^+} p \log_b(p) = 0.\)
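
A direct Python translation of the formula might look like the following sketch (the function name shannon_entropy is hypothetical); terms with $p(x) = 0$ are simply skipped, which implements the $0 \log_b 0 = 0$ convention:

    import math

    def shannon_entropy(probs, base=2):
        """H(X) = -sum_x p(x) log_b p(x); zero-probability terms are skipped (0 log 0 = 0)."""
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(shannon_entropy([0.5, 0.5]))          # 1.0 bit
    print(shannon_entropy([0.5, 0.5], math.e))  # ~0.6931 nats
    print(shannon_entropy([1.0, 0.0, 0.0]))     # zero -- a certain outcome has no entropy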

Types of Entropy in Information Theory

Joint Entropy: For two discrete random variables $X$ and $Y$, the joint entropy $H(X, Y)$ measures the uncertainty of their combined occurrences:

\[H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)\]

Joint entropy captures the uncertainty associated with the joint state of $X$ and $Y$.
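
For concreteness, the sketch below computes the joint entropy of a small, hypothetical joint distribution over two binary variables:

    import math

    # A hypothetical joint distribution p(x, y) over two binary variables, keyed by (x, y).
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    # H(X, Y) = -sum_{x,y} p(x, y) log2 p(x, y)
    H_xy = -sum(p * math.log2(p) for p in p_xy.values() if p > 0)
    print(f"H(X, Y) = {H_xy:.4f} bits")  # 1.7219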

Conditional Entropy: Conditional entropy $H(X|Y)$, also known as equivocation, measures the uncertainty of $X$ given that $Y$ is known:

\[H(X|Y) = -\sum_{y} p(y) \sum_{x} p(x|y) \log p(x|y)\]

This quantifies the remaining uncertainty about $X$ after observing $Y$.
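
Reusing the same hypothetical joint distribution, the following sketch computes $H(X|Y)$ and checks the standard chain rule $H(X, Y) = H(Y) + H(X|Y)$, which ties conditional entropy back to joint entropy:

    import math
    from collections import defaultdict

    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    # Marginal p(y), needed for the conditional p(x|y) = p(x, y) / p(y).
    p_y = defaultdict(float)
    for (x, y), p in p_xy.items():
        p_y[y] += p

    # H(X|Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y) = -sum_{x,y} p(x, y) log2[p(x, y) / p(y)]
    H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)
    print(f"H(X|Y) = {H_x_given_y:.4f} bits")  # 0.7219

    # Chain rule check: H(X, Y) = H(Y) + H(X|Y)
    H_y = -sum(p * math.log2(p) for p in p_y.values())
    H_xy = -sum(p * math.log2(p) for p in p_xy.values() if p > 0)
    print(abs(H_xy - (H_y + H_x_given_y)) < 1e-12)  # True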

Mutual Information

Definition: Mutual information $I(X; Y)$ quantifies the amount of information gained about one random variable by observing another. For variables $X$ and $Y$, it is defined as:

\[I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\]

It is also expressible as the reduction in uncertainty:

\[I(X; Y) = H(X) - H(X|Y)\]

This shows how knowing $Y$ reduces the uncertainty of $X$.
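
The sketch below evaluates both expressions on the same hypothetical joint distribution used above and confirms that they agree:

    import math
    from collections import defaultdict

    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    # Marginals p(x) and p(y).
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p

    # Definition: I(X;Y) = sum_{x,y} p(x, y) log2[ p(x, y) / (p(x) p(y)) ]
    I_def = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

    # Equivalent form: I(X;Y) = H(X) - H(X|Y)
    H_x = -sum(p * math.log2(p) for p in p_x.values())
    H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

    print(f"{I_def:.4f}  {H_x - H_x_given_y:.4f}")  # both print 0.2781 bits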

Kullback-Leibler Divergence (Information Gain)

Definition: The Kullback-Leibler (KL) divergence $D_{KL}(p \| q)$ measures how one probability distribution $p$ diverges from a reference distribution $q$:

\[D_{KL}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}\]

KL divergence is not symmetric: in general $D_{KL}(p \| q) \neq D_{KL}(q \| p)$. It can be interpreted as the expected number of extra bits needed to encode samples drawn from $p$ when using a code optimized for $q$ instead of one optimized for $p$.
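
The following sketch (the kl_divergence helper and the two distributions p and q are made up for illustration) computes the divergence in both directions, making the asymmetry visible:

    import math

    # Two hypothetical distributions over the same three outcomes.
    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]

    def kl_divergence(p, q):
        """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    print(f"D(p||q) = {kl_divergence(p, q):.4f} bits")
    print(f"D(q||p) = {kl_divergence(q, p):.4f} bits")  # differs: KL divergence is not symmetric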

Directed Information

Definition: Directed information $I(X^n \to Y^n)$ quantifies the flow of information from a process $X^n = \{X_1, X_2, \dots, X_n\}$ to $Y^n = \{Y_1, Y_2, \dots, Y_n\}$, accounting for causality:

\[I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1})\]

where $I(X^i; Y_i | Y^{i-1})$ is the conditional mutual information between $X^i = \{X_1, \dots, X_i\}$ and $Y_i$, given the past values $Y^{i-1} = \{Y_1, \dots, Y_{i-1}\}$. Directed information is used in communication systems involving feedback and causality.
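
As a sketch only (the marginal and directed_information helpers are hypothetical, and the joint distribution of $(X^n, Y^n)$ is assumed to be available as a table), the code below evaluates the sum of conditional mutual informations term by term; for a noiseless channel $Y_i = X_i$ driven by i.i.d. fair input bits it returns $n$ bits:

    import math
    from collections import defaultdict
    from itertools import product

    def marginal(pmf, keep):
        """Marginalize a joint pmf (dict: outcome tuple -> prob) onto the index list `keep`."""
        out = defaultdict(float)
        for outcome, prob in pmf.items():
            out[tuple(outcome[i] for i in keep)] += prob
        return out

    def directed_information(pmf, n):
        """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1}), in bits, for a pmf over
        flat outcome tuples (x_1, ..., x_n, y_1, ..., y_n)."""
        total = 0.0
        for i in range(1, n + 1):
            xs = list(range(i))                  # indices of X_1 .. X_i
            yi = [n + i - 1]                     # index of Y_i
            ypast = list(range(n, n + i - 1))    # indices of Y_1 .. Y_{i-1}
            p_abc = marginal(pmf, xs + yi + ypast)
            p_c = marginal(pmf, ypast)
            p_ac = marginal(pmf, xs + ypast)
            p_bc = marginal(pmf, yi + ypast)
            # I(A; B | C) = sum p(a,b,c) log2[ p(a,b,c) p(c) / (p(a,c) p(b,c)) ]
            for outcome, prob in p_abc.items():
                if prob == 0:
                    continue
                a, b, c = outcome[:i], outcome[i:i + 1], outcome[i + 1:]
                total += prob * math.log2(prob * p_c[c] / (p_ac[a + c] * p_bc[b + c]))
        return total

    # Demo: a noiseless forward channel Y_i = X_i with i.i.d. fair input bits, n = 2.
    pmf = {(x1, x2, x1, x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}
    print(directed_information(pmf, 2))  # 2.0 bits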