1. Information theory + ML. Entropy / Habr

I've long wanted to create educational materials on the topic of Information Theory + Machine Learning. I found some old drafts and decided to polish them up here, on Habr.

Information Theory and Machine Learning seem to me like an interesting pair of fields, the deep connection between which is often unknown to ML engineers, and whose synergy has not yet been fully revealed.

When I tell colleagues that LogLoss on the test set and Mutual Information between the prediction and the predicted value are directly related, and that you can get one from the other using a simple formula, they often express genuine surprise. There is a link from the LogLoss Wikipedia page to Mutual Information, but that's about it.

Information Theory can become (or rather, has become) a source of understanding how and why different ML methods work, especially neural networks, and can also provide ideas for improving gradient-based optimization methods.

Let's start with basic concepts like Entropy, Information in a message, Mutual Information, and channel capacity. Next, there will be materials on the similarity between tasks of maximizing Mutual Information and minimizing Loss in regression problems. Then there will be a section on Information Geometry: Fisher metric, geodesics, gradient methods, and their connection to Gaussian processes (moving along the gradient using SGD is moving along the geodesic with noise).

It's also necessary to touch upon AIC, Information Bottleneck, and discuss how information flows in neural networks – Mutual Information between layers (Information Theory of Deep Learning, Naftali Tishby), and much more. It's not certain that I'll be able to cover everything listed, but I'll try to get started.

1. Basic Definitions

$H({p_1, ..., P_M}) = \sum^M_{i = 1} p_i\log(1/p_i)\;\;\;\;\;\;\;(1)$ $\sum_{i=1}^M p_i =1, \;\; p_i > 0.\;\;\;\;\;\;$

Let's describe them. We'll start with basic definitions

Definition 1.1: Uncertainty is the logarithm base 2 of the number of equiprobable possibilities: $H = \log_2{N}$ . It is measured in bits. For example, the uncertainty of an unknown binary word of length k is equal to k.

Logarithm

The logarithm base 2 of a number is how many times you need to divide the number by 2 to get a number less than or equal to 1. For example,

$\log_2(1) = 0, ; \log_2(2) = 1, ; \log_2(4) = 2,; \log_2(8) = 3,; \log_2(1024) = 10.$

For non-powers of two, this function smoothly extends. For example,

$\log_2(3) = 1.5849\ldots, ; \log_2(10) = 3.3219\ldots$

An important property of logarithms:

$\log(x) - \log(y) = \log(x/y).$

Try to derive it from the non-strict definition above for the case when and are powers of two.

Binary words

Binary words of length k are sequences of zeros and ones of length k. Each character in a binary words is called a bit. For example, here are binary words of length 5:

00000, 00001, 00010, 00011, 00100, 00101, ... , 11100, 11101, 111110, 11111

There are 32 of them. $\log_2(32) = 5$ , which means the uncertainty of a 5-bit word is 5.

That's exactly the number of unknown bits in an unknown 5-bit binary word.

Thus, each symbol in a binary word is called a "bit," but a "bit" is also a unit of uncertainty measurement. And this example helps to understand why.

For brevity, $\log_2$ is simply denoted as $\log$ throughout.

Definition 1.2: Information in a message is the difference in uncertainty before receiving the message and after.

$I(message) = H(before) - H(after)\;\;\;\;\;\;\;\; (2)$

This is also measured in bits.

For example, Vanya has chosen a number from 1 to 100. We are told that it is less than or equal to 50. The uncertainty before the message is $\log⁡100$ , and after the message, it is $\log ⁡50$ . Therefore, there is 1 bit of information in this message. By skillfully asking binary questions (questions to which the answer is either YES or NO), you can extract exactly 1 bit of information.

Some questions are inefficient. For instance, the question "Is the number less than or equal to 25?" reduces uncertainty by $\log⁡100−\log⁡25=\log⁡(100/25)=2$ bits with a probability of 0.25, and with a probability of 0.75, it only reduces it by $\log⁡(100)−\log⁡(75)=\log⁡(100/75)$ bits, which on average is $0.25\log⁡(1/0.25)+0.75\log⁡(1/0.75)=0.811278$ bits. If you divide the set of possibilities in a ratio of , with your question, then the average amount of information in the answer is $x\log⁡(1/x)+(1−x)\log⁡(1/(1−x))$ .

We will denote this expression as or . Here, as programmers, we overload the function H to work in two cases - when a pair of numbers is input, the sum of which is 1, and when one number is from the interval [0, 1].

Definition 1.3: $H(\{x, 1-x\}) = H(x) = x \log (1/x) + (1 - x ) \log(1/(1-x))$

Plot H(x)

It's evident that by asking binary questions, you can extract a maximum of 1 bit of information on average (the maximum value of the function . Let's reiterate: yes, you can directly ask questions like "Is the number equal to 57?" and if you're lucky, you can get log(100) bits of information. However, if luck is not on your side, you will only get log(100/99) bits of information. The average number of bits of information for such types of questions is $(1/100) \cdot \log(100) + (99/100) \cdot \log(100/99) = 0.08079...$ which is noticeably less than 1.

In this example, there are 100 possibilities, which means the initial uncertainty is $\log⁡(100)$ – this is how many binary questions you, on average, need to ask Vanya to find out the answer. However, the number is not an integer, so you need to round it up.

If we ask not binary questions but questions that imply a natural number from 1 to M as the answer, we can get more than one bit of information in a single response. If we ask a question for which all M answers are equally likely, then on average, we will get $\log M$ bits. However, if the probability of answer i is p(i), then the average number of bits in the response will be:

$H(\{p_1, \ldots, p_M\}) = \sum_{i=1}^M p_i \log(1 / p_i).\;\;\;\;\;\;\;\;(1)$

Definition 1.4: The entropy of a discrete distribution $P=\{p_i\}_{i=1}^M$ is defined by the formula (1) above.

Here, we have overloaded the function H for the case when a discrete distribution is provided as input.

SO: The first straightforward way to arrive at the formula for entropy is to calculate the average number of bits of information in response to a question with equiprobable answers.

Let's go further. Suppose Vanya doesn't choose a number but samples it from the distribution $P=\{p_i\}_{i=1}^M$ . How many binary questions do we need to ask Vanya to determine the sampled number? Intuitively, we need to divide the set of possibilities into two subsets, not necessarily of equal size but with approximately equal total probability, and ask Vanya which of the two subsets contains the sampled number. Get the answer and continue in the same manner with the reduced set of possibilities - again dividing it into two subsets with roughly equal probability, asking Vanya which of the two subsets contains the sampled number, and so on. The idea is good but not quite practical. It turns out it's more effective to work backward and start building the partition tree from the bottom. Specifically, find the two least probable outcomes and merge them into a single new outcome, thus reducing the number of possibilities by 1. Then, once again, find the two least probable outcomes and merge them into a new one, and so on, ultimately constructing a binary tree. The numbers are located in the leaves of this tree. The internal nodes are labeled with sets of numbers from the subtree they represent, and the root is labeled with the set of all numbers. This tree provides an algorithm for how to ask Vanya questions. You should start at the root of the tree and ask Vanya which way to go - left or right (which of the two children subsets contains the sampled number) according to this tree structure. This tree not only provides the recipe for the fastest average method of guessing the number but also serves as the Huffman algorithm for data compression.

Task 1.1: Study the Huffman code. Prove that a text with an original length of characters, in its compressed form, has a length bounded from below by $N \cdot H(P)$ bits, and under favorable circumstances, it achieves this bound.

So, the formula arises when solving the problem of finding the minimum average number of binary questions needed to determine the outcome of a random variable with distribution .

This is the second way to arrive at formula (1).

For a random variable , we will use the following notations for the entropy of its distribution $P=\{{p_i}\}_{i=1}^M$ (once again, "overloading" the function ):

$H(\xi)$ or or $H({\{p_i}\}_{i=1}^M)$ or $H(\{p_i\}_i)$

There's yet another, a third, simple way to arrive at the formula for entropy, but it requires knowledge of the Stirling formula.

Task 1.2: You have an unknown binary word of length k (a sequence of ones and zeros, a total of characters). You are told that it contains 35% ones. What is the value of for large $k \; ?$

Answer

Approximately $\log 2 - (p \log(1 / p) + (1 - p) \log(1 / (1 - p))) = \log 2 - H(\{p, 1-p\})$ , where

Task 1.3: Vanya has an unknown word of length k in an alphabet of size M. He has provided the proportions of all the letters in the word as $P = {p_1, ..., p_M}$ .
What is the value of I(message) / k for large k?

Answer

$\approx \log M - \sum_{i=1}^M{p_i \log(1 / p_i)} = \log M - H(P)$

So, Task 1.3 is indeed the third way to arrive at formula (1).

Definition 1.5: Information in a message about a certain random variable is the difference in entropies: $I(message) = H(P_{apriori}) - H(P_{aposteriori}).$

The values of a random discrete variable can be seen as letters, and each successive letter in a word is just another measurement of the random variable. So, the information in a message about a certain random variable is the number of bits of information about measurements of that random variable, normalized by the number of measurements.

Problem 1.4: What is the entropy of the discrete distribution $P = {\{\lambda ^ i \cdot (1 - \lambda)}\}_{i = 0}^{\infty}$ ? How much information is contained in the message $\xi \ne 0$ , where $\xi$ has the distribution ?

Answer

$H(P) = H(\lambda) / (1 - \lambda)$ $I(``\xi \ne 0'') = 0$

This result requires acceptance. How can this be? – We were given seemingly non-zero information, eliminated the most probable option from the possibilities. However, the uncertainty about the remaining possibilities remains the same, so the formula yields an answer of 0.

Task 1.5: Provide an example of a finite distribution and a message that does not reduce but increases uncertainty.

Ответ

$P = \{0.8, 0.1,0.1\}$ , and message = "this is not the first element". Then $P_{after} = \{0, 0.5, 0.5\},\ H(P) = 0.9219,\ H(P_{after}) = 1.0$

The ancient wisdom "in much wisdom is much grief" in this context gets another interesting interpretation: the modern world, science, and human life are such that new "messages" about history and the structure of the world only increase uncertainty.

Discrete distributions on a countable set of values that decay exponentially (geometric progressions) have the property of unchanged uncertainty when receiving information, meaning that among the first elements, there is no correct answer. Distributions with less than exponential decay (e.g., $p_k={C / k^2}$ ) only increase uncertainty when discarding the first elements.

Problem 1.6: Write the formula for the entropy of the Poisson distribution

$p_k=e^{-\lambda}{\lambda^k \over k!}, k=0,;1,\ldots,+\infty$

Find a simple approximation for large $\lambda$ .

Answer

$H(\{p_k\}_{k=0}^{\infty})={1\over 2} \log (2\pi e \lambda) - {1\over 12 \lambda}-{1\over 24 \lambda^2}+O(1/\lambda^3)$

Task 1.7: Given the distribution $\rho(x)$ of a real random variable. Let $Q(\rho,\varepsilon)$ be the average number of binary questions needed to determine some outcome of the random variable with an accuracy of $\varepsilon$ . Find an approximate expression for $Q(\rho,\varepsilon)$ for small values of $\varepsilon$ .

Answer

$\sum_i \rho(x_i) \cdot \varepsilon \cdot \log (1 / (\rho(x_i) \cdot \varepsilon) ) \approx \log(1/\varepsilon) - \int_{-\infty}^{+\infty} \rho(x)\log(\rho(x))dx =$ $= \log(1/\varepsilon) + H(\rho)$

(see definition of entropy of a continuous distribution below).

Definition 1.6: The entropy of a continuous distribution is given by

$H(\rho) = -\int_X \rho(x) \log(\rho(x)) dx$

Here, we have once again overloaded the symbol H for the case when the argument is a probability density function (PDF).

Task 1.8: Given two distributions $\rho_1(x)$ and $\rho_2(x)$ of two real random variables. What does the difference $Q(\rho_1, \varepsilon) - Q(\rho_2, \varepsilon)$ approach as $\varepsilon\to0$ ?

Answer

$H(\rho_1) - H(\rho_2)$

Task 1.9: What is the entropy of a normal distribution $\mathcal{N}(\mu, \sigma)$ ?

Answer

$1/2 \cdot (1 + \log(2\pi \sigma^2))$

Task 1.10: Write down the formula for the entropy of an exponential distribution $\rho(x)=e^{-x/a}/a$ .

Problem 1.11: A random variable $\xi$ is a mixture of two random variables, meaning its distribution $\rho(x)$ is a weighted sum of distributions:

$\rho(x) = w_1\rho_1(x) + w_2\rho_2(x),\;\;\;\;w_1 + w_2 = 1$

Let the set of values taken by $\xi_1$ not overlap with the set of values taken by $\xi_2$ , in other words, let the supports of these two random variables not intersect. Find an expression for the entropy $H(\xi)$ in terms of the entropies $H(\xi_1)$ and $H(\xi_2)$ .

Answer

$H(\rho) = -\int_{X_1+X_2} \rho(x) \log(\rho(x)) dx \\=-\int_{X_1+X_2} (w_1\rho_1(x)+ w_2\rho_2(x))\log(w_1\rho_1(x) + w_2\rho_2(x)) dx =\\ =-\int_{X_1} w_1\rho_1(x) \log(w_1\rho_1(x)) dx-\int_{X_2} w_2\rho_2(x) \log(w_2\rho_2(x)) dx$

The last equality here is possible only because the supports of and from the two distributions do not overlap as stated in the problem. Next, we will transform this expression into

$-w_1\int_{X_1} \rho_1(x) \log(w_1\rho_1(x))- w_2\int_{X_2} \rho_2(x) \log(w_2\rho_2(x))dx=\\=w_1\cdot H_1 + w_2\cdot H_2 -w_1\log w_1 - w_2\log w_2$

In this problem, the goal was to show that even in the simple case of non-overlapping supports, entropies do not simply add up with their respective weights, but there is an additional term $-(w_1\log w_1 + w_2 \log w_2)$ . When the weights are equal to 1/2, this additional term is equal to 1.

The interpretation of the formula is as follows: when an outcome is measured, with a probability of , it lies in , and with a probability of , it lies in . Consequently, we incur an uncertainty of values on the set , or an uncertainty of on the set . However, to determine in which set the measurement lies, on average, we will need to ask $-(w_1\log w_1 + w_2 \log w_2)$ questions.

From this, it follows, in particular, that a mixture with coefficients of 1/2 of two normal random variables with the same variance, but significantly different means, has an entropy, that is 1 greater than the entropy of a single normal distribution. The supports of normal random variables span the entire real line, which means they overlap, but in the case of significantly different means, this overlap can be neglected.

Task 1.12: A random variable $\xi$ is equally likely to be 0 or 1. Another random variable $\nu$ depends on $\xi$ : if $\xi=0$ , then $\nu$ is sampled from $\mathcal{N}(0, 1)$ , and if $\xi=1$ , then $\nu$ is sampled from $\mathcal{N}(\mu, 1)$ . How many bits of information about the random variable $\nu$ are contained in the message $``\xi=0"$ (as a function of $\mu$ )?

Answer

There is a numerical answer to this:

It's clear that the graph starts at 0 and approaches 1:

When $\mu=0$ , both Gaussians are identical, and the message about which one was chosen provides no information.
When $\mu=5$ , we have a mixture of two Gaussian distributions with widely separated centers. The message about the value of $\xi$ indicates which of the two "bell curves" the answer is in, reducing the number of possibilities by approximately half, and uncertainty decreases by 1. The "approximation" is related to the fact that the "curves" overlap, but the overlap size quickly decreases with increasing $\mu$ .
In the vicinity of $\mu=0$ , there is a quadratic growth, roughly $(\mu/2.37)^2$ .
The approach to 1 occurs approximately according to the law $1-e^{-(\mu/2.74)^2 - 0.21}.$

Task 1.13: The random variable $\xi$ is structured as follows: first, a number $\lambda$ is sampled from an exponential distribution with mean $\lambda_0=10$ , and then a random number is sampled from a Poisson distribution with parameter $\lambda$ . We received a message that one measurement yielded . How many bits of information did we gain about the random variable $\lambda$ ? Provide a numerical answer. How many bits of information will the sequence of measurements 10, 9, 11, 8, 10 provide?

$I(``x=10") = (4861 - 2520\cdot \gamma - 252\cdot \log(3628800/11))/(252\cdot \log(2)) \approx 1.17. \\ I(``x=10, 9, 11, 8, 10") = 3.729.$

Task 1.14: The random variable $\xi$ is constructed as follows: first, a number is sampled once from a beta distribution with parameters $\alpha=9, \beta=1$ , and then a random number is sampled from a binomial distribution with parameters $n=100, \;p$ . We received a message that one measurement of $\xi$ resulted in 10 (meaning 10 out of 100 coin flips resulted in heads). How many bits of information did we gain about the hidden random variable ? Provide a numerical answer. How many bits of information will a sequence of measurements 10, 9, 11, 8, 10 provide?

Task 1.15: Random variables $p_1,\; p_2,\;\ldots,p_{10}$ are sampled from a beta distribution with parameters $\alpha=9, \beta=1$ . The actual values of $p_1,\; p_2,\;\ldots,p_{10}$ are unknown to us, but we have been given 10 measurements $x_1,\;x_2,\;\ldots,x_{10}$ from 10 binomial distributions with parameters $n=100, \;p_i$ , and this is our knowledge about $p_1,\; p_2,\;\ldots,p_{10}$ . How many bits of information will we gain on average about a random binomial variable with parameters $n=100, \;p_i$ when we are told the value of ? What if $p_1,\; p_2,\;\ldots,p_{10}$ are known with absolute certainty (i.e., $n=\infty$ )? What if 10 is replaced with $\infty$ ?

This problem can be formulated in the language of machine learning as follows: we have a categorical feature to predict a binary variable ('will the user click on the banner or not'). How good will our prediction be if, in the training data, we only have historical data about the number of clicks and non-clicks for these 10 categories?

Task 1.16: A random variable $\xi$ follows a normal distribution $\mathcal{N}(a,\varepsilon^2)$ . The value of is unknown to us, but we know that it was sampled from a distribution $\mathcal{N}(0,\sigma^2)$ . On average, how much information will we gain about from the first measurement of $\xi$ ? How will the amount of information gained grow with the number of measurements?

Answer

The initial variance is equal to $\sigma^2$ . After measurements, the variance decreases to

$\sigma_m^2=\left({1\over \sigma^2} + {m \over \varepsilon^2}\right)^{-1}$ .

From the initial entropy $C + 0.5\cdot \log(\sigma^2)$ , subtract the final entropy $C + 0.5\cdot \log(\sigma_m^2)$ , and we get

${1\over 2} \cdot \log\left(1 + m\cdot {\varepsilon^2 \over \sigma^2}\right)$

Thus, for large , the information grows as $c + {1\over 2}\log(m)$ , and the error $\sigma_m$ decreases approximately proportionally to $1/\sqrt{m}$ .

This means that the number of correct digits in the decimal representation of the number grows as $\log(\sqrt{m}) /\log(10)$ . If you want to get one more correct decimal digit of the number , you need to increase the number of measurements by a factor of 100.

Continuation:

Part 2 - Mutual Information: In this part, the concept of Mutual Information is explained. It opens doors to error-correcting codes, compression algorithms, and provides a new perspective on Machine Learning problems.
Part 3 - ML & Mutual Information: Fundamentals of Machine Learning in the context of information theory.