1. Information theory + ML. Entropy
I've long wanted to create educational materials on the topic of Information Theory + Machine Learning. I found some old drafts and decided to polish them up here, on Habr.
Information Theory and Machine Learning strike me as an interesting pair of fields whose deep connection often comes as a surprise to ML engineers, and whose synergy, in my view, has not yet been fully revealed.
When I tell colleagues that LogLoss on the test set and Mutual Information between the prediction and the predicted value are directly related, and that you can get one from the other using a simple formula, they often express genuine surprise. There is a link from the LogLoss Wikipedia page to Mutual Information, but that's about it.
Information Theory can become (or rather, has become) a source of understanding how and why different ML methods work, especially neural networks, and can also provide ideas for improving gradient-based optimization methods.
Let's start with basic concepts like Entropy, Information in a message, Mutual Information, and channel capacity. Next, there will be materials on the similarity between tasks of maximizing Mutual Information and minimizing Loss in regression problems. Then there will be a section on Information Geometry: Fisher metric, geodesics, gradient methods, and their connection to Gaussian processes (moving along the gradient using SGD is moving along the geodesic with noise).
It's also necessary to touch upon AIC, Information Bottleneck, and discuss how information flows in neural networks – Mutual Information between layers (Information Theory of Deep Learning, Naftali Tishby), and much more. It's not certain that I'll be able to cover everything listed, but I'll try to get started.
1. Basic Definitions
We'll start with the basic definitions.
Definition 1.1: Uncertainty is the logarithm base 2 of the number of equiprobable possibilities:

$$\text{Uncertainty} = \log_2(\text{number of possibilities}).$$
Logarithm
The logarithm base 2 of a number is how many times you need to divide the number by 2 to get a number less than or equal to 1. For example, $\log_2 8 = 3$: three halvings take $8 \to 4 \to 2 \to 1$.

For non-powers of two, this function extends smoothly. For example, $\log_2 5 \approx 2.32$.

An important property of logarithms:

$$\log_2(a \cdot b) = \log_2 a + \log_2 b.$$

Try to derive it from the non-strict definition above for the case when $a$ and $b$ are powers of two.
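To make the "repeated halving" definition tangible, here is a tiny sketch comparing it with the exact `math.log2` (the helper name `halvings` is mine):

```python
import math

def halvings(n: float) -> int:
    """How many times n must be halved to reach a value <= 1."""
    count = 0
    while n > 1:
        n /= 2
        count += 1
    return count

print(halvings(8), math.log2(8))   # 3 3.0   -- power of two: exact match
print(halvings(5), math.log2(5))   # 3 2.32  -- integer count vs the smooth extension
# The key property log2(a*b) = log2(a) + log2(b):
print(math.log2(4 * 8), math.log2(4) + math.log2(8))  # 5.0 5.0
```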
Binary words
Binary words of length $k$ are sequences of zeros and ones of length $k$. Each character in a binary word is called a bit. For example, here are the binary words of length 5:

00000, 00001, 00010, 00011, 00100, 00101, ..., 11100, 11101, 11110, 11111

There are $2^5 = 32$ of them, and $\log_2 32 = 5$. That's exactly the number of unknown bits in an unknown 5-bit binary word.

So each symbol in a binary word is called a "bit," but a "bit" is also the unit in which uncertainty is measured, and this example shows why the same word fits both.
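A three-line check of this correspondence (a minimal sketch):

```python
from itertools import product
from math import log2

words = [''.join(bits) for bits in product('01', repeat=5)]
print(len(words))        # 32
print(log2(len(words)))  # 5.0 -- the uncertainty of an unknown 5-bit word, in bits
print(words[:3], '...', words[-1])  # ['00000', '00001', '00010'] ... '11111'
```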
For brevity, we will write $\log x$ to mean the base-2 logarithm $\log_2 x$.
Definition 1.2: Information in a message is the difference in uncertainty before receiving the message and after.
This is also measured in bits.
For example, Vanya has chosen a number from 1 to 100. We are told that it is less than or equal to 50. The uncertainty before the message is $\log_2 100$, the uncertainty after is $\log_2 50$, so the information in the message is $\log_2 100 - \log_2 50 = \log_2 2 = 1$ bit.

Some questions are inefficient. For instance, the question "Is the number less than or equal to 25?" reduces uncertainty by $\log_2 100 - \log_2 25 = 2$ bits if the answer is "yes" (which happens with probability $1/4$), and by $\log_2 100 - \log_2 75 = \log_2\frac{4}{3} \approx 0.415$ bits if the answer is "no" (probability $3/4$). On average, such a question yields $\frac{1}{4}\cdot 2 + \frac{3}{4}\cdot\log_2\frac{4}{3} \approx 0.811$ bits, which is less than 1.

We will denote this expression as $H(1/4)$, a value of the following function.

Definition 1.3: $H(x) = x\log_2\frac{1}{x} + (1 - x)\log_2\frac{1}{1 - x}$ — the average information in the answer to a yes/no question whose "yes" probability is $x$.
[Plot of H(x)]
It's evident that by asking binary questions, you can extract at most 1 bit of information per answer on average (the maximum value of the function $H(x)$ is $H(1/2) = 1$).

In this example, there are 100 possibilities, which means the initial uncertainty is $\log_2 100 \approx 6.64$ bits, so no strategy of yes/no questions is guaranteed to finish in fewer than 7 questions.
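A sketch of the halving strategy, assuming the usual binary search (the function name `questions_needed` is mine): for 100 options it always finishes within $\lceil\log_2 100\rceil = 7$ questions, matching the bound of at most 1 bit per answer.

```python
from math import ceil, log2

def questions_needed(secret: int, lo: int = 1, hi: int = 100) -> int:
    """Count questions of the form 'is it <= mid?' until the number is pinned down."""
    count = 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1
        if secret <= mid:
            hi = mid
        else:
            lo = mid + 1
    return count

worst = max(questions_needed(s) for s in range(1, 101))
print(worst, ceil(log2(100)))  # 7 7
```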
If we ask not binary questions but questions that imply a natural number from 1 to $M$ as the answer, we can get more than one bit of information in a single response. If we ask a question for which all $M$ answers are equally likely, then on average we will get $\log_2 M$ bits of information.
Definition 1.4: The entropy of a discrete distribution $p = (p_1, \ldots, p_M)$ is

$$H(p) = \sum_{i=1}^{M} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{M} p_i \log_2 p_i. \qquad (1)$$
Here, we have overloaded the function H for the case when a discrete distribution is provided as input.
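Formula (1) is one line of code. A minimal sketch (the helper name `entropy` is mine); note how the last example reproduces the 0.811 bits of the uneven question above:

```python
from math import log2

def entropy(p):
    """Entropy of a discrete distribution, in bits. Zero probabilities contribute 0."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit  -- a fair coin
print(entropy([0.25] * 4))     # 2.0 bits -- four equally likely options
print(entropy([0.25, 0.75]))   # ~0.811   -- the "is it <= 25?" question from above
```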
So: the first, straightforward way to arrive at the formula for entropy is as the average information in the answer to a question about the value of a random quantity.
Let's go further. Suppose Vanya doesn't choose a number but samples it from the distribution $p = (p_1, \ldots, p_M)$, and we try to guess it with as few yes/no questions as possible on average.
Task 1.1: Study the Huffman code. Prove that a text with an original length of $k$ letters, in which the letters occur with frequencies $p = (p_1, \ldots, p_M)$, can be encoded in binary with roughly $k \cdot H(p)$ bits, i.e., $H(p)$ bits per letter on average.
So, formula (1) also answers the question of how many bits per letter are needed, on average, to encode a long text with letter frequencies $p = (p_1, \ldots, p_M)$.
This is the second way to arrive at formula (1).
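For readers who want to poke at Task 1.1 empirically: below is a minimal Huffman coder (a sketch on plain `heapq`, not production code) that computes code lengths and compares the average length with $H(p)$. On a dyadic distribution the two match exactly.

```python
import heapq
from math import log2

def huffman_lengths(freqs):
    """Code length per symbol of a {symbol: probability} dict, via Huffman's algorithm."""
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    counter = len(heap)  # tiebreaker so tuples never compare the list part
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge adds one bit to all symbols inside
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths

p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
lengths = huffman_lengths(p)
avg = sum(p[s] * lengths[s] for s in p)
H = -sum(w * log2(w) for w in p.values())
print(lengths)   # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(avg, H)    # 1.75 1.75 -- p is dyadic here, so the code is exactly optimal
```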
For a random variable $\xi$ that takes discrete values with probabilities $p = (p_1, \ldots, p_M)$, we write $H(\xi) = H(p)$.
There's yet another, a third, simple way to arrive at the formula for entropy, but it requires knowledge of Stirling's formula.
Task 1.2: You have an unknown binary word of length $k$ (a sequence of ones and zeros, a total of $2^k$ possibilities). You are then told that the word contains exactly $p \cdot k$ ones. How much information is in this message for large $k$?
Answer
Approximately $k(1 - H(p))$ bits: by Stirling's formula, the number of binary words of length $k$ with exactly $pk$ ones is $\binom{k}{pk} \approx 2^{kH(p)}$, so the uncertainty drops from $k$ bits to roughly $kH(p)$ bits.
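The Stirling estimate is easy to sanity-check numerically: $\frac{1}{k}\log_2\binom{k}{pk}$ approaches $H(p) \approx 0.8113$ as $k$ grows. A small sketch:

```python
from math import comb, log2

def H(p):
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

p = 0.25
for k in (100, 1_000, 10_000):
    m = int(p * k)
    print(k, log2(comb(k, m)) / k, H(p))
# log2(C(k, pk))/k -> H(p), so I(message)/k -> 1 - H(p)
```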
Task 1.3: Vanya has an unknown word of length $k$ in an alphabet of size $M$. He has provided the proportions of all the letters in the word as $p = (p_1, \ldots, p_M)$.
What is the value of I(message) / k for large k?
Answer

$\log_2 M - H(p)$: the initial uncertainty is $k \log_2 M$, and by Stirling's formula the number of words with the given letter proportions is approximately $2^{kH(p)}$, so the remaining uncertainty is about $kH(p)$.
So, Task 1.3 is indeed the third way to arrive at formula (1).
Definition 1.5: Information in a message about a certain random variable is the difference in entropies:

$$I(\text{message}) = H(\text{distribution before}) - H(\text{distribution after}).$$
The values of a discrete random variable can be seen as letters, and each successive letter in a word as another measurement of the random variable. So the information in a message about a random variable is the number of bits of information the message carries about a long word of such measurements, normalized by the number of measurements.
Task 1.4: What is the entropy of the discrete distribution $p_i = 1/2^i$, $i = 1, 2, \ldots$? How much information is in the message that the value is not 1, i.e., that the most probable option is eliminated?

Answer

$H(p) = \sum_{i \ge 1} \frac{i}{2^i} = 2$ bits. After the message, the renormalized distribution over the remaining values $2, 3, \ldots$ is again $(1/2, 1/4, 1/8, \ldots)$, so the entropy is still 2 bits and the information in the message is 0.

This result requires acceptance. How can this be? We were given seemingly non-zero information and eliminated the most probable option. However, the uncertainty about the remaining possibilities remains exactly the same, so the formula $I = H(\text{before}) - H(\text{after})$ gives zero.
Task 1.5: Provide an example of a finite distribution and a message that does not reduce but increases uncertainty.
Answer

Take $p = (0.9, 0.05, 0.05)$: its entropy is $H(p) \approx 0.569$ bits. The message "the value is not the first option" leaves the uniform distribution $(1/2, 1/2)$ with entropy 1 bit, so the uncertainty has increased.
The ancient wisdom "in much wisdom is much grief" in this context gets another interesting interpretation: the modern world, science, and human life are such that new "messages" about history and the structure of the world only increase uncertainty.
Discrete distributions on a countable set of values that decay exponentially (geometric progressions) have the property that the message "the correct answer is not among the first $n$ values" leaves the uncertainty unchanged. Distributions that decay slower than exponentially (e.g., $p_i \propto 1/i^2$) respond to such a message with an increase in uncertainty, while distributions that decay faster respond with a decrease.
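A quick numeric illustration of this (a sketch; both series are truncated at 59 terms, which barely affects the values):

```python
from math import log2

def entropy(p):
    """Entropy of a discrete distribution, in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def drop_first(p):
    """Condition on 'the answer is not the first option' and renormalize."""
    rest = p[1:]
    total = sum(rest)
    return [x / total for x in rest]

# Geometric distribution p_i = 1/2^i: entropy stays ~2 bits after the message.
geo = [2.0 ** -i for i in range(1, 60)]
print(entropy(geo), entropy(drop_first(geo)))

# Power-law tail p_i ~ 1/i^2: removing the head *increases* the entropy.
s = sum(1 / i**2 for i in range(1, 60))
pow_law = [1 / i**2 / s for i in range(1, 60)]
print(entropy(pow_law), entropy(drop_first(pow_law)))
```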
Task 1.6: Write the formula for the entropy of the Poisson distribution $p_k = \frac{\lambda^k e^{-\lambda}}{k!}$, $k = 0, 1, 2, \ldots$

Find a simple approximation for large $\lambda$.

Answer

$H(\lambda) = \lambda \log_2 \frac{e}{\lambda} + e^{-\lambda} \sum_{k \ge 0} \frac{\lambda^k \log_2 k!}{k!}$. For large $\lambda$, the Poisson distribution is close to the normal distribution $\mathcal{N}(\lambda, \lambda)$, so $H(\lambda) \approx \frac{1}{2}\log_2(2\pi e \lambda)$.
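The approximation is easy to verify numerically (a sketch; the summation cutoff `kmax` is a heuristic of mine, and `lgamma` is used to avoid overflow in $k!$):

```python
from math import e, exp, lgamma, log, log2, pi

def poisson_entropy(lam, kmax=None):
    """Entropy of Poisson(lam) in bits, via a truncated sum over k."""
    kmax = kmax or int(lam + 20 * lam ** 0.5 + 20)
    H = 0.0
    for k in range(kmax):
        logp = -lam + k * log(lam) - lgamma(k + 1)  # natural log of P(X = k)
        H -= exp(logp) * (logp / log(2))            # -p * log2(p)
    return H

for lam in (1, 10, 100):
    print(lam, round(poisson_entropy(lam), 4), round(0.5 * log2(2 * pi * e * lam), 4))
# The Gaussian-style formula is crude at lam = 1 but already very close at lam = 100.
```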
Task 1.7: Given the distribution
Answer
(see definition of entropy of a continuous distribution below).
Definition 1.6: The entropy of a continuous distribution with density $p(x)$ is given by

$$H(p) = \int p(x) \log_2 \frac{1}{p(x)}\, dx = -\int p(x) \log_2 p(x)\, dx.$$
Here, we have once again overloaded the symbol H for the case when the argument is a probability density function (PDF).
Task 1.8: Given two distributions
Answer
Task 1.9: What is the entropy of a normal distribution $\mathcal{N}(\mu, \sigma^2)$?
Answer

$H = \frac{1}{2}\log_2(2\pi e \sigma^2)$. Note that it does not depend on $\mu$: shifting a density does not change its entropy.
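The closed form is easy to cross-check with a Monte Carlo estimate of $-\mathbb{E}[\log_2 p(x)]$ (a NumPy sketch; $\mu$, $\sigma$, the seed, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Differential entropy is -E[log2 p(x)]; estimate the expectation by sampling.
log_pdf = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
mc_entropy = -np.mean(log_pdf) / np.log(2)

formula = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)
print(mc_entropy, formula)  # both ~3.047 bits for sigma = 2
```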
Task 1.10: Write down the formula for the entropy of an exponential distribution with density $p(x) = \lambda e^{-\lambda x}$, $x \ge 0$.
Task 1.11: A random variable $\xi$ equals $\xi_1$ with probability $w_1$ and $\xi_2$ with probability $w_2 = 1 - w_1$. Express $H(\xi)$ in terms of $H(\xi_1)$, $H(\xi_2)$, $w_1$, $w_2$.

Let the sets of values taken by $\xi_1$ and $\xi_2$ not intersect.

Answer

$$H(\xi) = -\sum_{x \in \mathrm{supp}\,\xi_1} w_1 p_1(x) \log_2\big(w_1 p_1(x)\big) - \sum_{x \in \mathrm{supp}\,\xi_2} w_2 p_2(x) \log_2\big(w_2 p_2(x)\big) = w_1 H(\xi_1) + w_2 H(\xi_2) + H(w_1).$$

The last equality here is possible only because the supports of $\xi_1$ and $\xi_2$ do not intersect: on each support the distribution of $\xi$ is simply $w_i$ times the distribution of $\xi_i$.

In this problem, the goal was to show that even in the simple case of non-overlapping supports, entropies do not simply add up with their respective weights, but there is an additional term $H(w_1) = w_1 \log_2\frac{1}{w_1} + w_2 \log_2\frac{1}{w_2}$.

The interpretation of the formula is as follows: when an outcome is measured, with a probability of $w_i$ it falls into the support of $\xi_i$, so the measurement also reveals which of the two variables was realized, and this indicator alone carries $H(w_1)$ bits on average.
From this it follows, in particular, that a mixture with coefficients $1/2$ of two normal random variables with the same variance but significantly different means has an entropy that is 1 bit greater than the entropy of a single such normal distribution. The supports of normal random variables span the entire real line and therefore overlap, but when the means are far apart this overlap can be neglected.
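The "+1 bit" claim can be confirmed by direct numerical integration on a dense grid (a sketch; means $0$ and $20$ with $\sigma = 1$ are "significantly different" enough):

```python
import numpy as np

sigma, d = 1.0, 20.0                 # widely separated means: 0 and d
xs = np.linspace(-10, 30, 400_001)
dx = xs[1] - xs[0]

def gauss(x, mu):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mix = 0.5 * gauss(xs, 0.0) + 0.5 * gauss(xs, d)
logp = np.log2(mix, where=mix > 0, out=np.zeros_like(mix))
h_mix = -np.sum(mix * logp) * dx                     # entropy of the mixture
h_one = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)  # entropy of one Gaussian
print(h_mix, h_one, h_mix - h_one)   # the difference is ~1.0 bit
```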
Task 1.12: A random variable $\xi$ is drawn with probability $1/2$ from $\mathcal{N}(0, \sigma^2)$ and with probability $1/2$ from $\mathcal{N}(d, \sigma^2)$. How much information about $\xi$ is in the message saying which of the two Gaussians it was drawn from, as a function of $d/\sigma$?
Answer
There is a numerical answer to this. It's clear that the graph starts at 0 and approaches 1:

[Plot: information in bits as a function of $d/\sigma$]
When $d = 0$, both Gaussians are identical, and the message about which one was chosen provides no information. When $d \gg \sigma$, we have a mixture of two Gaussian distributions with widely separated centers. The message then indicates which of the two "bell curves" the answer is in, reducing the number of possibilities by approximately half, so uncertainty decreases by 1 bit. The "approximately" is because the "curves" overlap, but the overlap shrinks quickly as $d$ grows. In the vicinity of $d = 0$, the information grows quadratically, roughly proportional to $(d/\sigma)^2$, and the approach to 1 is exponentially fast in $(d/\sigma)^2$.
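The whole curve from Task 1.12 can be traced with the same grid-integration trick as above (a sketch; $\sigma = 1$, and the grid bounds are a heuristic of mine):

```python
import numpy as np

def entropy_bits(pdf, xs):
    """Differential entropy of a density sampled on a uniform grid, in bits."""
    dx = xs[1] - xs[0]
    logp = np.log2(pdf, where=pdf > 0, out=np.zeros_like(pdf))
    return -np.sum(pdf * logp) * dx

def info(d, sigma=1.0):
    """Information (bits) in the message 'which of the two Gaussians was chosen'."""
    xs = np.linspace(-10 * sigma, d + 10 * sigma, 200_001)
    g = lambda mu: np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    mix = 0.5 * g(0.0) + 0.5 * g(d)
    h_one = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)
    return entropy_bits(mix, xs) - h_one

for d in (0.0, 0.5, 1.0, 2.0, 4.0, 8.0):
    print(d, round(info(d), 4))
# 0 at d = 0, quadratic growth for small d, saturation near 1 bit for d >> sigma.
```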
Task 1.13: The random variable
Task 1.14: The random variable
Task 1.15: Random variables
This problem can be formulated in the language of machine learning as follows: we have a categorical feature
Task 1.16: A random variable $\xi = a + \varepsilon$ is a measurement of an unknown quantity $a$ with noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. How much information about $a$ do $k$ independent measurements give, i.e., by how much does averaging them reduce the uncertainty?
Answer
The initial variance $\sigma^2$ is reduced by averaging: the mean of $k$ measurements has variance $\sigma^2/k$.

From the initial entropy $\frac{1}{2}\log_2(2\pi e \sigma^2)$ we move to $\frac{1}{2}\log_2(2\pi e \sigma^2/k)$, a decrease of $\frac{1}{2}\log_2 k$.

Thus, for large $k$, the information about the quantity grows as $\frac{1}{2}\log_2 k$ bits.

This means that the number of correct digits in the decimal representation of the measured quantity grows as $\frac{1}{2}\log_{10} k$: each additional correct decimal digit costs a hundredfold increase in the number of measurements.
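A simulation sketch of this rule (the true value, the noise level, and the simulation sizes are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
a, sigma = 3.14159, 1.0  # hypothetical true value and measurement noise

for k in (1, 100, 10_000):
    # 1,000 simulated experiments, each averaging k noisy measurements of a
    means = rng.normal(a, sigma, size=(1_000, k)).mean(axis=1)
    print(k, round(means.std(), 5), round(0.5 * np.log2(k), 2))
# The spread of the estimate falls as sigma/sqrt(k), i.e., the entropy of our
# knowledge of a drops by 0.5*log2(k) bits: each extra correct decimal digit
# requires roughly 100x more measurements.
```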
Continuation:
Part 2 - Mutual Information: In this part, the concept of Mutual Information is explained. It opens doors to error-correcting codes, compression algorithms, and provides a new perspective on Machine Learning problems.
Part 3 - ML & Mutual Information: Fundamentals of Machine Learning in the context of information theory.