
2. Information Theory + ML. Mutual Information

Original author: @greck

In Part 1, we became familiar with the concept of entropy.

In this part, I'm going to talk about Mutual Information, a concept that opens the door to error-correcting coding and compression algorithms, and also provides a fresh perspective on regression and Machine Learning tasks.

This is a necessary component for the transition to ML tasks in the next part, where we will treat them as problems of extracting mutual information between the features and the predicted variable. One way to explain the success of ML models is that they create a natural bottleneck, limited by a self-adjusting number of information bits, through which information about the input data is passed (distilled). But that's a topic for the next part.

Here, there will be three important visuals:

  • The first one is about visualizing the entropies of two random variables and their mutual information.

  • The second one is about understanding the concept of the dependency between two random variables and the fact that zero correlation doesn't necessarily mean independence.

  • And the third one is about how the capacity of an information channel has a straightforward geometric interpretation through the convexity measure of the entropy function.

We will also prove a simplified version of the Shannon–Hartley theorem on the maximum information rate of a noisy channel.

The material is quite complex, presented concisely, and is more like lecture notes. It is assumed that you will independently explore unclear points or ask me questions to clarify them in a more understandable and detailed manner.

2. Mutual Information

When you have two dependent variables, you can talk about how much information about one is contained in the other. The last tasks in Part 1 essentially revolved around this concept - Mutual Information between two random variables.

For example, let's consider a pair (w, h) = (person's_weight, person's_height). For simplicity, let's assume these are integers measured in kilograms and centimeters, with a finite number of possible values. In theory, we could gather data from 7 billion people and build a two-dimensional distribution for the pair (w, h) - the distribution of two dependent random variables. We can separately construct the distribution only for weight Pr(w = x) (ignoring height) and the distribution for height Pr(h = y) (ignoring weight). These two distributions are called marginal distributions for the joint distribution on the plane (x, y).

In this context, these marginal distributions are naturally referred to as prior distributions - they correspond to our knowledge of weight and height when we know nothing else about the person.

It's clear that information about a person's height will cause us to reconsider the distribution of weight. For example, the message "height = 2 meters 10 centimeters" will shift the weight distribution towards higher values. The new weight distribution after receiving this message is naturally called posterior. Accordingly, you can express the information gained in this message as the difference between the entropies of the prior and posterior distributions:

I_w(``h = 2.10") = H(\{\mathrm{Pr}(w = x)\}_x) - H(\{\mathrm{Pr}(w = x \mid h = 2.10)\}_x)

Here, the index outside the curly braces indicates the index over which we iterate inside the braces to obtain a list; if there are two indices, it's a matrix.

It's important to note that there's no guarantee that this value will be positive. It's possible to have a joint distribution (w, h) where the conditional distribution \{\mathrm{Pr}(w = x \mid h = 2.10)\}_x has higher entropy (uncertainty) than the marginal distribution \{\mathrm{Pr}(w = x)\}_x. However, on average, for dependent random variables, the value of I_w(``h = x") is positive; namely, its expected value is positive:

M_{I_w(``h = \cdot")} = \sum_x \mathrm{Pr}(h = x) \cdot I_w(``h = x") \ge 0

This quantity is naturally referred to as information about w in h. Interestingly, it turns out to be symmetric with respect to the permutation of the pair (w, h).

Definition 2.1: Mutual information of two random variables is

\mathrm{MI}(w, h) = \sum_x \mathrm{Pr}(h=x) \cdot I_{w}(``h=x")


\mathrm{MI}(w, h) = \sum_x \mathrm{Pr}(w=y) \cdot I_{h}(``w=y")


\mathrm{MI}(w, h) = H(w) + H(h) - H(\{w,h\})

These are three equivalent definitions. H(\{w,h\}) is the entropy of a discrete distribution whose values are not individual numbers but pairs of numbers \{y,x\}. We will prove the equivalence below.
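The equivalence of the three definitions can also be checked numerically. Below is a minimal sketch (the joint distribution P is an assumed toy example, not data from the text) that evaluates all three formulas on the same distribution:

```python
# Toy check of the three equivalent MI definitions (P is an assumed
# example). Rows index values of h (x), columns index values of w (y).
import math

def H(dist):
    """Shannon entropy in bits; zero probabilities are skipped."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

P = [[0.3, 0.1],
     [0.2, 0.4]]
R = [sum(row) for row in P]                              # marginal of h
Q = [sum(P[x][y] for x in range(2)) for y in range(2)]   # marginal of w

# Definition 1: average information about w in the messages "h = x"
mi1 = sum(R[x] * (H(Q) - H([P[x][y] / R[x] for y in range(2)]))
          for x in range(2))
# Definition 2: the same with the roles of w and h swapped
mi2 = sum(Q[y] * (H(R) - H([P[x][y] / Q[y] for x in range(2)]))
          for y in range(2))
# Definition 3: MI = H(w) + H(h) - H({w, h})
mi3 = H(Q) + H(R) - H([p for row in P for p in row])

print(mi1, mi2, mi3)  # all three agree
```
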

There is a visualization of the MI value:

The entropies of the random variables correspond to circles – green and reddish – with areas equal to H(w) and H(h) respectively, while the brown area of their intersection is precisely \mathrm{MI}(w,h).

Entropy as a measure

This visualization, on one hand, is merely an illustration emphasizing that entropy is a non-negative quantity and that \mathrm{MI}(w,h) is also non-negative and less than or equal to both H(w) and H(h). On the other hand, there are interesting results that allow us to construct a measure space in which a random variable corresponds to a subset, the union of subsets corresponds to the direct product of random variables (i.e., the union forms a pair), and the measure of a subset is precisely the entropy of the corresponding random variable.

By the way, in the context of machine learning, the image with green and red circles looks like this.

We are given some features, and need to predict the signal (target). For example, we need to forecast the air temperature tomorrow at 12:00 PM in the center of London with an accuracy of 0.5°C. Features can include data on the temperature in London and its surroundings for the past 10 years, the current date, current values of temperature, pressure, humidity, wind, as well as the position of the Moon and other data. It's a typical situation where the amount of information in the features is vast, but the target variable is low in entropy. The entire set of data in the features can be called a random variable "w," and the target variable is "h." These random variables have mutual information, and the essence of the prediction task is precisely to find this information within the features.

Let's describe the first expression in more detail:

\mathrm{MI}(w, h) = \sum_x \mathrm{Pr}(h=x) \cdot I_{w}(``h=x")\\= \sum_x \mathrm{Pr}(h=x) \cdot ( H(\{\mathrm{Pr}(w=y)\}_y) - H(\{\mathrm{Pr}(w=y|h=x)\}_y)) \\ =   H(\{\mathrm{Pr}(w=y)\}_y) - \sum_x \mathrm{Pr}(h=x) \cdot H(\{\mathrm{Pr}(w=y|h=x)\}_y)\\ = H(w) -  H( w | h)

The notation H(w | h) is simply a shorthand for

\sum_x P(h=x) \cdot H(\{\mathrm{Pr}(w=y|h=x)\}_y)

and is called conditional entropy.

For independent random variables,

I_w(``h=x") = H(\{\mathrm{Pr}(w=y)\}_y) - H(\{\mathrm{Pr}(w=y|h=x)\}_y) = 0

because, by the definition of independence, \mathrm{Pr}(w=y) = \mathrm{Pr}(w=y | h=x) for any x and y. Therefore, for independent random variables, the mutual information is equal to 0.

It turns out that the converse is also true: the statement \mathrm{MI}(w, h) = 0 is equivalent to the independence of the random variables. A similar statement would not hold for the correlation of two random variables.

To see the equivalence of the MI definitions, it's convenient to introduce the following notations:

  • P = \{ p_{x,y} \}_{x,y} — probabilities that height and weight are equal to (x, y).

  • R = \{ r_{x} \}_x = \{\sum_y p_{x,y}\}_x — probabilities that height equals x (marginal distribution of height).

  • Q = \{ q_{y} \}_y = \{\sum_x p_{x,y}\}_y — probabilities that weight equals y (marginal distribution of weight).

Let's assume that all these numbers are non-zero.

Firstly, notice that \mathrm{Pr}(w=y | h=x) = p_{x, y} / r_x.

Then, by making substitutions and simple transformations, we obtain the equivalence between the first and third definitions:

\mathrm{MI}(w, h) =\\=\sum_x \mathrm{Pr}(h=x)  \cdot (H(\{q_y\}_y)  - H(\{\mathrm{Pr}(w=y | h=x)\}_y )) = \\ \ \ = \sum_x r_x \cdot (H(Q)  - \sum_y (p_{x,y}/r_x) \cdot \log (r_x/p_{x,y})) = \\ \ \ = H(Q) -  \sum_{x,y} r_x \cdot  (p_{x,y} / r_x) \cdot\log( r_x / p_{x,y}) = \\ \ \ = H(Q) -  \sum_{x,y} p_{x,y} \log (r_x / p_{x,y}) = \\ \ \ = H(Q) +  \sum_{x,y} p_{x,y} \log (1 / r_x) - \sum_{x,y} p_{x,y} \log (1 /p_{x,y}) = \\ \ \ = H(Q) + \sum_{x} r_x \log (1 / r_x) - \sum_{x,y} p_{x,y} \log (1/p_{x,y}) = \\ \ \ = H(Q) + H(R) - H(P)

We leave the consideration of the cases where some of the probabilities are zero to the textbooks.

Task 2.1: Provide an example of random variables for which the correlation is zero, but MI is not zero.

Task 2.2: Two random variables were measured multiple times, and points were plotted on a plane. Which pictures correspond to dependent random variables, and which ones correspond to independent random variables? For which of them is the correlation between x and y equal to 0?

Measurement results of two random variables.
On which of the 12 pictures are these two variables dependent?

Dependent: 3rd, 4th, 5th, 8th, 11th, 12th.

Correlation is zero for all except the 4th, 5th, 8th, and 12th.

Task 2.3: Calculate MI(w="number is divisible by a=6", h="number is divisible by b=15"). It is assumed that we randomly select one of the natural numbers, and all numbers are equally likely. To avoid dealing with the concept of a uniform distribution over natural numbers, consider that we are randomly selecting a number from the set {1, ..., 30}. Prove that if you take coprime numbers a and b, then MI=0.


We know the marginal distributions: R = {5/6, 1/6} and Q = {14/15, 1/15}. Furthermore, we know that the probability of being divisible by both 6 and 15 is 1/30. From this, we can derive the joint probability matrix:

P = \|p_{xy}\| =\\=\left\|\begin{array}{cc}  5/6 - (1/15 - 1/30)   & 1/6 - 1/30 \\ 1/15 - 1/30 & 1/30\end{array}\right\|=\\=\left\|\begin{array}{cc}  4/5  & 2/15\\ 1/30  & 1/30\end{array}\right\|

We use the formula \mathrm{MI} = H(w) + H(h) - H(\{w,h\}) and get \mathrm{MI} = 0.0311278.
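The arithmetic above is easy to reproduce by enumerating {1, ..., 30} directly; a short sketch of that check:

```python
# Reproducing Task 2.3 numerically: a number is drawn uniformly from
# {1, ..., 30}; w = "divisible by 6", h = "divisible by 15".
import math
from collections import Counter

def H(dist):
    """Shannon entropy in bits; zero probabilities are skipped."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

joint = Counter((n % 6 == 0, n % 15 == 0) for n in range(1, 31))
joint = {pair: c / 30 for pair, c in joint.items()}

w_marg, h_marg = Counter(), Counter()
for (div6, div15), p in joint.items():
    w_marg[div6] += p
    h_marg[div15] += p

mi = H(w_marg.values()) + H(h_marg.values()) - H(joint.values())
print(round(mi, 7))  # 0.0311278
```
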

Task 2.4: Prove that there is another equivalent definition

\mathrm{MI}(w, h) = D_{KL}( \{\mathrm{Pr}(w = x, h = y)\}_{x,y}, \;\{\mathrm{Pr}(w=x) \cdot \mathrm{Pr}(h=y) \}_{x,y}),

which means that MI measures how much less efficient a Huffman code for the stream of pairs \{w,h\} is when constructed under the assumption of independence (i.e., assuming the joint distribution equals the product of the marginals), compared to a Huffman code built from the actual joint distribution. It is measured in saved bits per symbol (a symbol being one observation of the pair \{w,h\}).
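This equivalence is also easy to verify numerically; here is a sketch on a small assumed toy distribution (my example, not from the text):

```python
# Sketch of Task 2.4's claim on an assumed joint distribution:
# D_KL(joint || product of marginals) equals H(w) + H(h) - H({w, h}).
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

P = [[0.3, 0.1],
     [0.2, 0.4]]                                         # Pr(h=x, w=y)
R = [sum(row) for row in P]                              # marginal of h
Q = [sum(P[x][y] for x in range(2)) for y in range(2)]   # marginal of w

kl = sum(P[x][y] * math.log2(P[x][y] / (R[x] * Q[y]))
         for x in range(2) for y in range(2))
mi = H(R) + H(Q) - H([p for row in P for p in row])
print(kl, mi)  # the same number, in bits per observed pair
```
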

We have been operating in the realm of discrete distributions so far. Transitioning to continuous variables is straightforward.

Let's recall task 1.7. Suppose we have real random variables (w, h), but we discretize them: the first into bins of size \varepsilon_1, and the second into bins of size \varepsilon_2. Substituting into the expression

H(\{q_y\}_y) + H(\{r_x\}_x) - H(\{p_{x,y}\}_{x,y})

an approximate formula instead of H (see task 1.7)

\log(1/\varepsilon) - \int_{-\infty}^{+\infty} \rho(x)\log(\rho(x))dx

for the entropy of a discretized continuous distribution, we get that the first term contributes \log(1/\varepsilon_1), the second term contributes \log(1/\varepsilon_2), and the third (subtracted) term contributes \log(1/(\varepsilon_1 \cdot \varepsilon_2)), so these three pieces cancel each other out.

Thus, mutual information for continuous variables can be naturally defined as the limit of MI for discretized random variables as the bin sizes approach zero.

Definition 2.2: The mutual information of two continuous random variables is given by

\mathrm{MI}(h,w) = H(\rho_h) + H(\rho_w) - H(\rho_{(h,w)}).

Here, we use the H function from Definition 1.6 in Part 1.

Task 2.5: Estimate \mathrm{MI}(predict, target) for any forecasting task you have. Calculate \mathrm{MI}(predict, target) / H(target). How close is this ratio to 1?

Estimating MI between two random variables from a set of their measurements essentially boils down to calculating the joint distribution matrix \|p_{xy}\| for discretized values of these random variables. The more bins in the discretization, the smaller the inaccuracy associated with discretization, but the fewer observations available for an accurate estimate of p_{xy}.
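A crude plug-in estimator along these lines can be sketched in a few lines. This is my own illustration, not code from the article; the dependent pair w = h + noise anticipates Problem 2.8 below, where the true continuous MI is 0.5 bits:

```python
# A crude plug-in MI estimator (in bits) from paired samples: discretize
# both variables into equal-width bins and apply the discrete MI formula.
import math
import random
from collections import Counter

def mi_binned(xs, ys, bins=10):
    def idx(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)
    xlo, xhi, ylo, yhi = min(xs), max(xs), min(ys), max(ys)
    n = len(xs)
    joint, px, py = Counter(), Counter(), Counter()
    for x, y in zip(xs, ys):
        i, j = idx(x, xlo, xhi), idx(y, ylo, yhi)
        joint[i, j] += 1
        px[i] += 1
        py[j] += 1
    # sum over occupied cells of p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
    return sum(c / n * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

random.seed(0)
h = [random.gauss(0, 1) for _ in range(20000)]
w = [v + random.gauss(0, 1) for v in h]          # dependent: w = h + noise
z = [random.gauss(0, 1) for _ in range(20000)]   # independent of h

print(mi_binned(h, w))  # noticeably positive (the true continuous MI is 0.5 bits)
print(mi_binned(h, z))  # near zero, up to a small positive discretization bias
```

Note the trade-off described above: the independent pair still gets a small positive estimate (the plug-in bias grows with the number of bins), while too few bins would understate the MI of the dependent pair.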

The estimation of MI from observed measurements is a separate extensive topic, and we will return to it in problem 3.1.

Problem 2.6: Prove that for any strictly monotonic function f, it holds that \mathrm{MI}(w,h) = \mathrm{MI}(f(w), h).


Let w_f = f(w). Consider two formulas:

\mathrm{MI}(w, h) = H(h) - H(h | w) =\\= H(h) - \int \rho(y)\, H(h | w = y)\, dy

\mathrm{MI}(w_f, h) = H(h) - H(h | w_f) =\\= H(h) - \int \rho_f(z)\, H(h | w_f = z)\, dz

The first terms in both formulas are identical, and the subtracted terms are equal, since their physical interpretation is the average value of H(h'), where h' are the random variables obtained from h by fixing the value of w or, equivalently, by fixing the value of w_f. The averaging is done over the distribution of values of w or w_f, and it does not matter which.

Now, look at the integrals. The values H(h | w = y) and H(h | w_f = f(y)) are identical, and in both formulas you average this quantity over the same probability measure \mu, which is just parameterized differently: d\mu = \rho(y)\,dy = \rho_f(z)\,dz.

Definition 2.3: An information channel is a Markov chain of length 2 that defines the dependency between two related random variables, one of which is called the input and the other the output. Often, when referring to an information channel, only the transition probability matrix T=\|t_{y,x}\|=\|\{\mathrm{Pr}(output = y \mid input = x)\}_{y,x}\| is considered, without specifying the input distribution. We will denote the input distribution as a vector R = \|\{r_x\}_x\| = \|\{\mathrm{Pr}(input=x)\}_x\|, and the distribution of output values as a vector Q = \|\{q_y\}_y\| = \|\{\mathrm{Pr}(output=y)\}_y\|.

For discrete distributions, for each of the M possible inputs, we have a distribution over N possible output values. In essence, we have a matrix with M columns and N rows, where all numbers are non-negative, and the sum of numbers in each column equals 1.

Such matrices are called stochastic matrices; more precisely, left stochastic matrices.

About left and right stochastic matrices:

In left stochastic matrices, the sum of numbers in each column equals 1, while in right stochastic matrices, it's the sum of numbers in each row that equals 1. There are also doubly stochastic matrices, where both columns and rows sum to 1. I have chosen the variant where the vector of input probabilities is considered as a column vector, and to obtain the vector of output probabilities, you perform the standard matrix multiplication: Q  = T \cdot R.

In English-language literature, it's common to consider the probability vector as a row vector, transpose the transition matrix (making it a right stochastic matrix), and perform matrix multiplication from the left: Q=R\cdot T. I'm aware that this might be unfamiliar to many, so I've opted for the column vectors and left stochastic transition matrices.

Thus, an information channel is essentially a left stochastic matrix, denoted as T. By specifying the distribution r_x = \mathrm{Pr}(input=x) at the input, you obtain the distribution q_y = \mathrm{Pr}(output=y) at the output simply by multiplying the matrix T = \|t_{y,x}\| by the vector R:

q_y = \sum_x t_{yx} \cdot r_x
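As a toy illustration (the numbers in T and R are assumed, not from the text), this is plain matrix-vector multiplication:

```python
# Toy left stochastic channel (assumed numbers): column x of T is the
# conditional distribution of the output given input = x.
T = [[0.9, 0.2],   # Pr(out=0 | in=0), Pr(out=0 | in=1)
     [0.1, 0.8]]   # Pr(out=1 | in=0), Pr(out=1 | in=1)
R = [0.5, 0.5]     # input distribution

# q_y = sum_x t_{y,x} * r_x
Q = [sum(T[y][x] * R[x] for x in range(2)) for y in range(2)]
print(Q)  # ≈ [0.55, 0.45]; a stochastic matrix maps distributions to distributions
```
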

The capacity C(T) of an information channel T is defined as the maximum value of \mathrm{MI}(input, output) achievable for some input distribution R=\{r_x\}_x.

An information channel for transmitting bits, where 1 turns into 0 with probability \varepsilon_1, and 0 turns into 1 with probability \varepsilon_0, is defined by the matrix:

T = \left\|\begin{array}{cc} 1-\varepsilon_0 &  \varepsilon_1\\ \varepsilon_0 & 1-\varepsilon_1 \end{array}\right\|

Task 2.7: Let's consider an information channel with noise where 1 transforms into 0 with probability \varepsilon_1, and 0 transforms into 1 with probability \varepsilon_0. What is the value of \mathrm{MI}(input, output) when the input is a random bit with distribution R=\{0.5, 0.5\}? What input distribution maximizes \mathrm{MI}(input, output), and what is the maximum value? In other words, what is the capacity of this channel?


Here, it is convenient to use the formula:

\mathrm{MI} =  H(Q) - \sum_x r_x \cdot H( \{\mathrm{Pr}(output = y \mid input = x)\}_y ).

We have:

  • Input distribution:
    R = \{r_0, r_1\}

  • Output distribution:
    Q = \{q_0, q_1\}  = \{r_0 \cdot (1 - \varepsilon_0) +  r_1 \cdot \varepsilon_1,  r_1 \cdot (1 - \varepsilon_1) +  r_0 \cdot \varepsilon_0\}

  • Joint distribution, i.e., distribution over pairs (input, output): P =\{\{r_0 \cdot (1 - \varepsilon_0),   r_1 \cdot \varepsilon_1\}, \{r_1 \cdot (1 - \varepsilon_1),   r_0 \cdot \varepsilon_0 \}\}

  • And two conditional distributions:

    • \{\mathrm{Pr}(output = y|\; input = 0)\}_y = \{1 - \varepsilon_0, \varepsilon_0 \},

    • \{\mathrm{Pr}(output = y|\; input = 1)\}_y = \{\varepsilon_1, 1 - \varepsilon_1 \}.

Using the formula for MI, we get:

MI = H(Q) - r_0 \cdot H(\{1 - \varepsilon_0, \varepsilon_0 \}) - r_1\cdot H(\{\varepsilon_1, 1 - \varepsilon_1\})

This formula is nothing but a measure of the convexity of the function f(x) = H(\{x, 1-x\}) (aka JSD, see below) on the interval x \in [\varepsilon_1, 1 - \varepsilon_0]. In other words, it's the value of this function at a convex combination of the abscissas of the endpoints of this interval minus the same convex combination of the values of the function at the endpoints (the ordinates).

The figure shows the function f(x) = H(\{x, 1-x\}). The interval KL is divided by point A in the ratio r_0 : r_1. The length of segment AB is the value of MI.

The maximum length of the segment is reached at the point where the derivative f' equals the slope of the interval.
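This maximization is easy to carry out numerically. Below is a sketch with assumed noise levels \varepsilon_0 = 0.1, \varepsilon_1 = 0.2 (my example values, not from the text):

```python
# Numerical sketch of Task 2.7: scan the input distribution r_0 to find
# the maximum of MI, i.e. the channel capacity. Noise levels are assumed.
import math

def h2(p):
    """Binary entropy H({p, 1-p}) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps0, eps1 = 0.1, 0.2   # 0 flips with prob eps0; 1 flips with prob eps1

def mi(r0):
    q1 = r0 * eps0 + (1 - r0) * (1 - eps1)   # Pr(output = 1)
    return h2(q1) - r0 * h2(eps0) - (1 - r0) * h2(eps1)

best_r0 = max((i / 10000 for i in range(10001)), key=mi)
# For an asymmetric channel the optimal input is slightly off 0.5.
print(mi(0.5), best_r0, mi(best_r0))
```
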

Generalizing problem 2.7 to the multidimensional case, we get:

Statement 2.1: The channel capacity is determined by the convexity measure of the function f(x) = H(\{x_1, x_2, \ldots, x_k\}) on a simplex. This simplex is the image of the stochastic transformation defined by the information channel. In particular, the capacity is the maximum value of the difference between f evaluated at a convex combination of the vertices of the simplex (the vertices are the columns of the stochastic matrix) and the same convex combination of the values of f at those vertices. The maximum is taken over all possible weights that define the convex combination. In the definition below, the weights are denoted \{\pi_1, \ldots, \pi_n\}, and the vertices of the simplex are denoted \{P_1, P_2, \ldots, P_n\}.

Interestingly, the channel capacity can be viewed as a measure of the volume enclosed within the simplex, defined by the set of conditional distributions as its vertices.

Definition 2.4: Let there be several distributions \{P_1, P_2, \ldots, P_n\} over the same set of values. The Jensen–Shannon Divergence (JSD) of this set with weights \{\pi_1, \ldots, \pi_n\} is computed as \mathrm{MI}(input, output) for:

R=\{\pi_1, \ldots,\pi_n\}, \\ T=\|P_1,P_2,\ldots, P_n\|


{\rm JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) =\\= H\left(\sum_{i=1}^n \pi_i P_i\right) - \sum_{i=1}^n \pi_i H(P_i).
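A direct transcription of this definition as a sketch (the example distributions are assumed):

```python
# JSD per Definition 2.4: entropy of the mixture minus the mixture of
# entropies, both in bits.
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def jsd(weights, dists):
    k = len(dists[0])
    mixture = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(k)]
    return H(mixture) - sum(w * H(d) for w, d in zip(weights, dists))

# Two maximally different distributions: JSD reaches 1 bit.
print(jsd([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))   # 1.0
# Identical distributions tell us nothing about the input: JSD = 0.
print(jsd([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]]))   # 0.0
```
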

The more the distributions differ from each other, the greater the JSD. The maximum

\max_{\pi_1, \ldots, \pi_n} {\rm JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n)  = C(T)

is the channel capacity of channel T=\|P_1,P_2,\ldots, P_n\|.

Problem 2.8: Let there be two dependent normal random variables (w, h) with variances \sigma_w^2 and \sigma_h^2, respectively, where the first is obtained from the second by adding independent normal noise with variance \sigma_n^2: w = h + noise. Then \sigma_w^2 = \sigma_h^2 + \sigma_n^2. What is the mutual information \mathrm{MI}(w, h)?


The entropies of w and h are as follows:

{1 \over 2}\cdot (1+\log(2\pi (\sigma_h^2 + \sigma_n^2)))\;\;\;\mathrm{ and } \;\;\;\;{1 \over 2}\cdot (1+\log(2\pi \sigma_h^2))

The entropy of the pair of real random variables \{w, h\} is:

H(\rho_{(w,h)})={1\over 2}(1 + \log(2\pi \sigma_h^2)) + {1\over 2}(1 + \log(2\pi \sigma_n^2))

The last expression is obtained by summing the entropies of the two-dimensional distribution of the pair \{h, noise\} (the entropy of a pair of independent variables is the sum of their entropies). The entropy of the pair \{h, w\} = \{h, h+noise\} is equal to the entropy of the pair \{h, noise\}, because a linear transformation of a vector of random variables by a matrix A results in a vector with the same entropy plus \log(\det(A)) (you can prove this!). The determinant of our linear transformation is equal to 1.

Explanation about the matrix: here, matrix A corresponds to the array {{1,0},{1,1}}:

\left\|\begin{array}{l}h\\w\end{array}\right\|=\left\|\begin{array}{l}1\;\;0\\1\;\;1\end{array}\right\| \cdot \left\|\begin{array}{l}h\\noise\end{array}\right\|

and its determinant is 1. The value of the determinant \det(A) is the factor by which the volume of a cube in n-dimensional space changes (in our case, n=2). The term \log(\det(A)) arises here for the same reason that the term \log(1/\varepsilon) arises in the entropy formula in task 1.7 for discretized real random variables. End of explanation.

Finally, using the formula \mathrm{MI}(h,w) = H(\rho_h) + H(\rho_w) - H(\rho_{(h,w)}), we get:

\mathrm{MI}(w, h) = {1 \over 2}\log(1+ {\sigma_h^2 \over \sigma_n^2})

Note 1: When the variance (power) of the noise is equal to the variance (power) of the signal h, each measurement of w contains 0.5 bits of information about the value of h.

Note 2: In essence, this result is the Shannon–Hartley theorem in a simplified form.
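A quick numeric check of the result (taking the logarithm to base 2, so the answer is in bits):

```python
# MI of a Gaussian signal observed through additive Gaussian noise:
# MI = (1/2) * log2(1 + sigma_h^2 / sigma_n^2), in bits per measurement.
import math

def mi_gauss(var_signal, var_noise):
    return 0.5 * math.log2(1 + var_signal / var_noise)

print(mi_gauss(1.0, 1.0))  # 0.5 bits, as in Note 1
print(mi_gauss(3.0, 1.0))  # 1.0 bit at a signal-to-noise ratio of 3
```
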

Task 2.9: Suppose there are two dependent random variables (w, h_k), where w follows an exponential distribution with parameter λ, and h_k = floor(kw) / k. What is the mutual information \mathrm{MI}(w, h_k)? In other words, how much information is preserved when rounding an exponential random variable to a certain decimal place? Compare \mathrm{MI}(w, h_{10}) and \mathrm{MI}(w, h_{100}).


This is a comprehension task: the second random variable is uniquely determined by the first, which means the mutual information is simply equal to the entropy of the second variable. The distribution of the second random variable is a geometric progression with q=\lambda^{1/k}, and its entropy is H(\lambda^{1/k})/(1 - \lambda^{1/k}) (see task 1.4).

Task 2.10: Let's consider two dependent random variables (w, h_k), where w follows a beta distribution with parameters (\alpha, \beta), and h_k is sampled from a binomial distribution with parameters (p, n) = (w, k). What is the mutual information \mathrm{MI}(w, h_k)? In other words, how many bits of information about the true click-through rate (CTR) of an advertisement can be extracted on average from the click statistics over k impressions?

Task 2.11: Two dependent random variables (w, h) are generated as follows – first, a random variable λ is sampled from an exponential distribution with mean λ_0, and then two Poisson random variables (w, h) are sampled with parameter λ. What is the mutual information \mathrm{MI}(w, h)? One possible interpretation of this problem: what is the mutual information between the number of sales in one week and the number of sales in another week for a certain unknown product?

The last tasks are an example of how dependencies between random variables can be modeled using graphical probabilistic models, particularly Bayesian networks.

Task 2.12: The random variable target is obtained from independent random variables \{f_1, \ldots, f_n, noise\}, each with distribution \mathcal{N}(0,1), using the formula:

target=\sum_i w_i\cdot f_i+w_{noise}\cdot noise

The constant weights \{w_1, \ldots, w_n\} are unknown to you, but your prior knowledge about them is that they were independently sampled from \mathcal{N}(0,1). You decide to estimate the weights \{\hat{w_i}\}_i using classical regression methods and make a prediction:

predict = \sum_i \hat{w_i}\cdot f_i

How will the mean squared error mse = M_{(target - predict)^2} and the value of \mathrm{MI}(predict, target) change as the size of the training dataset grows? Analyze the answer for the case where w_{noise}=0.
