Measuring Information

CMPy implements many measures and functions from information theory. Most of the measures below are discussed in further detail in [cover2006].


Entropy

Entropy is an uncertainty measure defined by H(X) = -\sum p_i \log_2(p_i). Intuitively, the entropy gives the average number of yes/no questions that must be asked to determine which event occurred from a distribution.


This entropy function can calculate the joint entropy if the distribution passed to it is a joint distribution.

For example, the average number of yes/no questions that must be asked to determine which face a die shows:

>>> die = Distribution({1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6})
>>> loaded_die = Distribution({1: 3/24, 2: 6/24, 3: 3/24, 4: 3/24, 5: 6/24, 6: 3/24})
>>> die.entropy()
>>> loaded_die.entropy()
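As an independent check, the same numbers fall out of a few lines of plain Python (a sketch using only the standard library; the `H` helper is ours, not part of CMPy's API):

```python
from math import log2

def H(p):
    # Shannon entropy in bits; zero-probability events contribute nothing
    return -sum(x * log2(x) for x in p if x > 0)

fair = [1/6] * 6
loaded = [3/24, 6/24, 3/24, 3/24, 6/24, 3/24]

print(H(fair))    # log2(6), about 2.585 bits
print(H(loaded))  # exactly 2.5 bits
```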

So we see that biasing the die along the 2-5 axis reduces the entropy. We can also look at the length-3 block entropy of the even process:

>>> words = Distribution({ '000': 1/12,
...                        '001': 1/12,
...                        '011': 1/6,
...                        '100': 1/12,
...                        '101': 1/12,
...                        '110': 1/6,
...                        '111': 1/3,
...                       })
>>> words.entropy()

Interestingly enough, the decidedly skewed distribution of the even process's length-3 words contains the same amount of entropy, or uncertainty, as a fair die.
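This coincidence can be confirmed by hand; summing -p \log_2(p) over the seven words gives exactly log2(6) bits (again a plain-Python sketch, not CMPy code):

```python
from math import log2

def H(p):
    # Shannon entropy in bits
    return -sum(x * log2(x) for x in p if x > 0)

# length-3 word distribution of the even process
words = [1/12, 1/12, 1/6, 1/12, 1/12, 1/6, 1/3]

print(H(words))  # about 2.585 bits, exactly log2(6)
```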

Entropy applies to joint distributions over several variables as well. For example, consider the RRXOR distribution, with X = \{\frac{1}{2}, \frac{1}{2}\}, Y = \{\frac{1}{2}, \frac{1}{2}\}, and Z = X \oplus Y:

>>> rrxor = Distribution({'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4})
>>> rrxor.entropy()
>>> rrxor.marginal([0]).entropy()
>>> rrxor.marginal([1]).entropy()
>>> rrxor.marginal([2]).entropy()

so we see that the entire distribution contains two bits of information, and that each variable when taken on its own seems to have one bit of information.
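The joint and marginal entropies can likewise be reproduced without CMPy; the `marginal` helper below is a hypothetical stand-in for the marginalization CMPy performs:

```python
from math import log2
from collections import defaultdict

rrxor = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

def H(p):
    # Shannon entropy in bits
    return -sum(x * log2(x) for x in p if x > 0)

def marginal(dist, indices):
    # project a joint distribution over words onto the given positions
    out = defaultdict(float)
    for word, p in dist.items():
        out[''.join(word[i] for i in indices)] += p
    return dict(out)

print(H(rrxor.values()))                     # 2.0 bits in the joint distribution
for i in range(3):
    print(H(marginal(rrxor, [i]).values()))  # 1.0 bit per variable
```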


Perplexity

Perplexity is an uncertainty measure defined by 2^{H(X)} = \prod p_i^{-p_i}. It is approximately the size of the typical set; that is, it is approximately the number of sequences you are likely to see.

Let us revisit the loaded die example from before:

>>> die = Distribution({1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6})
>>> loaded_die = Distribution({1: 3/24, 2: 6/24, 3: 3/24, 4: 3/24, 5: 6/24, 6: 3/24})
>>> die.perplexity()
>>> loaded_die.perplexity()

So we see that the loaded die has a smaller typical set, or effective alphabet size, than the fair die.
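Perplexity is just as easy to compute by hand (a standard-library sketch; `perplexity` here is our own helper, not CMPy's):

```python
from math import log2

def perplexity(p):
    # 2**H(X): the effective alphabet size
    return 2 ** -sum(x * log2(x) for x in p if x > 0)

fair = [1/6] * 6
loaded = [3/24, 6/24, 3/24, 3/24, 6/24, 3/24]

print(perplexity(fair))    # ~6: every face is equally typical
print(perplexity(loaded))  # 2**2.5, about 5.66 effective faces
```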

Conditional Entropy

Conditional Entropy is the entropy remaining in a random variable given information about another random variable.

H(X|Y) = H(X, Y) - H(Y) = H(X) - I(X; Y)

Going back to our RRXOR example, let us now look at how much information the variables conditionally contain:

>>> rrxor.conditional_entropy([0,1])
>>> rrxor.conditional_entropy([0,2])
>>> rrxor.conditional_entropy([1,2])

that is, given any two variables, the third contains no information of its own. This shows us that any information in this system must be shared, none is kept to a single random variable.
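The same three zeros follow from the identity H(X|Y) = H(X, Y) - H(Y), sketched here in plain Python (helpers are ours, not CMPy's API):

```python
from math import log2
from collections import defaultdict

rrxor = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

def H(p):
    return -sum(x * log2(x) for x in p if x > 0)

def marginal(dist, indices):
    out = defaultdict(float)
    for word, p in dist.items():
        out[''.join(word[i] for i in indices)] += p
    return dict(out)

def cond_H(dist, given):
    # H(rest | given) = H(joint) - H(given)
    return H(dist.values()) - H(marginal(dist, given).values())

for given in ([0, 1], [0, 2], [1, 2]):
    print(cond_H(rrxor, given))  # 0.0 every time
```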

Cross Entropy

Cross entropy is approximately the amount of information needed to encode samples from X if one uses a code optimized for the distribution Y.

xH(X, Y) = -\sum p_i \log_2(q_i) = H(X) + D(X||Y)

From the rightmost form, we can interpret it as the inherent uncertainty in X plus a penalty accrued for using Y instead of X.
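For instance, encoding a fair coin with a code built for a 90%-heads coin (a standard-library sketch, not CMPy's API):

```python
from math import log2

def cross_entropy(p, q):
    # -sum p_i log2 q_i = H(p) + D(p||q)
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

fair = [1/2, 1/2]
bias = [1/10, 9/10]

print(cross_entropy(fair, fair))  # 1.0 bit: the right code carries no penalty
print(cross_entropy(fair, bias))  # about 1.737 bits: 1 bit of entropy plus the penalty
```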

Relative Entropy

Relative entropy, also known as the Kullback-Leibler divergence, is a non-metric measure of the distance between two distributions. Notably, it is not symmetric, nor does it satisfy the triangle inequality. The relative entropy arises naturally in many information-theoretic situations.

D(X||Y) = \sum p_i \log_2(\frac{p_i}{q_i}) = xH(X, Y) - H(X)

As an example, the likelihood of flipping a fair coin 100 times and seeing 90 heads can be approximated easily using the relative entropy:

>>> fair = Distribution({0: 1/2, 1: 1/2})
>>> bias = Distribution({0: 1/10, 1: 9/10})
>>> 2**(-100*bias.relative_entropy(fair))

Whereas the likelihood of seeing a 50/50 outcome from a coin biased to a 90% likelihood of heads would be

>>> 2**(-100*fair.relative_entropy(bias))
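The two exponents can be computed directly; this plain-Python sketch mirrors the calculation above without CMPy:

```python
from math import log2

def D(p, q):
    # Kullback-Leibler divergence in bits
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair = [1/2, 1/2]
bias = [1/10, 9/10]

# chance, to first order in the exponent, of 90 heads in 100 fair flips
print(2 ** (-100 * D(bias, fair)))
# chance of a 50/50 split from the 90%-heads coin: smaller still
print(2 ** (-100 * D(fair, bias)))
```

Note the asymmetry: D(bias||fair) and D(fair||bias) differ, so the two rare events have very different likelihoods.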

Mutual Information

Mutual Information is the amount of information shared by two random variables.

I(X; Y) = \sum p(x,y) \log_2(\frac{p(x,y)}{p(x)p(y)}) = D(p(x,y)||p(x)p(y)) =
H(X) - H(X|Y)

Going back to our RRXOR example, we have already seen that each variable contains a total of one bit, but that they don’t keep that bit all to themselves. Therefore that bit must be shared, right?:

>>> rrxor.mutual_information([0], [1])
>>> rrxor.mutual_information([0], [2])
>>> rrxor.mutual_information([1], [2])

Uh oh, what’s going on? None of the variables keeps its bit to itself, but none of them shares anything with any single other variable either.
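The pairwise identity I(A; B) = H(A) + H(B) - H(A, B) reproduces these zeros (standalone sketch; the helpers are ours, not CMPy's):

```python
from math import log2
from collections import defaultdict

rrxor = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

def H(p):
    return -sum(x * log2(x) for x in p if x > 0)

def marginal(dist, indices):
    out = defaultdict(float)
    for word, p in dist.items():
        out[''.join(word[i] for i in indices)] += p
    return dict(out)

def I(dist, a, b):
    # I(A; B) = H(A) + H(B) - H(A, B)
    return (H(marginal(dist, a).values()) + H(marginal(dist, b).values())
            - H(marginal(dist, a + b).values()))

for a, b in (([0], [1]), ([0], [2]), ([1], [2])):
    print(I(rrxor, a, b))  # 0.0 for every pair
```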

Conditional Mutual Information

Conditional mutual information is the amount of information shared by two random variables given knowledge of a third.

I(X;Y|Z) = I(X;Y) - I(X;Y;Z)

Let’s revisit RRXOR once again:

>>> rrxor.conditional_mutual_information([0], [1], [2])
>>> rrxor.conditional_mutual_information([0], [2], [1])
>>> rrxor.conditional_mutual_information([1], [2], [0])

Now we’re getting somewhere! So they are sharing information, but only conditionally. We’ve found our three bits, but that’s still too much – the entire distribution only has two bits of information. We’re missing one more piece.
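The identity I(A; B | C) = H(A, C) + H(B, C) - H(C) - H(A, B, C) recovers the same bits (plain-Python sketch with hypothetical helpers, not CMPy's API):

```python
from math import log2
from collections import defaultdict

rrxor = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

def H(p):
    return -sum(x * log2(x) for x in p if x > 0)

def marginal(dist, indices):
    out = defaultdict(float)
    for word, p in dist.items():
        out[''.join(word[i] for i in indices)] += p
    return dict(out)

def cond_I(dist, a, b, c):
    # I(A; B | C) = H(A,C) + H(B,C) - H(C) - H(A,B,C)
    return (H(marginal(dist, a + c).values()) + H(marginal(dist, b + c).values())
            - H(marginal(dist, c).values()) - H(marginal(dist, a + b + c).values()))

for a, b, c in (([0], [1], [2]), ([0], [2], [1]), ([1], [2], [0])):
    print(cond_I(rrxor, a, b, c))  # 1.0 in every case
```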

Multivariate Mutual Information

Multivariate mutual information is implemented in the most natural way, as the coinformation. In a nutshell, it is the amount of information shared among all of the variables; in an analogy to set theory, it is \bigcap X_i, so if even one of the variables is completely independent of all the others, the multivariate mutual information will be zero.

I(X_1; X_2;\ldots;X_n) = -\sum_{T\subseteq \{1,2,\ldots,n\}} (-1)^{|T|}H(T)

Let’s look at the RRXOR example one last time:

>>> rrxor.mutual_information([0], [1], [2])

And there it is!

Further information about the coinformation can be found in [bell2003].
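The inclusion-exclusion sum above can be evaluated directly (standalone sketch; `coinformation` and the helpers are ours, not CMPy's API):

```python
from math import log2
from collections import defaultdict
from itertools import combinations

rrxor = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

def H(p):
    return -sum(x * log2(x) for x in p if x > 0)

def marginal(dist, indices):
    out = defaultdict(float)
    for word, p in dist.items():
        out[''.join(word[i] for i in indices)] += p
    return dict(out)

def coinformation(dist, n):
    # alternating inclusion-exclusion sum over all non-empty subsets of variables
    return -sum((-1) ** len(T) * H(marginal(dist, list(T)).values())
                for k in range(1, n + 1)
                for T in combinations(range(n), k))

print(coinformation(rrxor, 3))  # -1.0
```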

Interaction Information

Interaction Information is another form of the multivariate mutual information, and is very similar to the coinformation, but defined in a slightly different manner which results in a sign change when the number of variables is odd.

II(X_1; X_2; \ldots; X_n) = -\sum_{T\subseteq \{1,2,\ldots,n\}}
(-1)^{n-|T|}H(T) = (-1)^{n} I(X_1; X_2;\ldots;X_n)

So for example, the interaction information for the RRXOR process has the opposite sign compared to the coinformation:

>>> rrxor.interaction_information([0], [1], [2])

Further information about the interaction information can be found in [jakulin2004].

Total Correlation

Total correlation, also known as the multi-information, is any and all redundancy contained in a set of variables. It is equal to the sum of all pairwise mutual informations, plus all three-way mutual informations, plus all four-way mutual informations, and so on.

C(X_1; X_2; \ldots; X_n) = \sum_i H(X_i) - H(X_1, X_2, \ldots, X_n)

Again looking at the RRXOR distribution:

>>> rrxor.total_correlation([0], [1], [2])

Further information about the total correlation can be found in [wanatabe1960].
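The defining formula, the sum of marginal entropies minus the joint entropy, is short enough to check by hand (standalone sketch with our own helpers, not CMPy's API):

```python
from math import log2
from collections import defaultdict

rrxor = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

def H(p):
    return -sum(x * log2(x) for x in p if x > 0)

def marginal(dist, indices):
    out = defaultdict(float)
    for word, p in dist.items():
        out[''.join(word[i] for i in indices)] += p
    return dict(out)

def total_correlation(dist, n):
    # sum of single-variable entropies minus the joint entropy
    return sum(H(marginal(dist, [i]).values()) for i in range(n)) - H(dist.values())

print(total_correlation(rrxor, 3))  # 1.0
```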

Binding Information

The binding information is similar to the Total Correlation, except that it does not count higher-level mutual informations multiple times:

B(X_1; X_2; \ldots; X_n) = H(X_1, X_2, \ldots, X_n) - \sum_{i=1}^{n}
H(X_i | X_{\{1, \ldots, n\} \setminus \{i\}})

Looking at the RRXOR distribution:

>>> rrxor.binding_information()
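Expanding the definition term by term, each variable's entropy conditioned on all the others is subtracted from the joint entropy (standalone sketch with the same hypothetical helpers, not CMPy's API):

```python
from math import log2
from collections import defaultdict

rrxor = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

def H(p):
    return -sum(x * log2(x) for x in p if x > 0)

def marginal(dist, indices):
    out = defaultdict(float)
    for word, p in dist.items():
        out[''.join(word[i] for i in indices)] += p
    return dict(out)

def binding_information(dist, n):
    joint = H(dist.values())
    # H(X_i | rest) = H(joint) - H(rest), summed over every variable
    return joint - sum(
        joint - H(marginal(dist, [j for j in range(n) if j != i]).values())
        for i in range(n))

print(binding_information(rrxor, 3))  # 2.0
```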

In Summary

To sum up what we’ve learned about the RRXOR distribution, we can use a special plotting method in cmpy.infotheory.plotting:

>>> from cmpy.infotheory.plotting import plot_3i
>>> plot_3i(rrxor)