# Measuring Information¶

CMPy implements many measures and functions from information theory. Most of the quantities below are discussed in more depth in [cover2006].

## Entropy¶

Entropy is an uncertainty measure defined by $H[X] = -\sum_x p(x) \log_2 p(x)$. Intuitively, the entropy gives the average number of yes/no questions that must be asked to determine which event from a distribution occurred.

Note

This entropy function can calculate the joint entropy if the distribution passed to it is a joint distribution.
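
For instance, two independent fair coins jointly hold two bits of uncertainty. A quick sketch (the import location of Distribution is an assumption here):

```
>>> from cmpy.infotheory import Distribution  # assumed import path
>>> coins = Distribution({'HH': 1/4, 'HT': 1/4, 'TH': 1/4, 'TT': 1/4})
>>> coins.entropy()
2.0
```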

For example, the average number of yes/no questions that must be asked to determine which face a die shows, for both a fair die and a loaded one:

```
>>> die = Distribution({1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6})
>>> loaded_die = Distribution({1: 3/24, 2: 6/24, 3: 3/24, 4: 3/24, 5: 6/24, 6: 3/24})
>>> die.entropy()
2.5849625007211561
>>> loaded_die.entropy()
2.5
```
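
For a uniform distribution, the entropy is simply $\log_2$ of the number of outcomes, which we can confirm with nothing but the standard library:

```
>>> from math import log
>>> round(log(6, 2), 6)  # log2(6), matching die.entropy() above
2.584963
```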

So we see that biasing the die along the 2-5 axis reduces the entropy. We can also look at the length-3 block entropy of the Even Process:

```
>>> words = Distribution({'000': 1/12,
...                       '001': 1/12,
...                       '011': 1/6,
...                       '100': 1/12,
...                       '101': 1/12,
...                       '110': 1/6,
...                       '111': 1/3})
>>> words.entropy()
2.5849625007211561
```

Interestingly enough, this distinctly skewed distribution over the Even Process's length-3 words contains the same amount of entropy (uncertainty) as a fair die.

For example, consider the RRXOR distribution over $X_1$, $X_2$, and $X_3$, where $X_1$ and $X_2$ are independent fair bits and $X_3 = X_1 \oplus X_2$:

```
>>> rrxor = Distribution({'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4})
>>> rrxor.entropy()
2.0
>>> rrxor.marginal([0]).entropy()
1.0
>>> rrxor.marginal([1]).entropy()
1.0
>>> rrxor.marginal([2]).entropy()
1.0
```

so we see that the entire distribution contains two bits of information, and that each variable when taken on its own seems to have one bit of information.

## Perplexity¶

Perplexity is an uncertainty measure defined by $2^{H[X]}$. It is approximately the size of the typical set; that is, it is approximately the number of sequences you are likely to see.

Let us revisit the loaded die example from before:

```
>>> die = Distribution({1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6})
>>> loaded_die = Distribution({1: 3/24, 2: 6/24, 3: 3/24, 4: 3/24, 5: 6/24, 6: 3/24})
>>> die.perplexity()
6.0000000000000018
>>> loaded_die.perplexity()
5.6568542494923815
```

So we see that the loaded die has a smaller typical set, or effective alphabet size, than the fair die.
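
Since the perplexity is just two raised to the entropy, we can reproduce both values from the entropies computed earlier (assuming the implementation computes it the same way, the digits match exactly):

```
>>> 2**die.entropy()
6.0000000000000018
>>> 2**loaded_die.entropy()
5.6568542494923815
```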

## Conditional Entropy¶

Conditional Entropy is the entropy remaining in a random variable given knowledge of another random variable: $H[X \mid Y] = H[X, Y] - H[Y]$.

Going back to our RRXOR example, let us now look at how much information each variable contains once the other two are known; the argument lists the indices of the conditioning variables:

```
>>> rrxor.conditional_entropy([0,1])
0.0
>>> rrxor.conditional_entropy([0,2])
0.0
>>> rrxor.conditional_entropy([1,2])
0.0
```

that is, given any two variables, the third contains no information of its own. This shows us that any information in this system must be shared; none of it is kept by a single random variable.
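
We can double-check one of these values with the chain rule $H[X_3 \mid X_1, X_2] = H[X_1, X_2, X_3] - H[X_1, X_2]$, using only methods we have already seen (assuming marginal accepts several indices at once):

```
>>> rrxor.entropy() - rrxor.marginal([0, 1]).entropy()
0.0
```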

## Cross Entropy¶

Cross Entropy is approximately the amount of information needed to encode $X$ if one instead uses the distribution of $Y$:

$$H[X \Vert Y] = -\sum_x p_X(x) \log_2 p_Y(x) = H[X] + D[X \Vert Y]$$

From the rightmost form, we can interpret it as the inherent uncertainty in $X$ plus a penalty accrued for using $Y$ instead of $X$; here $D[X \Vert Y]$ is the relative entropy of the next section.
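
This section has no example of its own, but the rightmost form above lets us sketch one using only methods demonstrated elsewhere on this page; the fair and biased coins reappear in the next section:

```
>>> fair = Distribution({0: 1/2, 1: 1/2})
>>> bias = Distribution({0: 1/10, 1: 9/10})
>>> round(fair.entropy() + fair.relative_entropy(bias), 6)  # H[X] + D[X||Y]
1.736966
```

So encoding the fair coin with a code built for the biased coin costs roughly 0.74 extra bits per flip.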

## Relative Entropy¶

Relative Entropy, also known as the *Kullback–Leibler divergence*, is a non-metric measure of the distance between two distributions:

$$D[X \Vert Y] = \sum_x p_X(x) \log_2 \frac{p_X(x)}{p_Y(x)}$$

It is notably neither symmetric, nor does it satisfy the triangle inequality. The relative entropy arises naturally in many information-theoretic situations.

As an example, the likelihood of flipping a fair coin 100 times and seeing 90 heads can be approximated easily using the relative entropy, since by Sanov's theorem it is roughly $2^{-n D[X \Vert Y]}$ with $X$ the observed (biased) distribution and $Y$ the true (fair) one:

```
>>> fair = Distribution({0: 1/2, 1: 1/2})
>>> bias = Distribution({0: 1/10, 1: 9/10})
>>> 2**(-100*bias.relative_entropy(fair))
1.0355583719533411e-16
```

Whereas the likelihood of seeing a 50/50 outcome from a coin biased to a 90% chance of heads would be:

```
>>> 2**(-100*fair.relative_entropy(bias))
6.533186235000712e-23
```
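
As a sanity check, the exponent in this last expression is just $D[X \Vert Y]$ computed term by term, which we can reproduce with the standard library alone:

```
>>> from math import log
>>> round(0.5*log(0.5/0.1, 2) + 0.5*log(0.5/0.9, 2), 6)  # D[fair||bias]
0.736966
```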

## Mutual Information¶

Mutual Information is the amount of information shared by two random variables: $I[X;Y] = H[X] + H[Y] - H[X, Y]$.

Going back to our RRXOR example: we have already seen that each variable contains a total of one bit, but that none of them keeps its bit all to itself. Therefore those bits must be shared, right?

```
>>> rrxor.mutual_information([0], [1])
0.0
>>> rrxor.mutual_information([0], [2])
0.0
>>> rrxor.mutual_information([1], [2])
0.0
```

Uh oh, what's going on? None of the variables hoards its bit to itself, yet none of them shares anything with any other single variable either.
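
We can confirm this pairwise independence directly from the definition above, using entropies of the marginals:

```
>>> rrxor.marginal([0]).entropy() + rrxor.marginal([1]).entropy() - rrxor.marginal([0, 1]).entropy()
0.0
```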

## Conditional Mutual Information¶

Conditional Mutual Information is the amount of information shared by two random variables given knowledge of a third: $I[X;Y \mid Z] = H[X \mid Z] - H[X \mid Y, Z]$.

Let’s revisit RRXOR once again:

```
>>> rrxor.conditional_mutual_information([0], [1], [2])
1.0
>>> rrxor.conditional_mutual_information([0], [2], [1])
1.0
>>> rrxor.conditional_mutual_information([1], [2], [0])
1.0
```

Now we’re getting somewhere! So they *are* sharing information, but only
conditionally. We’ve found our three bits, but that’s still too much – the
entire distribution only has two bits of information. We’re missing one more
piece.

## Multivariate Mutual Information¶

Multivariate mutual information is implemented in the most natural way as the co-information. In a nutshell, it is the amount of information shared among *all* of the variables; in the analogy with set theory, it is the intersection of the information in every variable. So if even one of the variables therein is completely independent of all the others, the multivariate mutual information will be zero.

Let’s look at the RRXOR example one last time:

```
>>> rrxor.mutual_information([0], [1], [2])
-1.0
```

And there it is!
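
This value agrees with the identity $I[X;Y;Z] = I[X;Y] - I[X;Y \mid Z]$, using the quantities we computed above:

```
>>> rrxor.mutual_information([0], [1]) - rrxor.conditional_mutual_information([0], [1], [2])
-1.0
```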

Further information about the co-information can be found in [bell2003].

## Interaction Information¶

Interaction Information is another form of the multivariate mutual information. It is very similar to the co-information, but is defined in a slightly different manner, which results in a sign change when the number of variables is odd.

So, for example, the interaction information for the RRXOR process has the opposite sign compared to the co-information:

```
>>> rrxor.interaction_information([0], [1], [2])
1.0
```
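
For three variables the interaction information is just the negated co-information, which we can check directly:

```
>>> -rrxor.mutual_information([0], [1], [2])
1.0
```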

Further information about the interaction information can be found in [jakulin2004].

## Total Correlation¶

Total Correlation, also known as the *multiinformation*, is any and all redundancy contained in a set of variables. It is equal to the sum of all pairwise mutual informations, plus all three-way mutual informations, plus all four-way mutual informations, and so on.

Again looking at the RRXOR distribution:

```
>>> rrxor.total_correlation([0], [1], [2])
1.0
```
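
Equivalently, the total correlation is the sum of the marginal entropies minus the joint entropy, $\sum_i H[X_i] - H[X_1, \ldots, X_n]$, which we can verify by hand:

```
>>> sum(rrxor.marginal([i]).entropy() for i in range(3)) - rrxor.entropy()
1.0
```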

Further information about the total correlation can be found in [watanabe1960].

## Binding Information¶

The binding information is similar to the Total Correlation, except that it does not count higher-level mutual informations multiple times.

Looking at the RRXOR distribution:

```
>>> rrxor.binding_information()
2.0
```
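
The binding information can also be written as the joint entropy minus each variable's entropy conditioned on all the others, $H[X_1, \ldots, X_n] - \sum_i H[X_i \mid X_{\setminus i}]$. Since we saw above that every such conditional entropy vanishes for RRXOR, the two bits check out:

```
>>> rrxor.entropy() - sum([rrxor.conditional_entropy([0, 1]),
...                        rrxor.conditional_entropy([0, 2]),
...                        rrxor.conditional_entropy([1, 2])])
2.0
```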

## In Summary¶

To sum up what we've learned about the RRXOR distribution, we can use the plotting function plot_3i from cmpy.infotheory.plotting:

```
>>> from cmpy.infotheory.plotting import plot_3i
>>> plot_3i(rrxor)
```