Working With Distributions

Normalizing & Validating

Sometimes you may construct an invalid distribution. It can be invalid in a couple of different ways, and the validate method will tell you which constraint has been violated:

>>> d = Distribution({'A': 0.5, 'B': 0.5, 'C': 0.5})
>>> d.validate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/autoplectic/Python/lib/python2.7/site-packages/cmpy/infotheory/distributions.py", line 633, in validate
    raise InvalidDistribution(total)
cmpy.infotheory.exceptions.InvalidDistribution: Distribution is improperly normalized. Summation was 1.5.
>>> e = Distribution({'A': -0.5, 'B': 1.5})
>>> e.validate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/autoplectic/Python/lib/python2.7/site-packages/cmpy/infotheory/distributions.py", line 635, in validate
    raise InvalidProbability(vals)
cmpy.infotheory.exceptions.InvalidProbability: Distribution has probabilities outside [0,1].
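
Conceptually, validate is enforcing two constraints: the probabilities must sum to 1, and each must lie in [0, 1]. Here is a plain-Python sketch of those two checks (the tolerance is an arbitrary choice for the sketch, not cmpy's implementation):

>>> probs = {'A': 0.5, 'B': 0.5, 'C': 0.5}
>>> abs(sum(probs.values()) - 1.0) < 1e-9          # normalization check fails: total is 1.5
False
>>> all(0.0 <= p <= 1.0 for p in probs.values())   # each value is a valid probability
True
>>> all(0.0 <= p <= 1.0 for p in {'A': -0.5, 'B': 1.5}.values())
False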

For the first of these errors, normalize will fix the issue:

>>> d.normalize()
>>> d
Distribution:
{'A': 0.3333333333333333, 'C': 0.3333333333333333, 'B': 0.3333333333333333}
>>> d.validate()
True
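
Under the hood, normalization is simply division by the current total. A minimal plain-Python sketch of the idea (not cmpy's implementation):

>>> raw = {'A': 0.5, 'B': 0.5, 'C': 0.5}
>>> total = sum(raw.values())
>>> normalized = {event: p / total for event, p in raw.items()}
>>> normalized['A']
0.3333333333333333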

The second error from above cannot be trivially fixed; the user will have to figure out what they really want. It is possible that what was intended was a GeneralizedDistribution, which permits probabilities outside [0, 1]:

>>> e = GeneralizedDistribution({'A': -0.5, 'B': 1.5})
>>> e.validate()
True

Converting Between Distribution Types

A standard Distribution can be converted to a LogDistribution like so:

>>> d = Distribution({'A': 1/2, 'B': 1/4, 'C': 1/8, 'D': 1/8})
>>> l = d.to_log()
>>> l
LogDistribution:
{'A': -1.0, 'C': -3.0, 'B': -2.0, 'D': -3.0}

And LogDistributions can be converted to Distributions similarly:

>>> d = l.to_dist()
>>> d
Distribution:
{'A': 0.5, 'C': 0.125, 'B': 0.25, 'D': 0.125}
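
In both directions the conversion is an element-wise transform; judging from the values above, the logarithms are base 2. A rough plain-Python sketch of the equivalent round trip (using Python 3's math.log2; it illustrates the arithmetic, not cmpy's implementation):

>>> from math import log2
>>> probs = {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
>>> logs = {event: log2(p) for event, p in probs.items()}
>>> logs['C']
-3.0
>>> probs_again = {event: 2 ** lp for event, lp in logs.items()}
>>> probs_again['C']
0.125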

SymbolicDistribution types can be converted to either Distributions or LogDistributions by supplying a mapping from the variables to numeric values:

>>> p, q, r = symbols('p q r')
>>> s = SymbolicDistribution({'A': p, 'B': q, 'C': r, 'D': r})
>>> s.to_dist({p: 1/2, q: 1/4, r: 1/8})
Distribution:
{'A': 0.5, 'C': 0.125, 'B': 0.25, 'D': 0.125}
>>> s.to_log({p: 1/2, q: 1/4, r: 1/8})
LogDistribution:
{'A': -1.0, 'C': -3.0, 'B': -2.0, 'D': -3.0}

Note

Notice that we still supplied non-log values for p, q, and r to the to_log method. Symbolically, log(p) is substituted for p prior to the numeric substitution, which retains accuracy.
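
To see why this ordering helps, here is a small SymPy sketch of the idea (an illustration only, not cmpy's internals): because the logarithm is applied while p is still symbolic, an exact substitution such as Rational(1, 8) is only converted to a float at the very last step.

>>> from sympy import symbols, log, Rational
>>> p = symbols('p')
>>> log_p = log(p, 2)                    # log taken symbolically, before any numbers appear
>>> log_p.subs(p, Rational(1, 8)).evalf()
-3.00000000000000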

Marginalizing

If your distribution is a joint distribution, it can be used to construct a marginal distribution over any set of sub-events in the joint:

>>> words = Distribution({'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4})
>>> first = words.marginal([0])
>>> first
Distribution:
{'1**': 0.5, '0**': 0.5}
>>> words.marginal([0, 2])
Distribution:
{'1*0': 0.25, '1*1': 0.25, '0*1': 0.25, '0*0': 0.25}

If you would rather not keep the wildcard placeholders in your marginal distribution, the optional keyword clean can be specified:

>>> words.marginal([1], clean=True)
Distribution:
{'1': 0.5, '0': 0.5}

To do the opposite and instead supply the indices to sum over and throw away, use marginalize:

>>> first = words.marginalize([1, 2])
>>> first
Distribution:
{'1**': 0.5, '0**': 0.5}
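
Both methods come down to the same operation: group the joint events by their symbols at the kept indices and sum the probabilities in each group; marginal names the indices to keep, while marginalize names the indices to sum over. A plain-Python sketch of the grouping (clean-style keys, not cmpy's implementation):

>>> words = {'000': 0.25, '011': 0.25, '101': 0.25, '110': 0.25}
>>> keep = [0]                                 # marginal([0]) / marginalize([1, 2])
>>> marginal = {}
>>> for event, p in words.items():
...     key = ''.join(event[i] for i in keep)
...     marginal[key] = marginal.get(key, 0.0) + p
...
>>> sorted(marginal.items())
[('0', 0.5), ('1', 0.5)]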

Conditioning

Conditional distributions can be constructed from joint distributions in much the same way that marginal distributions are:

>>> words.condition_on([0])
ConditionalDistribution:
{'1**': Distribution:
{'*01': 0.5, '*10': 0.5}, '0**': Distribution:
{'*00': 0.5, '*11': 0.5}}

So a conditional distribution is a mapping from marginalized events to distributions over the remaining sub-events from the joint. Conditioning also supports the optional clean parameter:

>>> words.condition_on([0,1], clean=True)
ConditionalDistribution:
{'11': Distribution:
{'0': 1.0}, '10': Distribution:
{'1': 1.0}, '00': Distribution:
{'0': 1.0}, '01': Distribution:
{'1': 1.0}}
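
Mechanically, conditioning is grouping followed by renormalization: the joint events are bucketed by their symbols at the conditioned indices, and each bucket is rescaled so that it sums to 1. A rough plain-Python sketch (clean-style keys, not cmpy's implementation):

>>> words = {'000': 0.25, '011': 0.25, '101': 0.25, '110': 0.25}
>>> cond = [0, 1]                              # condition_on([0, 1])
>>> buckets = {}
>>> for event, p in words.items():
...     key = ''.join(event[i] for i in cond)
...     rest = ''.join(s for i, s in enumerate(event) if i not in cond)
...     buckets.setdefault(key, {})[rest] = p
...
>>> group = buckets['11']                      # events whose first two symbols are '11'
>>> {rest: p / sum(group.values()) for rest, p in group.items()}
{'0': 1.0}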