# Infer machine¶

In this notebook we infer the start state/node and transition probabilities for a machine. This is not structural inference because we assume a single structure/topology as given. However, elements of this level of inference are used as the basis of structural inference, to be demonstrated in another notebook.

To start, import code from CMPy

In[1]:

import cmpy
from cmpy.inference.bayesianem import InferEM


## Machine – data source and topology¶

Choose a data source of interest and draw the machine.

In[2]:

# choose a machine here
eM = cmpy.machines.EvenOdd()

# draw the machine
eM.draw(show='ipynb')


Out[2]:

<IPython.core.display.Image at 0x3993110>


## Prior¶

The prior provides distributions over our a priori knowledge of properties to be inferred. We assume the machine topology/structure is given, but that

1. start state/node is unknown
1. transitition probabilities (not defined to be 1.0) are unknown.

In the code, we choose very diffuse (non-informative) priors.

To investigate the prior, choose the InferEM class and pass None instead of data:

In[3]:

prior = InferEM(eM,None)


Use the summary_string() method to see what the prior assigns. Notice:

• All start states/nodes should be equally likely
• All outgoing transition probabilities, given a start state, should also be (on average) equal
• Transition probabilities are conditioned on each start state – the prior should reflect priors over transition probabilities for an -state machine

Values for the parameters of the prior are also provided:

• : weight for edge
• : weight for node (sum of values for outgoing edges of node)

A value of None means the topology defines a transistion probability to 1.0 and this is not a parameter to be inferred.

Similar output for observed data (there is none at this point) are given by

• : counts for edge
• : counts for node (sum of values for outgoing edges of node)

All of these are None because there is no data for the prior. Transition probabilties are given as the expectation with respect to the prior. For a given edge, that means

In[4]:

print prior.summary_string()

eMachine Inference
* Machine: EvenOdd Process
* PRIOR -- NO data, ie D=None

** PROBABILITY OF START STATES
**** P(A|M): 2.500000e-01
**** P(B|M): 2.500000e-01
**** P(C|M): 2.500000e-01
**** P(D|M): 2.500000e-01

Dirichlet Distribution:
**Possible start nodes: ['A', 'C', 'B', 'D']
**No data available -- prior

NODES
startNode A
A -- A:    2  N: None
B -- A: None  N: None
C -- A:    2  N: None
D -- A: None  N: None
startNode B
A -- A:    2  N: None
B -- A: None  N: None
C -- A:    2  N: None
D -- A: None  N: None
startNode C
A -- A:    2  N: None
B -- A: None  N: None
C -- A:    2  N: None
D -- A: None  N: None
startNode D
A -- A:    2  N: None
B -- A: None  N: None
C -- A:    2  N: None
D -- A: None  N: None

EDGES
startNode A
('A', '0') -- a:    1  n: None  Dirchlet Exp[p(0|A)]:0.500000
('A', '1') -- a:    1  n: None  Dirchlet Exp[p(1|A)]:0.500000
('B', '0') -- a: None  n: None               p(0|B) :1.000000
('C', '0') -- a:    1  n: None  Dirchlet Exp[p(0|C)]:0.500000
('C', '1') -- a:    1  n: None  Dirchlet Exp[p(1|C)]:0.500000
('D', '1') -- a: None  n: None               p(1|D) :1.000000
startNode B
('A', '0') -- a:    1  n: None  Dirchlet Exp[p(0|A)]:0.500000
('A', '1') -- a:    1  n: None  Dirchlet Exp[p(1|A)]:0.500000
('B', '0') -- a: None  n: None               p(0|B) :1.000000
('C', '0') -- a:    1  n: None  Dirchlet Exp[p(0|C)]:0.500000
('C', '1') -- a:    1  n: None  Dirchlet Exp[p(1|C)]:0.500000
('D', '1') -- a: None  n: None               p(1|D) :1.000000
startNode C
('A', '0') -- a:    1  n: None  Dirchlet Exp[p(0|A)]:0.500000
('A', '1') -- a:    1  n: None  Dirchlet Exp[p(1|A)]:0.500000
('B', '0') -- a: None  n: None               p(0|B) :1.000000
('C', '0') -- a:    1  n: None  Dirchlet Exp[p(0|C)]:0.500000
('C', '1') -- a:    1  n: None  Dirchlet Exp[p(1|C)]:0.500000
('D', '1') -- a: None  n: None               p(1|D) :1.000000
startNode D
('A', '0') -- a:    1  n: None  Dirchlet Exp[p(0|A)]:0.500000
('A', '1') -- a:    1  n: None  Dirchlet Exp[p(1|A)]:0.500000
('B', '0') -- a: None  n: None               p(0|B) :1.000000
('C', '0') -- a:    1  n: None  Dirchlet Exp[p(0|C)]:0.500000
('C', '1') -- a:    1  n: None  Dirchlet Exp[p(1|C)]:0.500000
('D', '1') -- a: None  n: None               p(1|D) :1.000000


## Uncertainty in the prior¶

Looking at the output above makes it seem like there is a single representation,or setting, of the machine specified. However, this is not true. The prior assigns a distribution over possible machines with the assumed topology. The best way to understand this is to sample machines from the prior. Samples reflect the range of ‘machines’ possible given the prior – averages over the sample should reflect the output above.

Let’s sample five machines from the prior:

In[5]:

for n in range(5):
(node,machine) = prior.generate_sample()
print 'name: ', machine
print 'startNode: ', str(node)
print  machine.to_string(), '\n'

name:  Sampled Machine, Start Node: A, Top: EvenOdd Process (reduced) (minimized)
startNode:  A
('A',) ('B',) 0 0.807419158308;
('A',) ('C',) 1 0.192580841692;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.782957754644;
('C',) ('D',) 1 0.217042245356;
('D',) ('C',) 1 1.0;

name:  Sampled Machine, Start Node: B, Top: EvenOdd Process (reduced) (minimized)
startNode:  B
('A',) ('B',) 0 0.580980998312;
('A',) ('C',) 1 0.419019001688;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.613005075527;
('C',) ('D',) 1 0.386994924473;
('D',) ('C',) 1 1.0;

name:  Sampled Machine, Start Node: C, Top: EvenOdd Process (reduced) (minimized)
startNode:  C
('A',) ('B',) 0 0.500716151058;
('A',) ('C',) 1 0.499283848942;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.594676226571;
('C',) ('D',) 1 0.405323773429;
('D',) ('C',) 1 1.0;

name:

 Sampled Machine, Start Node: C, Top: EvenOdd Process (reduced) (minimized)
startNode:  C
('A',) ('B',) 0 0.928242263136;
('A',) ('C',) 1 0.0717577368645;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.568049272319;
('C',) ('D',) 1 0.431950727681;
('D',) ('C',) 1 1.0;

name:  Sampled Machine, Start Node: D, Top: EvenOdd Process (reduced) (minimized)
startNode:  D
('A',) ('B',) 0 0.767459402098;
('A',) ('C',) 1 0.232540597902;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.268509531325;
('C',) ('D',) 1 0.731490468675;
('D',) ('C',) 1 1.0;


## and from the prior¶

It also makes sense to ask what the prior says about functions of parameters to be inferred. The prior over parameters results in a distribution over any function of these parameters. For example, in computational mechanics we are often interested in quantities like the entropy rate and statistical complexity. The distribution over or can be approximated using samples from the prior. As we will see, a uniform prior over start state and transition probabilities does not result in a uniform distribution over functions of the parameters.

Let’s take 5000 samples (increase to get better approximation):

In[6]:

num_samples = 5000 # increase for better estimate
prior_hmu_samples = []
prior_Cmu_samples = []
for n in range(num_samples):
(node,machine) = prior.generate_sample()
prior_hmu_samples.append(machine.entropy_rate())
if type(machine) == cmpy.machines.epsilonmachine.RecurrentEpsilonMachine:
prior_Cmu_samples.append(machine.statistical_complexity())

# hmu
# prior mean, median
print
print 'Prior mean of h_mu  :', average(prior_hmu_samples)
print 'Prior median of h_mu:', median(prior_hmu_samples)
# calculate regions of 95% condfidence from samples
try:
from scipy.stats.mstats import mquantiles
print '95 percent (%f,%f)' % tuple(mquantiles(prior_hmu_samples,prob=[0.025,1.-0.025],
alphap=1., betap=1.))
except:
pass

# Cmu
# prior mean, median
print
print 'Prior mean of C_mu  :', average(prior_Cmu_samples)
print 'Prior median of C_mu:', median(prior_Cmu_samples)
# calculate regions of 95% condfidence from samples
try:
from scipy.stats.mstats import mquantiles
print '95 percent (%f,%f)' % tuple(mquantiles(prior_Cmu_samples,prob=[0.025,1.-0.025],
alphap=1., betap=1.))
except:
pass

print '\n * %d of %d sampled HMMs are epsilon machines' % (len(prior_Cmu_samples), num_samples)

Prior mean of h_mu  : 0.389736887012
Prior median of h_mu: 0.425225760401
95 percent (0.056254,0.568495)

Prior mean of C_mu  : 1.72418894596
Prior median of C_mu: 1.79556971077
95 percent (1.128574,1.983849)

* 5000 of 5000 sampled HMMs are epsilon machines


Plot the samples:

In[7]:

n, bins, patches = plt.hist(prior_hmu_samples, 50,
range=[0.9*min(prior_hmu_samples),1.1*max(prior_hmu_samples)],
normed=1, facecolor='blue', alpha=0.75,
cumulative=False)
plt.xlabel(r'$h_{\mu}$')
plt.ylabel(r'$Pr(h_{\mu} \vert M)$')
plt.title('Samples of entropy rate -- prior')
plt.grid(True)

plt.show()


Plot the samples:

In[8]:

n, bins, patches = plt.hist(prior_Cmu_samples, 50,
range=[0.9*min(prior_Cmu_samples),1.1*max(prior_Cmu_samples)],
normed=1, facecolor='blue', alpha=0.75,
cumulative=False)
plt.xlabel(r'$C_{\mu}$')
plt.ylabel(r'$Pr(C_{\mu} \vert M)$')
plt.title('Samples of statistical complexity -- prior')
plt.grid(True)

plt.show()


Plot samples in , plane

In[9]:

plt.scatter(prior_hmu_samples,prior_Cmu_samples,s=20,
facecolor='blue', edgecolor='none',alpha=0.25)
plt.ylabel(r'$C_{\mu}$')
plt.xlabel(r'$h_{\mu}$')
plt.title(r'$C_{\mu}$ vs $h_{\mu}$ -- prior')
plt.grid(True)

plt.show()


## Posterior¶

The posterior provides distributions over properties to be inferred given data and a priori assumptions. As with the prior, we assume the machine topology/structure is given, but that

1. start state/node is unknown
1. transitition probabilities (not defined to be 1.0) are unknown.

## Data¶

Generate some data (we use 1000 symbols – more data results in better inference) to use for inference:

In[10]:

eM.randomize_current_node()        # randomize current node
startNode = eM.get_current_node()  # save the node, for later comparison
print 'startNode:', startNode
data = eM.symbols(1000)            # create data

startNode: C


To obtain the posterior, pass machine topology/structure and data to InferEM:

In[11]:

# pass data to get posterior
posterior = InferEM(eM,data)


As with the prior, use the summary_string() method to see what the posterior assigns. Compare the output with the prior – the effects of data should be apparent. In particular, the probability of each start state/node should no longer be equal and transition probabilities not defined to be 1.0 should be different than the prior estimations.

Also, note that counts for each node and edge are available conditioned on start state (if a path is possible given that start node).

In[12]:

print posterior.summary_string()

eMachine Inference
* Machine: EvenOdd Process
* POSTERIOR -- data available

** EVIDENCE TERMS
* logP(D|M): -4.034870e+02
** logP(D|A, M): -4.028236e+02
**** logP(D|B, M): -inf
**** logP(D|C, M): -4.027650e+02
**** logP(D|D, M): -inf

** PROBABILITY OF START STATES
**** P(A|D, M): 4.853461e-01
**** P(B|D, M): 0.000000e+00
**** P(C|D, M): 5.146539e-01
**** P(D|D, M): 0.000000e+00

Dirichlet Distribution:
**Possible start nodes: ['A', 'C']
**Data available -- posterior
** log P(D | s_i,0=A , M_i) = -4.0282360867e+02
** log P(D | s_i,0=C , M_i) = -4.0276497641e+02

NODES
startNode A
A -- A:    2  N:  291
B -- A: None  N:  290
C -- A:    2  N:  283
D -- A: None  N:  136
startNode C
A -- A:    2  N:  290
B -- A: None  N:  290
C -- A:    2  N:  284
D -- A: None  N:  136

EDGES
startNode A
('A', '0') -- a:    1  n:  143  Dirchlet Exp[p(0|A)]:0.491468
('A', '1') -- a:    1  n:  148  Dirchlet Exp[p(1|A)]:0.508532
('B', '0') -- a: None  n:  290               p(0|B) :1.000000
('C', '0') -- a:    1  n:  147  Dirchlet Exp[p(0|C)]:0.519298
('C', '1') -- a:    1  n:  136  Dirchlet Exp[p(1|C)]:0.480702
('D', '1') -- a: None  n:  136               p(1|D) :1.000000
startNode C
('A', '0') -- a:    1  n:  142  Dirchlet Exp[p(0|A)]:0.489726
('A', '1') -- a:    1  n:  148  Dirchlet Exp[p(1|A)]:0.510274
('B', '0') -- a: None  n:  290               p(0|B) :1.000000
('C', '0') -- a:    1  n:  148  Dirchlet Exp[p(0|C)]:0.520979
('C', '1') -- a:    1  n:  136  Dirchlet Exp[p(1|C)]:0.479021
('D', '1') -- a: None  n:  136               p(1|D) :1.000000


## Uncertainty in the posterior¶

The posterior, like the prior, assigns a distribution over possible machines with the assumed topology. Now this distribution is shaped by data and a priori assumptions. However, there is still uncertainty in parameter settings – although the amount of uncertainty should be much reduced, depending on the amount of data available. Again, let’s sample machines, this time from the posterior:

In[13]:

for n in range(5):
(node,machine) = posterior.generate_sample()
print 'name: ', machine
print 'startNode: ', str(node)
print  machine.to_string(), '\n'

name:  Sampled Machine, Start Node: A, Top: EvenOdd Process (reduced) (minimized)
startNode:  A
('A',) ('B',) 0 0.486735301385;
('A',) ('C',) 1 0.513264698615;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.550042993493;
('C',) ('D',) 1 0.449957006507;
('D',) ('C',) 1 1.0;

name:  Sampled Machine, Start Node: A, Top: EvenOdd Process (reduced) (minimized)
startNode:  A
('A',) ('B',) 0 0.514669695279;
('A',) ('C',) 1 0.485330304721;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.53546230131;
('C',) ('D',) 1 0.46453769869;
('D',) ('C',) 1 1.0;

name:  Sampled Machine, Start Node: C, Top: EvenOdd Process (reduced) (minimized)
startNode:  C
('A',) ('B',) 0 0.421802263194;
('A',) ('C',) 1 0.578197736806;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.489879988389;
('C',) ('D',) 1 0.510120011611;
('D',) ('C',) 1 1.0;

name:

 Sampled Machine, Start Node: C, Top: EvenOdd Process (reduced) (minimized)
startNode:  C
('A',) ('B',) 0 0.485779423905;
('A',) ('C',) 1 0.514220576095;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.496308145276;
('C',) ('D',) 1 0.503691854724;
('D',) ('C',) 1 1.0;

name:  Sampled Machine, Start Node: C, Top: EvenOdd Process (reduced) (minimized)
startNode:  C
('A',) ('B',) 0 0.492476519525;
('A',) ('C',) 1 0.507523480475;
('B',) ('A',) 0 1.0;
('C',) ('B',) 0 0.501246160218;
('C',) ('D',) 1 0.498753839782;
('D',) ('C',) 1 1.0;


## and from the posterior¶

Next, consider what the posterior says about functions of parameters to be inferred. As we did with the prior, let’s estimate the distributions for and as example functions and consider the difference data makes in the results – hopefully the distributions are near the correct values if enough data was provided.

In[14]:

num_samples = 5000 # increase for better estimate
posterior_hmu_samples = []
posterior_Cmu_samples = []
for n in range(num_samples):
(node,machine) = posterior.generate_sample()
posterior_hmu_samples.append(machine.entropy_rate())
if type(machine) == cmpy.machines.epsilonmachine.RecurrentEpsilonMachine:
posterior_Cmu_samples.append(machine.statistical_complexity())

# hmu
# posterior mean, median
print 'Posterior mean of h_mu  :', average(posterior_hmu_samples)
print 'Posterior median of h_mu:', median(posterior_hmu_samples)
# calculate regions of 95% condfidence from samples
try:
from scipy.stats.mstats import mquantiles
print '95 percent (%f,%f)' % tuple(mquantiles(posterior_hmu_samples,prob=[0.025,1.-0.025],
alphap=1., betap=1.))
except:
pass
print 'Real value: ', eM.entropy_rate()

# Cmu
# posterior mean, median
print
print 'Posterior mean of C_mu  :', average(posterior_Cmu_samples)
print 'Posterior median of C_mu:', median(posterior_Cmu_samples)
# calculate regions of 95% condfidence from samples
try:
from scipy.stats.mstats import mquantiles
print '95 percent (%f,%f)' % tuple(mquantiles(posterior_Cmu_samples,prob=[0.025,1.-0.025],
alphap=1., betap=1.))
except:
pass
print 'Real value: ', eM.statistical_complexity()
print '\n * %d of %d sampled HMMs are epsilon machines' % (len(posterior_Cmu_samples), num_samples)

Posterior mean of h_mu  : 0.571952448396
Posterior median of h_mu: 0.572626412905
95 percent (0.564680,0.575397)

Real value:  0.571428571429

Posterior mean of C_mu  : 1.94198131217
Posterior median of C_mu: 1.94337447166
95 percent (1.911174,1.965286)

Real value:  1.95021206491

* 5000 of 5000 sampled HMMs are epsilon machines


Plot the samples:

In[15]:

n, bins, patches = plt.hist(posterior_hmu_samples, 50,
range=[0.9*min(posterior_hmu_samples),1.1*max(posterior_hmu_samples)],
normed=1, facecolor='green', alpha=0.75,
cumulative=False)
plt.xlabel(r'$h_{\mu}$')
plt.ylabel(r'$Pr(h_{\mu} \vert D, M)$')  # note, conditioned on data and topology
plt.title('Samples of entropy rate -- posterior')
plt.grid(True)

plt.show()


In[16]:

n, bins, patches = plt.hist(posterior_Cmu_samples, 50,
range=[0.9*min(posterior_Cmu_samples),1.1*max(posterior_Cmu_samples)],
normed=1, facecolor='green', alpha=0.75,
cumulative=False)
plt.xlabel(r'$C_{\mu}$')
plt.ylabel(r'$Pr(C_{\mu} \vert D, M)$')  # note, conditioned on data and topology
plt.title('Samples of statistical complexity -- posterior')
plt.grid(True)

plt.show()


In[17]:

plt.scatter(posterior_hmu_samples,posterior_Cmu_samples,s=20,
facecolor='green', edgecolor='none',alpha=0.75)
plt.ylabel(r'$C_{\mu}$')
plt.xlabel(r'$h_{\mu}$')
plt.title(r'$C_{\mu}$ vs $h_{\mu}$ -- posterior')
plt.grid(True)

plt.show()


## Compare prior and posterior¶

Finally, compare prior and posterior in the complexity-entropy rate plane

In[18]:

# prior - blue
plt.scatter(prior_hmu_samples,prior_Cmu_samples,s=20,
facecolor='blue', edgecolor='none',alpha=0.25)

# posterior - green
plt.scatter(posterior_hmu_samples,posterior_Cmu_samples,s=20,
facecolor='green', edgecolor='none',alpha=0.75)

plt.ylabel(r'$C_{\mu}$')
plt.xlabel(r'$h_{\mu}$')
plt.title(r'$C_{\mu}$ vs $h_{\mu}$ -- prior(blue) / posterior(green)')
plt.grid(True)

plt.show()