Markov Chains

author: Jacob Schreiber contact: jmschreiber91@gmail.com

Markov chains are the simplest probabilistic model describing a sequence of observations. Essentially, for an n-th order Markov chain, each observation is modeled as \(P(X_{t} | X_{t-1}, ..., X_{t-n})\) and the probability of the entire sequence is the product of these probabilities for each observation. Naturally, the first observation in the sequence cannot be modeled as this conditional distribution and so is usually modeled as a marginal distribution. The remaining \(n-1\) observations also cannot be modeled as this full conditional distribution and so are modeled by smaller distributions.

These chains are easy to implement and use in pomegranate.

[1]:

%pylab inline
import seaborn; seaborn.set_style('whitegrid')
import torch

%load_ext watermark
%watermark -m -n -p torch,pomegranate

Populating the interactive namespace from numpy and matplotlib
torch      : 1.13.0
pomegranate: 1.0.0

Compiler    : GCC 11.2.0
OS          : Linux
Release     : 4.15.0-208-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 8
Architecture: 64bit

Initialization and Fitting

Initializing a Markov chain is simple. If you have fit distributions, you can pass them in and then use the model for inferene. If you do not have fit distributions, you can specify the k parameter, which is the number of previous observations that the probability of each observation is conditioned on. pomegranate will automatically construct the first k-1 distributions as well as the main conditional distribution.

[2]:

from pomegranate.markov_chain import MarkovChain

model = MarkovChain(k=3)

The model can then be fit to data using the fit function. However, this data must be three dimensional, with the dimensions being (n_samples, length, dimensions). Most other models in pomegranate only use the first two. This does mean that Markov chains can be multivariate but a multivariate model will assume each of the dimensions are independent of each other.

[3]:

X = torch.tensor([[[1], [0], [0], [1]],
                  [[0], [1], [0], [0]],
                  [[0], [0], [0], [0]],
                  [[0], [0], [0], [1]],
                  [[0], [1], [1], [0]]])

model.fit(X)

[3]:

MarkovChain()

We can then inspect the distribution and compare them to the data to ensure that they’re right.

[4]:

model.distributions[0].probs[0]

[4]:

tensor([0.8000, 0.2000])

[5]:

model.distributions[1].probs[0]

[5]:

Parameter containing:
tensor([[0.5000, 0.5000],
        [1.0000, 0.0000]])

[6]:

model.distributions[2].probs[0]

[6]:

Parameter containing:
tensor([[[1.0000, 0.0000],
         [0.5000, 0.5000]],

        [[1.0000, 0.0000],
         [0.5000, 0.5000]]])

If we wanted to fit a multivariate model all we would need to do is increase the size of the second dimension.

Probability and Log Probability

The probability of a sequence under a Markov chain is the product of the probabilities of each observation given the previous observations. Another way of putting this is that the joint probability of a sequence with n observations \(P(X_{1} ... X_{n})\) is factorized along a chain and equal to \(P(X_{1}) P(X_{2} | X_{1}) \prod\limits_{t=3}^{n} P(X_{t} | X_{t-1}, X_{t-2})\) for a third order Markov chain.

If you data is multivariate, the probability of each dimension is independent and multiplied together at the end. If you would like dependencies between your dimensions, you should consider encoding a single dimension to include all combinations of observations across the features.

We can calculate the probability and log probability just as easily as other models.

[7]:

model.probability(X)

[7]:

tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])

[8]:

model.log_probability(X)

[8]:

tensor([-1.6094, -1.6094, -1.6094, -1.6094, -1.6094])

Summarization

Markov chains can perform summarization of data just like other models but that data has to have the three dimensions mentioned before. Further, each chunk of data that is summarized must have the same length. This means that if you have data with different lengths, you must either summarize them one at a time or bucket the sequences such that each bucket has all the sequences of the same length.

[9]:

X = torch.randint(2, size=(30, 8, 1))

model.summarize(X[:10])
model.summarize(X[10:])
model.from_summaries()

Sampling

[10]:

X_sample = model.sample(100000).type(torch.float32)

[11]:

model.distributions[0].probs

[11]:

Parameter containing:
tensor([[0.4667, 0.5333]])

[12]:

X_sample[:, 0, 0].mean()

[12]:

tensor(0.5332)

[13]:

model.distributions[1].probs[0]

[13]:

Parameter containing:
tensor([[0.3571, 0.6429],
        [0.5000, 0.5000]])

[14]:

X_sample[X_sample[:, 0, 0] == 1, 1, 0].mean()

[14]:

tensor(0.5039)

[15]:

model.distributions[2].probs[0]

[15]:

Parameter containing:
tensor([[[0.4000, 0.6000],
         [0.2222, 0.7778]],

        [[0.1250, 0.8750],
         [0.7500, 0.2500]]])

[16]:

X_sample[(X_sample[:, 0, 0] == 1) & (X_sample[:, 1, 0] == 1), 2, 0].mean()

[16]:

tensor(0.2550)