# General Mixture Models¶

IPython Notebook Tutorial

General Mixture Models (GMMs) are an unsupervised model composed of multiple distributions (commonly also referred to as components) and corresponding weights. This allows you to model more sophisticated phenomena probabilistically. A common task is to figure out which component a new data point comes from given only a large quantity of unlabelled data.

## Initialization¶

General Mixture Models can be initialized in two ways depending on if you know the initial parameters of the distributions of not. If you do know the prior parameters of the distributions then you can pass them in as a list. These do not have to be the same distribution–you can mix and match distributions as you want. You can also pass in the weights, or the prior probability of a sample belonging to that component of the model.

>>> from pomegranate import *
>>> gmm = GeneralMixtureModel([NormalDistribution(5, 2), NormalDistribution(1, 2)], weights=[0.33, 0.67])


If you do not know the initial parameters, then the components can be initialized using kmeans to find initial clusters. Initial parameters for the models are then extracted from the clusters and EM is used to fine tune the model.

>>> from pomegranate import *
>>> gmm = GeneralMixtureModel( NormalDistribution, n_components=2 )


This allows any distribution in pomegranate to be natively used in GMMs.

## Log Probability¶

The probability of a point is the sum of its probability under each of the components, multiplied by the weight of each component c, $$P(D|M) = \sum\limits_{c \in M} P(D|c)$$. This is easily calculated by summing the probability under each distribution in the mixture model and multiplying by the appropriate weights, and then taking the log.

## Prediction¶

The common prediction tasks involve predicting which component a new point falls under. This is done using Bayes rule $$P(M|D) = \frac{P(D|M)P(M)}{P(D)}$$ to determine the posterior probability $$P(M|D)$$ as opposed to simply the likelihood $$P(D|M)$$. Bayes rule indicates that it isn’t simply the likelihood function which makes this prediction but the likelihood function multiplied by the probability that that distribution generated the sample. For example, if you have a distribution which has 100x as many samples fall under it, you would naively think that there is a ~99% chance that any random point would be drawn from it. Your belief would then be updated based on how well the point fit each distribution, but the proportion of points generated by each sample is important as well.

We can get the component label assignments using model.predict(data), which will return an array of indexes corresponding to the maximally likely component. If what we want is the full matrix of $$P(M|D)$$, then we can use model.predict_proba(data), which will return a matrix with each row being a sample, each column being a component, and each cell being the probability that that model generated that data. If we want log probabilities instead we can use model.predict_log_proba(data) instead.

## Fitting¶

Training GMMs faces the classic chicken-and-egg problem that most unsupervised learning algorithms face. If we knew which component a sample belonged to, we could use MLE estimates to update the component. And if we knew the parameters of the components we could predict which sample belonged to which component. This problem is solved using expectation-maximization, which iterates between the two until convergence. In essence, an initialization point is chosen which usually is not a very good start, but through successive iteration steps, the parameters converge to a good ending.

These models are fit using model.fit(data). A maximimum number of iterations can be specified as well as a stopping threshold for the improvement ratio. See the API reference for full documentation.

## API Reference¶

class pomegranate.gmm.GeneralMixtureModel

A General Mixture Model.

This mixture model can be a mixture of any distribution as long as they are all of the same dimensionality. Any object can serve as a distribution as long as it has fit(X, weights), log_probability(X), and summarize(X, weights)/from_summaries() methods if out of core training is desired.

Parameters: distributions : array-like, shape (n_components,) or callable The components of the model. If array, corresponds to the initial distributions of the components. If callable, must also pass in the number of components and kmeans++ will be used to initialize them. weights : array-like, optional, shape (n_components,) The prior probabilities corresponding to each component. Does not need to sum to one, but will be normalized to sum to one internally. Defaults to None. n_components : int, optional If a callable is passed into distributions then this is the number of components to initialize using the kmeans++ algorithm. Defaults to None.

Examples

>>> from pomegranate import *
>>> clf = GeneralMixtureModel([
>>>     NormalDistribution(5, 2),
>>>     NormalDistribution(1, 1)])
>>> clf.log_probability(5)
-2.304562194038089
>>> clf.predict_proba([[5], [7], [1]])
array([[ 0.99932952,  0.00067048],
[ 0.99999995,  0.00000005],
[ 0.06337894,  0.93662106]])
>>> clf.fit([[1], [5], [7], [8], [2]])
>>> clf.predict_proba([[5], [7], [1]])
array([[ 1.        ,  0.        ],
[ 1.        ,  0.        ],
[ 0.00004383,  0.99995617]])
>>> clf.distributions
array([ {
"frozen" :false,
"class" :"Distribution",
"parameters" :[
6.6571359101390755,
1.2639830514274502
],
"name" :"NormalDistribution"
},
{
"frozen" :false,
"class" :"Distribution",
"parameters" :[
1.498707696758334,
0.4999983303277837
],
"name" :"NormalDistribution"
}], dtype=object)


Attributes

 distributions (array-like, shape (n_components,)) The component distribution objects. weights (array-like, shape (n_components,)) The learned prior weight of each object
copy()

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters: None distribution : Distribution A copy of the distribution with the same parameters.
fit()

Fit the model to new data using EM.

This method fits the components of the model to new data using the EM method. It will iterate until either max iterations has been reached, or the stop threshold has been passed.

This is a sklearn wrapper for train method.

Parameters: X : array-like, shape (n_samples, n_dimensions) This is the data to train on. Each row is a sample, and each column is a dimension to train on. weights : array-like, shape (n_samples,), optional The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None. inertia : double, optional The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0. stop_threshold : double, optional, positive The threshold at which EM will terminate for the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Default is 0.1. max_iterations : int, optional, positive The maximum number of iterations to run EM for. If this limit is hit then it will terminate training, regardless of how well the model is improving per iteration. Default is 1e8. verbose : bool, optional Whether or not to print out improvement information over iterations. Default is False. improvement : double The total improvement in log probability P(D|M)
freeze()

Freeze the distribution, preventing updates from occuring.

from_summaries()

Fit the model to the collected sufficient statistics.

Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.

Parameters: inertia : double, optional The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0. None
log_probability()

Calculate the log probability of a point under the distribution.

The probability of a point is the sum of the probabilities of each distribution multiplied by the weights. Thus, the log probability is the sum of the log probability plus the log prior.

This is the python interface.

Parameters: X : numpy.ndarray, shape=(n, d) or (n, m, d) The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each obervation, and X becomes an array of n (m, d)-shaped arrays. log_probability : double The log probabiltiy of the point under the distribution.
predict()

Predict the most likely component which generated each sample.

Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.

This is a sklearn wrapper for the maximum_a_posteriori method.

Parameters: X : array-like, shape (n_samples, n_dimensions) The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in. y : array-like, shape (n_samples,) The predicted component which fits the sample the best.
predict_log_proba()

Calculate the posterior log P(M|D) for data.

Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the probabilities should sum to 1

This is a sklearn wrapper for the original posterior function.

Parameters: X : array-like, shape (n_samples, n_dimensions) The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in. y : array-like, shape (n_samples, n_components) The normalized log probability log P(M|D) for each sample. This is the probability that the sample was generated from each component.
predict_proba()

Calculate the posterior P(M|D) for data.

Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.

Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.

Parameters: X : array-like, shape (n_samples, n_dimensions) The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in. probability : array-like, shape (n_samples, n_components) The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.
probability()

Return the probability of the given symbol under this distribution.

Parameters: symbol : object The symbol to calculate the probability of probability : double The probability of that point under the distribution.
sample()

Generate a sample from the model.

First, randomly select a component weighted by the prior probability, Then, use the sample method from that component to generate a sample.

Parameters: n : int, optional The number of samples to generate. Defaults to 1. sample : array-like or object A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.
summarize()

Summarize a batch of data and store sufficient statistics.

This will run the expectation step of EM and store sufficient statistics in the appropriate distribution objects. The summarization can be thought of as a chunk of the E step, and the from_summaries method as the M step.

Parameters: X : array-like, shape (n_samples, n_dimensions) This is the data to train on. Each row is a sample, and each column is a dimension to train on. weights : array-like, shape (n_samples,), optional The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None. logp : double The log probability of the data given the current model. This is used to speed up EM.
thaw()

Thaw the distribution, re-allowing updates to occur.