# General Mixture Models

IPython Notebook Tutorial

General Mixture Models (GMMs) are unsupervised probabilistic models composed of multiple distributions (commonly referred to as components) and corresponding weights. This allows you to model more complex distributions corresponding to a single underlying phenomenon. For a full tutorial on what a mixture model is and how to use one, see the above tutorial.

## Initialization

General Mixture Models can be initialized in two ways, depending on whether you know the initial parameters of the model: (1) passing in a list of pre-initialized distributions, or (2) running the from_samples class method on data. In the first case, the distributions you pass in can be either a pre-specified model that is ready to be used for prediction or the starting point for expectation-maximization. In the second case, k-means is used to initialize the distributions. The distributions passed for each component don’t have to be the same type, and if an IndependentComponentsDistribution object is passed in, then the dimensions don’t need to be modeled by the same distribution.

Here is an example of a traditional multivariate Gaussian mixture where we pass in pre-initialized distributions. We can also pass in the weight of each component, which serves as the prior probability of a sample belonging to that component when doing predictions.

>>> from pomegranate import *
>>> d1 = MultivariateGaussianDistribution([1, 6, 3], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
>>> d2 = MultivariateGaussianDistribution([2, 8, 4], [[1, 0, 0], [0, 1, 0], [0, 0, 2]])
>>> d3 = MultivariateGaussianDistribution([0, 4, 8], [[2, 0, 0], [0, 3, 0], [0, 0, 1]])
>>> model = GeneralMixtureModel([d1, d2, d3], weights=[0.25, 0.60, 0.15])


Alternatively, if we want to model each dimension differently, then we can replace the multivariate Gaussian distributions with IndependentComponentsDistribution objects.

>>> from pomegranate import *
>>> d1 = IndependentComponentsDistribution([NormalDistribution(5, 2), ExponentialDistribution(1), LogNormalDistribution(0.4, 0.1)])
>>> d2 = IndependentComponentsDistribution([NormalDistribution(3, 1), ExponentialDistribution(2), LogNormalDistribution(0.8, 0.2)])
>>> model = GeneralMixtureModel([d1, d2], weights=[0.66, 0.34])


If we do not know the parameters of our distributions beforehand and want to learn them entirely from data, then we can use the from_samples class method. This method will run k-means to initialize the components, using the returned clusters to initialize all parameters of the distributions, i.e. both mean and covariances for multivariate Gaussian distributions. Afterwards, expectation-maximization is used to refine the parameters of the model, iterating until convergence.

>>> from pomegranate import *
>>> model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=X)
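The k-means initialization described above can be illustrated with a small standalone sketch. This is not pomegranate's implementation: the helper names `kmeans_1d` and `init_components` are hypothetical, the data is one-dimensional, and the centers are seeded with the first k points (an assumption made for determinism; the library's documented default is kmeans++).

```python
import math

def kmeans_1d(xs, k, iterations=10):
    """Tiny 1-D k-means; seeds the centers with the first k points (an assumption)."""
    centers = list(xs[:k])
    labels = [0] * len(xs)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [x for x, l in zip(xs, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def init_components(xs, labels, k):
    """Turn hard clusters into (weight, mean, std) triples.

    Assumes every cluster is non-empty, which holds for the data below.
    """
    components = []
    for j in range(k):
        members = [x for x, l in zip(xs, labels) if l == j]
        mean = sum(members) / len(members)
        var = sum((x - mean) ** 2 for x in members) / len(members)
        components.append((len(members) / len(xs), mean, math.sqrt(var)))
    return components

data = [1.0, 2.0, 5.0, 7.0, 8.0]
labels = kmeans_1d(data, k=2)
components = init_components(data, labels, k=2)
```

The resulting triples would then serve as the starting point that expectation-maximization refines.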


If we want to model each dimension using a different distribution, then we can pass in a list of callables and they will be initialized using k-means as well.

>>> from pomegranate import *
>>> model = GeneralMixtureModel.from_samples([NormalDistribution, ExponentialDistribution, LogNormalDistribution], n_components=5, X=X)


## Probability

The probability of a point under the mixture is the sum, over all components, of the point's probability under each component multiplied by that component's weight, $$P(D) = \sum\limits_{i} P(D|M_{i})P(M_{i})$$, where $$M_{i}$$ is the i-th component and $$P(M_{i})$$ is its weight. The probability method returns the probability of each sample under the entire mixture, and the log_probability method returns the log of that value.
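As a worked illustration of this sum, the sketch below evaluates a two-component univariate Gaussian mixture by hand. It uses only the standard library; `normal_pdf` and `mixture_probability` are hypothetical helper names, and the parameters mirror the NormalDistribution(5, 2) and NormalDistribution(1, 1) example in the API reference with equal weights.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_probability(x, components, weights):
    """Sum of P(D|M_i) * P(M_i) over all components."""
    return sum(w * normal_pdf(x, mu, sigma)
               for (mu, sigma), w in zip(components, weights))

components = [(5.0, 2.0), (1.0, 1.0)]  # (mean, std) per component
weights = [0.5, 0.5]                   # prior probability of each component

p = mixture_probability(5.0, components, weights)
log_p = math.log(p)
```

Here `log_p` comes out at roughly -2.3046, in line with the log_probability(5) value shown in the API reference example.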

## Prediction

The common prediction tasks involve predicting which component a new point falls under. This is done using Bayes' rule $$P(M|D) = \frac{P(D|M)P(M)}{P(D)}$$ to determine the posterior probability $$P(M|D)$$ as opposed to simply the likelihood $$P(D|M)$$. Bayes' rule indicates that the prediction is not made by the likelihood function alone but by the likelihood multiplied by the prior probability that the distribution generated the sample. For example, if one distribution has 100x as many samples fall under it as another, you would naively think that there is a ~99% chance that any random point was drawn from it. Your belief would then be updated based on how well the point fits each distribution, but the proportion of points generated by each distribution is important as well.

We can get the component label assignments using model.predict(data), which will return an array of indexes corresponding to the maximally likely component. If what we want is the full matrix of $$P(M|D)$$, then we can use model.predict_proba(data), which will return a matrix with each row being a sample, each column being a component, and each cell being the probability that that component generated that sample. If we want log probabilities, we can use model.predict_log_proba(data) instead.
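The posterior computation can be sketched without the library to make the role of the prior explicit. The helpers `posterior` and `predict` below are hypothetical standalone functions, not pomegranate's methods, and the parameters again mirror the two-component example from the API reference.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior(x, components, weights):
    """P(M_i | D) for every component, via Bayes' rule."""
    # Numerator of Bayes' rule: likelihood times prior, per component.
    joint = [w * normal_pdf(x, mu, sigma)
             for (mu, sigma), w in zip(components, weights)]
    # Denominator: P(D), the mixture probability of the point.
    total = sum(joint)
    return [j / total for j in joint]

def predict(x, components, weights):
    """Index of the component with the highest posterior probability."""
    probs = posterior(x, components, weights)
    return max(range(len(probs)), key=probs.__getitem__)

components = [(5.0, 2.0), (1.0, 1.0)]
weights = [0.5, 0.5]

row = posterior(1.0, components, weights)   # one row of the P(M|D) matrix
label = predict(1.0, components, weights)   # argmax over that row
```

For the point 1.0 the row is approximately [0.0634, 0.9366], so the second component is predicted, which agrees with the predict_proba output listed in the API reference example.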

## Fitting

Training GMMs faces the classic chicken-and-egg problem that most unsupervised learning algorithms face. If we knew which component a sample belonged to, we could use maximum likelihood estimates to update the component. And if we knew the parameters of the components, we could predict which sample belonged to which component. This problem is solved using expectation-maximization, which iterates between the two until convergence. In essence, an initialization point is chosen which usually is not a very good start, but through successive iteration steps, the parameters converge to a better solution (a local optimum of the likelihood).

These models are fit using model.fit(data). A maximum number of iterations can be specified as well as a stopping threshold for the improvement ratio. See the API reference for full documentation.
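To make the alternation concrete, here is a minimal expectation-maximization sketch for a univariate Gaussian mixture. It is written from the textbook algorithm rather than taken from pomegranate; `em_fit` is a hypothetical name, the stopping rule (stop when the total log probability improves by less than `stop_threshold`) is one plausible reading of the parameter, and the small floor on the standard deviation is an added safeguard. The data and starting parameters are borrowed from the API reference example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_fit(xs, components, weights, stop_threshold=0.1, max_iterations=100):
    """Refine (mean, std) pairs and weights with EM; returns the updated copies."""
    components = [list(c) for c in components]
    weights = list(weights)
    previous = None
    for _ in range(max_iterations):
        # E step: responsibilities P(M_i | x) and the total log probability.
        resp, log_likelihood = [], 0.0
        for x in xs:
            joint = [w * normal_pdf(x, mu, s)
                     for (mu, s), w in zip(components, weights)]
            total = sum(joint)
            log_likelihood += math.log(total)
            resp.append([j / total for j in joint])
        # Stop once the improvement falls below the threshold.
        if previous is not None and log_likelihood - previous < stop_threshold:
            break
        previous = log_likelihood
        # M step: responsibility-weighted maximum likelihood estimates.
        for k in range(len(components)):
            r = [row[k] for row in resp]
            n_k = sum(r)
            mu = sum(ri * x for ri, x in zip(r, xs)) / n_k
            var = sum(ri * (x - mu) ** 2 for ri, x in zip(r, xs)) / n_k
            components[k] = [mu, max(math.sqrt(var), 1e-3)]  # floor avoids a zero std
            weights[k] = n_k / len(xs)
    return components, weights

data = [1.0, 5.0, 7.0, 8.0, 2.0]
components, weights = em_fit(data, [(5.0, 2.0), (1.0, 1.0)], [0.5, 0.5])
```

On this data the first component should drift toward the points 5, 7, and 8 and the second toward 1 and 2, which is qualitatively what the fitted distributions in the API reference example show.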

## API Reference

class pomegranate.gmm.GeneralMixtureModel

A General Mixture Model.

This mixture model can be a mixture of any distributions as long as they are all of the same dimensionality. Any object can serve as a distribution as long as it has fit(X, weights) and log_probability(X) methods, plus summarize(X, weights) and from_summaries() methods if out-of-core training is desired.

Parameters
distributions: array-like, shape (n_components,)

The components of the model as initialized distributions.

weights: array-like, optional, shape (n_components,)

The prior probabilities corresponding to each component. Does not need to sum to one, but will be normalized to sum to one internally. Defaults to None.

Examples

>>> from pomegranate import *
>>>
>>> d1 = NormalDistribution(5, 2)
>>> d2 = NormalDistribution(1, 1)
>>>
>>> clf = GeneralMixtureModel([d1, d2])
>>> clf.log_probability(5)
-2.304562194038089
>>> clf.predict_proba([[5], [7], [1]])
array([[ 0.99932952,  0.00067048],
       [ 0.99999995,  0.00000005],
       [ 0.06337894,  0.93662106]])
>>> clf.fit([[1], [5], [7], [8], [2]])
>>> clf.predict_proba([[5], [7], [1]])
array([[ 1.        ,  0.        ],
       [ 1.        ,  0.        ],
       [ 0.00004383,  0.99995617]])
>>> clf.distributions
array([ {
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        6.6571359101390755,
        1.2639830514274502
    ],
    "name" :"NormalDistribution"
},
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        1.498707696758334,
        0.4999983303277837
    ],
    "name" :"NormalDistribution"
}], dtype=object)

Attributes
distributions: array-like, shape (n_components,)

The component distribution objects.

weights: array-like, shape (n_components,)

The learned prior weight of each component.

clear_summaries()

Remove the stored sufficient statistics.

Parameters
None
Returns
None

copy()

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters
None
Returns
distribution: Distribution

A copy of the distribution with the same parameters.

fit()

Fit the model to new data using EM.

This method fits the components of the model to new data using the EM method. It will iterate until either max iterations has been reached, or the stop threshold has been passed.

Parameters
X: array-like or generator, shape (n_samples, n_dimensions)

This is the data to train on. Each row is a sample, and each column is a dimension to train on.

weights: array-like, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.

inertia: double, optional

The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
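The inertia formula quoted above amounts to a linear blend of old and new parameter values. A minimal sketch, with `blend` as a hypothetical helper name:

```python
def blend(old_param, new_param, inertia=0.0):
    """Return old_param*inertia + new_param*(1 - inertia)."""
    if not 0.0 <= inertia <= 1.0:
        raise ValueError("inertia must lie between 0 and 1")
    return old_param * inertia + new_param * (1.0 - inertia)

ignore_old = blend(2.0, 4.0, inertia=0.0)   # the new estimate wins outright
ignore_new = blend(2.0, 4.0, inertia=1.0)   # the old parameter is kept
mostly_new = blend(2.0, 4.0, inertia=0.25)  # a quarter old, three quarters new
```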

pseudocount: double, optional, positive

A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent symbols from having zero probability if they don’t happen to occur in the data. Only affects mixture models defined over discrete distributions. Default is 0.
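The effect of a pseudocount can be shown with plain additive smoothing over a discrete alphabet. This is an illustrative sketch only; `smoothed_probabilities` is a hypothetical helper, and whether the library applies the pseudocount in exactly this way is an assumption.

```python
from collections import Counter

def smoothed_probabilities(observations, alphabet, pseudocount=0.0):
    """Estimate symbol probabilities, adding pseudocount to every symbol's count."""
    counts = Counter(observations)
    total = len(observations) + pseudocount * len(alphabet)
    return {s: (counts[s] + pseudocount) / total for s in alphabet}

# "C" never occurs in the data, so without smoothing it gets probability zero.
raw = smoothed_probabilities("AAAB", "ABC", pseudocount=0.0)
smooth = smoothed_probabilities("AAAB", "ABC", pseudocount=1.0)
```

With a pseudocount of 1, the unseen symbol "C" receives probability 1/7 instead of 0, so a sample containing it no longer has zero probability under the component.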

stop_threshold: double, optional, positive

The threshold at which EM will terminate for the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Default is 0.1.

max_iterations: int, optional, positive

The maximum number of iterations to run EM for. If this limit is hit then it will terminate training, regardless of how well the model is improving per iteration. Default is 1e8.

batch_size: int or None, optional

The number of samples in a batch to summarize on. This controls the size of the set sent to summarize and so does not make the update any less exact. This is useful when training on a memory map and cannot load all the data into memory. If set to None, batch_size is 1 / n_jobs. Default is None.

batches_per_epoch: int or None, optional

The number of batches in an epoch. This is the number of batches to summarize before calling from_summaries and updating the model parameters. This allows one to do minibatch updates by updating the model parameters before seeing the full dataset. If set to None, uses the full dataset. Default is None.

lr_decay: double, optional, positive

The step size decay as a function of the number of iterations. Functionally, this sets the inertia to be (2+k)^{-lr_decay} where k is the number of iterations. This causes initial iterations to have more of an impact than later iterations, and is frequently used in minibatch learning. This value is suggested to be between 0.5 and 1. Default is 0, meaning no decay.

callbacks: list, optional

A list of callback objects that describe functionality that should be undertaken over the course of training.

return_history: bool, optional

Whether to return the history during training as well as the model.

verbose: bool, optional

Whether or not to print out improvement information over iterations. Default is False.

n_jobs: int, optional

The number of threads to use when parallelizing the job. This parameter is passed directly into joblib. Default is 1, indicating no parallelism.

Returns
self: GeneralMixtureModel

The fit mixture model.

freeze()

Freeze the distribution, preventing updates from occurring.

from_dict()

Deserialize this object from a dictionary of parameters.

from_json()

Deserialize this object from its JSON representation.

Parameters
s: str

A JSON formatted string containing the file.

Returns
model: object

A properly initialized and baked model.

from_samples()

Create a mixture model directly from the given dataset.

First, k-means will be run using the given initializations, in order to define initial clusters for the points. These clusters are used to initialize the distributions used. Then, EM is run to refine the parameters of these distributions.

A homogeneous mixture can be defined by passing in a single distribution callable as the first parameter and specifying the number of components, while a heterogeneous mixture can be defined by passing in a list of callables of the appropriate type.

Parameters
distributions: array-like, shape (n_components,) or callable

The components of the model. If array, corresponds to the initial distributions of the components. If callable, must also pass in the number of components and kmeans++ will be used to initialize them.

n_components: int

If a callable is passed into distributions then this is the number of components to initialize using the kmeans++ algorithm.

X: array-like, shape (n_samples, n_dimensions)

This is the data to train on. Each row is a sample, and each column is a dimension to train on.

weights: array-like, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.

n_init: int, optional

The number of initializations of k-means to do before choosing the best. Default is 1.

init: str, optional

The initialization algorithm to use for the initial k-means clustering. Must be one of ‘first-k’, ‘random’, ‘kmeans++’, or ‘kmeans||’. Default is ‘kmeans++’.

max_kmeans_iterations: int, optional

The maximum number of iterations to run kmeans for in the initialization step. Default is 1.

inertia: double, optional

The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.

pseudocount: double, optional, positive

A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent symbols from having zero probability if they don’t happen to occur in the data. Only affects mixture models defined over discrete distributions. Default is 0.

stop_threshold: double, optional, positive

The threshold at which EM will terminate for the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Default is 0.1.

max_iterations: int, optional, positive

The maximum number of iterations to run EM for. If this limit is hit then it will terminate training, regardless of how well the model is improving per iteration. Default is 1e8.

batch_size: int or None, optional

The number of samples in a batch to summarize on. This controls the size of the set sent to summarize and so does not make the update any less exact. This is useful when training on a memory map and cannot load all the data into memory. If set to None, batch_size is 1 / n_jobs. Default is None.

batches_per_epoch: int or None, optional

The number of batches in an epoch. This is the number of batches to summarize before calling from_summaries and updating the model parameters. This allows one to do minibatch updates by updating the model parameters before seeing the full dataset. If set to None, uses the full dataset. Default is None.

lr_decay: double, optional, positive

The step size decay as a function of the number of iterations. Functionally, this sets the inertia to be (2+k)^{-lr_decay} where k is the number of iterations. This causes initial iterations to have more of an impact than later iterations, and is frequently used in minibatch learning. This value is suggested to be between 0.5 and 1. Default is 0, meaning no decay.

callbacks: list, optional

A list of callback objects that describe functionality that should be undertaken over the course of training.

return_history: bool, optional

Whether to return the history during training as well as the model.

verbose: bool, optional

Whether or not to print out improvement information over iterations. Default is False.

n_jobs: int, optional

The number of threads to use when parallelizing the job. This parameter is passed directly into joblib. Default is 1, indicating no parallelism.

from_summaries()

Fit the model to the collected sufficient statistics.

Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.

Parameters

inertia: double, optional

The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.

pseudocount: double, optional

A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent symbols from having zero probability if they don’t happen to occur in the data. For discrete data, this smooths both the prior probabilities of each component and the emissions of each component. Otherwise, it only smooths the prior probabilities of each component. Default is 0.

Returns
None

from_yaml()

Deserialize this object from its YAML representation.

log_probability()

Calculate the log probability of a point under the distribution.

The probability of a point is the sum of its probabilities under each distribution multiplied by the corresponding weights. Thus, the log probability is the log of that sum, computed from each component's log probability plus its log prior.

This is the python interface.

Parameters
X: numpy.ndarray, shape=(n, d) or (n, m, d)

The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each observation, and X becomes an array of n (m, d)-shaped arrays.

n_jobs: int, optional

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size: int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
log_probability: double

The log probability of the point under the distribution.
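Working in log space usually means combining the per-component terms with a log-sum-exp, which avoids underflow for points far from every component. The sketch below is a generic standard-library illustration with hypothetical helper names, not the library's internal routine.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of a univariate Gaussian."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def logsumexp(values):
    """log(sum(exp(v))) computed stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_probability(x, components, weights):
    """log P(D): log-sum-exp of log P(D|M_i) + log P(M_i) over components."""
    return logsumexp([math.log(w) + log_normal_pdf(x, mu, sigma)
                      for (mu, sigma), w in zip(components, weights)])

components = [(5.0, 2.0), (1.0, 1.0)]
weights = [0.5, 0.5]

near = mixture_log_probability(5.0, components, weights)
# exp() of these terms would underflow to 0.0, yet the log value stays finite.
far = mixture_log_probability(100.0, components, weights)
```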

predict()

Predict the most likely component which generated each sample.

Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.

This is a sklearn wrapper for the maximum_a_posteriori method.

Parameters
X: array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs: int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size: int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
y: array-like, shape (n_samples,)

The predicted component which fits the sample the best.

predict_log_proba()

Calculate the posterior log P(M|D) for data.

Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the exponentiated values in each row sum to 1.

This is a sklearn wrapper for the original posterior function.

Parameters
X: array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs: int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size: int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
y: array-like, shape (n_samples, n_components)

The normalized log probability log P(M|D) for each sample. This is the probability that the sample was generated from each component.

predict_proba()

Calculate the posterior P(M|D) for data.

Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.

Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.

Parameters
X: array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs: int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size: int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
probability: array-like, shape (n_samples, n_components)

The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.

probability()

Return the probability of the given symbol under this distribution.

Parameters
symbol: object

The symbol to calculate the probability of

Returns
probability: double

The probability of that point under the distribution.

sample()

Generate a sample from the model.

First, randomly select a component weighted by the prior probability. Then, use the sample method from that component to generate a sample.

Parameters
n: int, optional

The number of samples to generate. Defaults to 1.

random_state: int, numpy.random.RandomState, or None

The random state used for generating samples. If set to None, a random seed will be used. If set to either an integer or a RandomState, will produce deterministic outputs.

Returns
sample: array-like or object

A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.
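The two-stage sampling procedure described for this method can be sketched with the standard library. `sample_mixture` is a hypothetical helper for univariate Gaussian components, and the seed handling here only loosely mirrors the random_state parameter.

```python
import random

def sample_mixture(components, weights, n=1, seed=None):
    """Draw n samples: pick a component by weight, then sample from it."""
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n):
        # Stage 1: choose a component index according to the prior weights.
        k = rng.choices(range(len(components)), weights=weights)[0]
        # Stage 2: draw from the chosen component.
        mu, sigma = components[k]
        labels.append(k)
        samples.append(rng.gauss(mu, sigma))
    return samples, labels

components = [(5.0, 2.0), (1.0, 1.0)]
weights = [0.25, 0.75]

samples, labels = sample_mixture(components, weights, n=10000, seed=0)
fraction_first = labels.count(0) / len(labels)
```

With enough draws, `fraction_first` should sit close to the first component's weight of 0.25, and reusing the same seed reproduces the same draws.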

score()

Return the accuracy of the model on a data set.

Parameters
X: numpy.ndarray, shape=(n, d)

The values of the data set

y: numpy.ndarray, shape=(n,)

The labels of each value

summarize()

Summarize a batch of data and store sufficient statistics.

This will run the expectation step of EM and store sufficient statistics in the appropriate distribution objects. The summarization can be thought of as a chunk of the E step, and the from_summaries method as the M step.

Parameters
X: array-like, shape (n_samples, n_dimensions)

This is the data to train on. Each row is a sample, and each column is a dimension to train on.

weights: array-like, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.

Returns
logp: double

The log probability of the data given the current model. This is used to speed up EM.
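The reason batched summarization can still yield an exact update is that sufficient statistics add across batches. The sketch below shows the idea for a single weighted Gaussian; `GaussianSummary` is a hypothetical standalone class whose method names echo the ones documented here, not the library's implementation.

```python
import math

class GaussianSummary:
    """Accumulates weighted sufficient statistics for a univariate Gaussian."""

    def __init__(self):
        self.clear_summaries()

    def clear_summaries(self):
        # Sum of weights, weighted sum of x, weighted sum of x squared.
        self.w_sum = 0.0
        self.wx_sum = 0.0
        self.wxx_sum = 0.0

    def summarize(self, xs, weights=None):
        """Fold one batch into the running statistics."""
        if weights is None:
            weights = [1.0] * len(xs)
        for x, w in zip(xs, weights):
            self.w_sum += w
            self.wx_sum += w * x
            self.wxx_sum += w * x * x

    def from_summaries(self):
        """Return (mean, std) from the stored statistics, then reset them."""
        mean = self.wx_sum / self.w_sum
        var = max(self.wxx_sum / self.w_sum - mean * mean, 0.0)
        self.clear_summaries()
        return mean, math.sqrt(var)

batched = GaussianSummary()
batched.summarize([1.0, 5.0])        # first batch
batched.summarize([7.0, 8.0, 2.0])   # second batch
batched_mean, batched_std = batched.from_summaries()

full = GaussianSummary()
full.summarize([1.0, 5.0, 7.0, 8.0, 2.0])  # everything at once
full_mean, full_std = full.from_summaries()
```

Both routes give the same mean and standard deviation, which is why splitting the data into batches does not make the parameter update any less exact.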

thaw()

Thaw the distribution, re-allowing updates to occur.

to_dict()

Serialize this object to a dictionary of parameters.

to_json()

Serialize the model to JSON.

Parameters
separators: tuple, optional

The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').

indent: int, optional

The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.

Returns
json: str

A properly formatted JSON object.
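The separators and indent arguments are forwarded to json.dumps, so their effect can be previewed on any dictionary. The dictionary layout below is a made-up stand-in for illustration and is not the schema the library actually emits.

```python
import json

# Hypothetical stand-in for a serialized mixture; the real schema may differ.
model = {
    "class": "GeneralMixtureModel",
    "weights": [0.5, 0.5],
    "distributions": [
        {"name": "NormalDistribution", "parameters": [5.0, 2.0]},
        {"name": "NormalDistribution", "parameters": [1.0, 1.0]},
    ],
}

# Same defaults as documented: separators=(',', ' : ') and indent=4.
text = json.dumps(model, separators=(',', ' : '), indent=4)

# Parsing the string back recovers an equal dictionary.
restored = json.loads(text)
```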

to_yaml()

Serialize the model to YAML for compactness.