General Mixture Models¶

IPython Notebook Tutorial

General Mixture models (GMMs) are an unsupervised probabilistic model composed of multiple distributions (commonly referred to as components) and corresponding weights. This allows you to model more complex distributions corresponding to a singular underlying phenomena. For a full tutorial on what a mixture model is and how to use them, see the above tutorial.

Initialization¶

General Mixture Models can be initialized in two ways depending on if you know the initial parameters of the model or not: (1) passing in a list of pre-initialized distributions, or (2) running the from_samples class method on data. The initial parameters can be either a pre-specified model that is ready to be used for prediction, or the initialization for expectation-maximization. Otherwise, if the second initialization option is chosen, then k-means is used to initialize the distributions. The distributions passed for each component don’t have to be the same type, and if an IndependentComponentDistribution object is passed in, then the dimensions don’t need to be modeled by the same distribution.

Here is an example of a traditional multivariate Gaussian mixture where we pass in pre-initialized distributions. We can also pass in the weight of each component, which serves as the prior probability of a sample belonging to that component when doing predictions.

from pomegranate import *
d1 = MultivariateGaussianDistribution([1, 6, 3], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
d2 = MultivariateGaussianDistribution([2, 8, 4], [[1, 0, 0], [0, 1, 0], [0, 0, 2]])
d3 = MultivariateGaussianDistribution([0, 4, 8], [[2, 0, 0], [0, 3, 0], [0, 0, 1]])
model = GeneralMixtureModel([d1, d2, d3], weights=[0.25, 0.60, 0.15])

Alternatively, if we want to model each dimension differently, then we can replace the multivariate Gaussian distributions with IndependentComponentsDistribution objects.

from pomegranate import *
d1 = IndependentComponentsDistributions([NormalDistribution(5, 2), ExponentialDistribution(1), LogNormalDistribution(0.4, 0.1)])
d2 = IndependentComponentsDistributions([NormalDistribution(3, 1), ExponentialDistribution(2), LogNormalDistribution(0.8, 0.2)])
model = GeneralMixtureModel([d1, d2], weights=[0.66, 0.34])

If we do not know the parameters of our distributions beforehand and want to learn them entirely from data, then we can use the from_samples class method. This method will run k-means to initialize the components, using the returned clusters to initialize all parameters of the distributions, i.e. both mean and covariances for multivariate Gaussian distributions. Afterwards, execptation-maximization is used to refine the parameters of the model, iterating until convergence.

from pomegranate import *
model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=X)

If we want to model each dimension using a different distribution, then we can pass in a list of callables and they will be initialized using k-means as well.

from pomegranate import *
model = GeneralMixtureModel.from_samples([NormalDistribution, ExponentialDistribution, LogNormalDistribution], n_components=5, X=X)

Probability¶

The probability of a point is the sum of its probability under each of the components, multiplied by the weight of each component c, \(P = \sum\limits_{i \in M} P(D|M_{i})P(M_{i})\). The probability method returns the probability of each sample under the entire mixture, and the log_probability method returns the log of that value.

Prediction¶

The common prediction tasks involve predicting which component a new point falls under. This is done using Bayes rule \(P(M|D) = \frac{P(D|M)P(M)}{P(D)}\) to determine the posterior probability \(P(M|D)\) as opposed to simply the likelihood \(P(D|M)\). Bayes rule indicates that it isn’t simply the likelihood function which makes this prediction but the likelihood function multiplied by the probability that that distribution generated the sample. For example, if you have a distribution which has 100x as many samples fall under it, you would naively think that there is a ~99% chance that any random point would be drawn from it. Your belief would then be updated based on how well the point fit each distribution, but the proportion of points generated by each sample is important as well.

We can get the component label assignments using model.predict(data), which will return an array of indexes corresponding to the maximally likely component. If what we want is the full matrix of \(P(M|D)\), then we can use model.predict_proba(data), which will return a matrix with each row being a sample, each column being a component, and each cell being the probability that that model generated that data. If we want log probabilities instead we can use model.predict_log_proba(data) instead.

Fitting¶

Training GMMs faces the classic chicken-and-egg problem that most unsupervised learning algorithms face. If we knew which component a sample belonged to, we could use MLE estimates to update the component. And if we knew the parameters of the components we could predict which sample belonged to which component. This problem is solved using expectation-maximization, which iterates between the two until convergence. In essence, an initialization point is chosen which usually is not a very good start, but through successive iteration steps, the parameters converge to a good ending.

These models are fit using model.fit(data). A maximimum number of iterations can be specified as well as a stopping threshold for the improvement ratio. See the API reference for full documentation.

API Reference¶

class pomegranate.gmm.GeneralMixtureModel¶

A General Mixture Model.

This mixture model can be a mixture of any distribution as long as they are all of the same dimensionality. Any object can serve as a distribution as long as it has fit(X, weights), log_probability(X), and summarize(X, weights)/from_summaries() methods if out of core training is desired.

Parameters:	distributions : array-like, shape (n_components,) The components of the model as initialized distributions. weights : array-like, optional, shape (n_components,) The prior probabilities corresponding to each component. Does not need to sum to one, but will be normalized to sum to one internally. Defaults to None.

Examples

>>> from pomegranate import *
>>>
>>> d1 = NormalDistribution(5, 2)
>>> d2 = NormalDistribution(1, 1)
>>>
>>> clf = GeneralMixtureModel([d1, d2])
>>> clf.log_probability(5)
-2.304562194038089
>>> clf.predict_proba([[5], [7], [1]])
array([[ 0.99932952,  0.00067048],
       [ 0.99999995,  0.00000005],
       [ 0.06337894,  0.93662106]])
>>> clf.fit([[1], [5], [7], [8], [2]])
>>> clf.predict_proba([[5], [7], [1]])
array([[ 1.        ,  0.        ],
       [ 1.        ,  0.        ],
       [ 0.00004383,  0.99995617]])
>>> clf.distributions
array([ {
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        6.6571359101390755,
        1.2639830514274502
    ],
    "name" :"NormalDistribution"
},
       {
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        1.498707696758334,
        0.4999983303277837
    ],
    "name" :"NormalDistribution"
}], dtype=object)

Attributes:	distributions : array-like, shape (n_components,) The component distribution objects. weights : array-like, shape (n_components,) The learned prior weight of each object

clear_summaries()¶

Remove the stored sufficient statistics.

Parameters:	None
Returns:	None

copy()¶

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters:	None
Returns:	distribution : Distribution A copy of the distribution with the same parameters.

fit()¶

Fit the model to new data using EM.

This method fits the components of the model to new data using the EM method. It will iterate until either max iterations has been reached, or the stop threshold has been passed.

Parameters:

X : array-like, shape (n_samples, n_dimensions)

This is the data to train on. Each row is a sample, and each column is a dimension to train on.

weights : array-like, shape (n_samples,), optional: The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
inertia : double, optional: The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.

pseudocount : double, optional, positive

A pseudocount to add to the emission of each distribution. This

effectively smoothes the states to prevent 0. probability symbols

if they don’t happen to occur in the data. Only effects mixture

models defined over discrete distributions. Default is 0.

stop_threshold : double, optional, positive: The threshold at which EM will terminate for the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Default is 0.1.
max_iterations : int, optional, positive: The maximum number of iterations to run EM for. If this limit is hit then it will terminate training, regardless of how well the model is improving per iteration. Default is 1e8.
batch_size : int or None, optional: The number of samples in a batch to summarize on. This controls the size of the set sent to summarize and so does not make the update any less exact. This is useful when training on a memory map and cannot load all the data into memory. If set to None, batch_size is 1 / n_jobs. Default is None.
batches_per_epoch : int or None, optional: The number of batches in an epoch. This is the number of batches to summarize before calling from_summaries and updating the model parameters. This allows one to do minibatch updates by updating the model parameters before setting the full dataset. If set to None, uses the full dataset. Default is None.
lr_decay : double, optional, positive: The step size decay as a function of the number of iterations. Functionally, this sets the inertia to be (2+k)^{-lr_decay} where k is the number of iterations. This causes initial iterations to have more of an impact than later iterations, and is frequently used in minibatch learning. This value is suggested to be between 0.5 and 1. Default is 0, meaning no decay.
verbose : bool, optional: Whether or not to print out improvement information over iterations. Default is False.
n_jobs : int, optional: The number of threads to use when parallelizing the job. This parameter is passed directly into joblib. Default is 1, indicating no parallelism.

Returns:

self : GeneralMixtureModel: The fit mixture model.

freeze()¶: Freeze the distribution, preventing updates from occuring.

from_samples()¶

Create a mixture model directly from the given dataset.

First, k-means will be run using the given initializations, in order to define initial clusters for the points. These clusters are used to initialize the distributions used. Then, EM is run to refine the parameters of these distributions.

A homogenous mixture can be defined by passing in a single distribution callable as the first parameter and specifying the number of components, while a heterogeneous mixture can be defined by passing in a list of callables of the appropriate type.

Parameters:

distributions : array-like, shape (n_components,) or callable

The components of the model. If array, corresponds to the initial distributions of the components. If callable, must also pass in the number of components and kmeans++ will be used to initialize them.

n_components : int: If a callable is passed into distributions then this is the number of components to initialize using the kmeans++ algorithm.
X : array-like, shape (n_samples, n_dimensions): This is the data to train on. Each row is a sample, and each column is a dimension to train on.
weights : array-like, shape (n_samples,), optional: The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
n_init : int, optional: The number of initializations of k-means to do before choosing the best. Default is 1.
init : str, optional: The initialization algorithm to use for the initial k-means clustering. Must be one of ‘first-k’, ‘random’, ‘kmeans++’, or ‘kmeans||’. Default is ‘kmeans++’.
max_kmeans_iterations : int, optional: The maximum number of iterations to run kmeans for in the initialization step. Default is 1.
inertia : double, optional: The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.

pseudocount : double, optional, positive

A pseudocount to add to the emission of each distribution. This

effectively smoothes the states to prevent 0. probability symbols

if they don’t happen to occur in the data. Only effects mixture

models defined over discrete distributions. Default is 0.

stop_threshold : double, optional, positive: The threshold at which EM will terminate for the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Default is 0.1.
max_iterations : int, optional, positive: The maximum number of iterations to run EM for. If this limit is hit then it will terminate training, regardless of how well the model is improving per iteration. Default is 1e8.
batch_size : int or None, optional: The number of samples in a batch to summarize on. This controls the size of the set sent to summarize and so does not make the update any less exact. This is useful when training on a memory map and cannot load all the data into memory. If set to None, batch_size is 1 / n_jobs. Default is None.
batches_per_epoch : int or None, optional: The number of batches in an epoch. This is the number of batches to summarize before calling from_summaries and updating the model parameters. This allows one to do minibatch updates by updating the model parameters before setting the full dataset. If set to None, uses the full dataset. Default is None.
lr_decay : double, optional, positive: The step size decay as a function of the number of iterations. Functionally, this sets the inertia to be (2+k)^{-lr_decay} where k is the number of iterations. This causes initial iterations to have more of an impact than later iterations, and is frequently used in minibatch learning. This value is suggested to be between 0.5 and 1. Default is 0, meaning no decay.
verbose : bool, optional: Whether or not to print out improvement information over iterations. Default is False.
n_jobs : int, optional: The number of threads to use when parallelizing the job. This parameter is passed directly into joblib. Default is 1, indicating no parallelism.

from_summaries()¶

Fit the model to the collected sufficient statistics.

Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.

Parameters:

inertia : double, optional: The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
pseudocount : double, optional: A pseudocount to add to the emission of each distribution. This effectively smoothes the states to prevent 0. probability symbols if they don’t happen to occur in the data. If discrete data, will smooth both the prior probabilities of each component and the emissions of each component. Otherwise, will only smooth the prior probabilities of each component. Default is 0.

Returns:

None

log_probability()¶

Calculate the log probability of a point under the distribution.

The probability of a point is the sum of the probabilities of each distribution multiplied by the weights. Thus, the log probability is the sum of the log probability plus the log prior.

This is the python interface.

Parameters:

X : numpy.ndarray, shape=(n, d) or (n, m, d): The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each obervation, and X becomes an array of n (m, d)-shaped arrays.
n_jobs : int: The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns:

log_probability : double: The log probabiltiy of the point under the distribution.

predict()¶

Predict the most likely component which generated each sample.

Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.

This is a sklearn wrapper for the maximum_a_posteriori method.

Parameters:

X : array-like, shape (n_samples, n_dimensions): The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
n_jobs : int: The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns:

y : array-like, shape (n_samples,): The predicted component which fits the sample the best.

predict_log_proba()¶

Calculate the posterior log P(M|D) for data.

Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the probabilities should sum to 1

This is a sklearn wrapper for the original posterior function.

Parameters:

X : array-like, shape (n_samples, n_dimensions): The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
n_jobs : int: The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns:

y : array-like, shape (n_samples, n_components): The normalized log probability log P(M|D) for each sample. This is the probability that the sample was generated from each component.

predict_proba()¶

Calculate the posterior P(M|D) for data.

Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.

Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.

Parameters:

X : array-like, shape (n_samples, n_dimensions): The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
n_jobs : int: The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns:

probability : array-like, shape (n_samples, n_components): The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.

probability()¶

Return the probability of the given symbol under this distribution.

Parameters:	symbol : object The symbol to calculate the probability of
Returns:	probability : double The probability of that point under the distribution.

sample()¶

Generate a sample from the model.

First, randomly select a component weighted by the prior probability, Then, use the sample method from that component to generate a sample.

Parameters:	n : int, optional The number of samples to generate. Defaults to 1.
Returns:	sample : array-like or object A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.

summarize()¶

Summarize a batch of data and store sufficient statistics.

This will run the expectation step of EM and store sufficient statistics in the appropriate distribution objects. The summarization can be thought of as a chunk of the E step, and the from_summaries method as the M step.

Parameters:	X : array-like, shape (n_samples, n_dimensions) This is the data to train on. Each row is a sample, and each column is a dimension to train on. weights : array-like, shape (n_samples,), optional The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
Returns:	logp : double The log probability of the data given the current model. This is used to speed up EM.

thaw()¶: Thaw the distribution, re-allowing updates to occur.