Probability Distributions¶

IPython Notebook Tutorial

While probability distributions are frequently used as components of more complex models such as mixtures and hidden Markov models, they can also be used by themselves. Many data science tasks require fitting a distribution to data or generating samples under a distribution. pomegranate has a large library of both univariate and multivariate distributions which can be used with an intuitive interface.

Univariate Distributions

`UniformDistribution`	A uniform distribution between two values.
`BernoulliDistribution`	A Bernoulli distribution describing the probability of a binary variable.
`NormalDistribution`	A normal distribution based on a mean and standard deviation.
`LogNormalDistribution`	Represents a lognormal distribution over non-negative floats.
`ExponentialDistribution`	Represents an exponential distribution on non-negative floats.
`PoissonDistribution`	A discrete probability distribution which expresses the probability of a number of events occurring in a fixed time window.
`BetaDistribution`	This distribution represents a beta distribution, parameterized using alpha/beta, which are both shape parameters.
`GammaDistribution`	This distribution represents a gamma distribution, parameterized in the alpha/beta (shape/rate) parameterization.
`DiscreteDistribution`	A discrete distribution, made up of characters and their probabilities, assuming that these probabilities will sum to 1.0.

Kernel Densities

`GaussianKernelDensity`	A quick way of storing points to represent a Gaussian kernel density in one dimension.
`UniformKernelDensity`	A quick way of storing points to represent a uniform kernel density in one dimension.
`TriangleKernelDensity`	A quick way of storing points to represent a triangular kernel density in one dimension.

Multivariate Distributions

`IndependentComponentsDistribution`	Allows you to create a multivariate distribution, where each distribution is independent of the others.
`MultivariateGaussianDistribution`	A multivariate Gaussian distribution, parameterized by a mean vector and a covariance matrix.
`DirichletDistribution`	A Dirichlet distribution, usually a prior for the multinomial distributions.
`ConditionalProbabilityTable`	A conditional probability table, which is dependent on values from at least one previous distribution but up to as many as you want to encode for.
`JointProbabilityTable`	A joint probability table.

Most of the distributions above are univariate, but a multivariate distribution can be built from them with `IndependentComponentsDistribution`, which assumes that each column of data is independent of the other columns (instead of being related by a covariance matrix, as in a multivariate Gaussian). Here is an example:

>>> from pomegranate import *
>>> d1 = NormalDistribution(5, 2)
>>> d2 = LogNormalDistribution(1, 0.3)
>>> d3 = ExponentialDistribution(4)
>>> d = IndependentComponentsDistribution([d1, d2, d3])

Initialization¶

A distribution is initialized by passing in its parameters. For example, the parameters of a normal distribution are the mean (mu) and the standard deviation (sigma). We can initialize it as follows:

>>> from pomegranate import *
>>> a = NormalDistribution(5, 2)

However, frequently we don’t know the parameters of the distribution beforehand or would like to directly fit this distribution to some data. We can do this through the from_samples class method.

>>> b = NormalDistribution.from_samples([3, 4, 5, 6, 7])

If we want to fit the model to weighted samples, we can just pass in an array of the relative weights of each sample as well.

>>> b = NormalDistribution.from_samples([3, 4, 5, 6, 7], weights=[0.5, 1, 1.5, 1, 0.5])

Probability¶

Distributions are typically used to calculate the probability of some sample. This can be done using either the probability or log_probability methods.

>>> a = NormalDistribution(5, 2)
>>> a.log_probability(8)
-2.737085713764219
>>> a.probability(8)
0.064758797832971712
>>> b = NormalDistribution.from_samples([3, 4, 5, 6, 7], weights=[0.5, 1, 1.5, 1, 0.5])
>>> b.log_probability(8)
-4.437779569430167
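
As a cross-check on the values above, the log density of a normal distribution can be computed by hand. The following is a minimal sketch in plain Python, not pomegranate's implementation; the helper name `normal_log_pdf` is hypothetical.

```python
import math

def normal_log_pdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

# Same parameters and sample as the NormalDistribution(5, 2) example above.
logp = normal_log_pdf(8, 5, 2)
p = math.exp(logp)
print(logp)  # agrees with a.log_probability(8) up to floating-point rounding
print(p)     # agrees with a.probability(8) up to floating-point rounding
```

Working in log space avoids underflow when many small probabilities are multiplied, which is why `log_probability` is usually the safer method to call.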

These methods work for univariate distributions, kernel densities, and multivariate distributions all the same. For a multivariate distribution you’ll have to pass in an array for the full sample.

>>> d1 = NormalDistribution(5, 2)
>>> d2 = LogNormalDistribution(1, 0.3)
>>> d3 = ExponentialDistribution(4)
>>> d = IndependentComponentsDistribution([d1, d2, d3])
>>>
>>> X = [6.2, 0.4, 0.9]
>>> d.log_probability(X)
-23.205411733352875
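
Because the components are assumed independent, the joint log probability is the sum of the per-column log probabilities. The sketch below reproduces the value above in plain Python. It assumes that `LogNormalDistribution(1, 0.3)` takes the mean and standard deviation of log(x) and that `ExponentialDistribution(4)` takes a rate; the helper functions are hypothetical and are not part of pomegranate.

```python
import math

def normal_log_pdf(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def lognormal_log_pdf(x, mu, sigma):
    # Assumption: log(x) is normal with mean mu and standard deviation sigma.
    return normal_log_pdf(math.log(x), mu, sigma) - math.log(x)

def exponential_log_pdf(x, rate):
    # Assumption: the single parameter is the rate.
    return math.log(rate) - rate * x

X = [6.2, 0.4, 0.9]

# One log density per column, summed because the columns are independent.
total = (
    normal_log_pdf(X[0], 5, 2)
    + lognormal_log_pdf(X[1], 1, 0.3)
    + exponential_log_pdf(X[2], 4)
)
print(total)  # approximately -23.2054, matching d.log_probability(X) above
```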

Fitting¶

We may wish to fit the distribution to new data, either overriding the previous parameters completely or moving the parameters to match the dataset more closely through inertia. Distributions are updated using maximum likelihood estimates (MLE). Kernel densities will either discard previous points or downweight them if inertia is used.

>>> d = NormalDistribution(5, 2)
>>> d.fit([1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1])
>>> d
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        3.9047619047619047,
        2.13596776114341
    ],
    "name" :"NormalDistribution"
}

Training can be done on weighted samples by passing an array of weights in along with the data for any of the training functions, like the following:

>>> d = NormalDistribution(5, 2)
>>> d.fit([1, 5, 7, 3, 2, 4], weights=[0.5, 0.75, 1, 1.25, 1.8, 0.33])
>>> d
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        3.538188277087034,
        1.954149818564894
    ],
    "name" :"NormalDistribution"
}
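
The fitted parameters above are consistent with the weighted maximum likelihood estimates of a normal distribution: the weighted mean and the weighted (population) standard deviation. The following plain-Python sketch illustrates that calculation; `weighted_normal_mle` is a hypothetical helper, not a pomegranate function.

```python
import math

def weighted_normal_mle(samples, weights=None):
    """Return (mean, standard deviation) maximizing the weighted normal likelihood."""
    if weights is None:
        weights = [1.0] * len(samples)  # unweighted data: every sample counts equally
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, samples)) / total
    variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, samples)) / total
    return mean, math.sqrt(variance)

# Weighted example from above.
mu_w, sigma_w = weighted_normal_mle(
    [1, 5, 7, 3, 2, 4], weights=[0.5, 0.75, 1, 1.25, 1.8, 0.33]
)

# Unweighted example from the start of this section.
mu_u, sigma_u = weighted_normal_mle(
    [1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1]
)
print(mu_w, sigma_w)
print(mu_u, sigma_u)
```

Note that the variance divides by the total weight rather than by one less than the sample count, which is what the maximum likelihood estimate prescribes.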

Training can also be done with inertia, where each updated parameter is a weighted blend of the old value and the newly estimated value. For example, d.fit([5, 7, 8], inertia=0.5) indicates a 50-50 split between the old and new values.
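
The blending rule can be sketched in a few lines, using the formula given in the `from_summaries` reference below, old_param*inertia + new_param*(1-inertia). The reference describes this as only roughly what happens, so treat the code as an illustration of the idea rather than pomegranate's exact update; `blend` is a hypothetical helper.

```python
def blend(old_param, new_param, inertia):
    """Mix an old parameter with a newly estimated one."""
    return old_param * inertia + new_param * (1.0 - inertia)

old_mu = 5.0                   # mean of NormalDistribution(5, 2)
new_mu = sum([5, 7, 8]) / 3    # maximum likelihood mean of the new data
mu = blend(old_mu, new_mu, 0.5)
print(mu)  # halfway between the old mean and the new estimate
```

An inertia of 0 returns the new estimate unchanged, and an inertia of 1 keeps the old parameter.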

API Reference¶

class pomegranate.distributions.Distribution¶

A probability distribution.

Represents a probability distribution over the defined support. This is the base class which must be subclassed to specific probability distributions. All distributions have the below methods exposed.

Parameters:	Varies on distribution.

Attributes

name	(str) The name of the type of distribution.
summaries	(list) Sufficient statistics to store the update.
frozen	(bool) Whether or not the distribution will be updated during training.
d	(int) The dimensionality of the data. Univariate distributions are all 1, while multivariate distributions are > 1.

clear_summaries()¶

Clear the summary statistics stored in the object.

Parameters:

None

Returns:

None

copy()¶

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Returns:

distribution : Distribution

A copy of the distribution with the same parameters.

from_json()¶

Read in a serialized distribution and return the appropriate object.

Parameters:

s : str

A JSON formatted string containing the file.

Returns:

model : object

A properly initialized and baked model.

from_summaries()¶

Fit the distribution to the stored sufficient statistics.

Parameters:

inertia : double, optional

The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.

Returns:

None

log_probability()¶

Return the log probability of the given symbol under this distribution.

Parameters:

symbol : double

The symbol to calculate the log probability of (overridden for DiscreteDistributions)

Returns:

logp : double

The log probability of that point under the distribution.

marginal()¶

Return the marginal of the distribution.

Parameters:

*args : optional

Arguments to pass in to specific distributions

**kwargs : optional

Keyword arguments to pass in to specific distributions

Returns:

distribution : Distribution

The marginal distribution. If this is a multivariate distribution then this method is filled in. Otherwise returns self.

plot()¶

Plot the distribution by sampling from it.

This function will plot a histogram of samples drawn from a distribution on the current open figure.

Parameters:

n : int, optional

The number of samples to draw from the distribution. Default is 1000.

**kwargs : arguments, optional

Arguments to pass to matplotlib’s histogram function.

Returns:

None

summarize()¶

Summarize a batch of data into sufficient statistics for a later update.

Parameters:

items : array-like, shape (n_samples, n_dimensions)

This is the data to train on. Each row is a sample, and each column is a dimension to train on. For univariate distributions an array is used, while for multivariate distributions a 2d matrix is used.

weights : array-like, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.

Returns:

None
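
The summarize, from_summaries, and clear_summaries methods together support fitting on data that arrives in batches. The sketch below illustrates the idea for a normal distribution in plain Python. `RunningNormal` is a hypothetical stand-in rather than pomegranate's class, and the choice of sufficient statistics, the way inertia is applied to sigma, and the clearing of the summaries after the update are all assumptions made for illustration.

```python
import math

class RunningNormal:
    """Hypothetical stand-in for a normal distribution fitted from batches."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma
        self.clear_summaries()

    def clear_summaries(self):
        # Assumed sufficient statistics: sum of weights, weighted sum of x,
        # weighted sum of x squared.
        self.summaries = [0.0, 0.0, 0.0]

    def summarize(self, items, weights=None):
        if weights is None:
            weights = [1.0] * len(items)
        for x, w in zip(items, weights):
            self.summaries[0] += w
            self.summaries[1] += w * x
            self.summaries[2] += w * x * x

    def from_summaries(self, inertia=0.0):
        w, wx, wxx = self.summaries
        if w == 0:
            return  # nothing has been summarized, keep the current parameters
        mu = wx / w
        sigma = math.sqrt(max(wxx / w - mu * mu, 0.0))
        # Assumption: inertia blends mu and sigma directly.
        self.mu = self.mu * inertia + mu * (1.0 - inertia)
        self.sigma = self.sigma * inertia + sigma * (1.0 - inertia)
        self.clear_summaries()

data = [1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1]
d = RunningNormal(5, 2)
d.summarize(data[:10])   # first batch
d.summarize(data[10:])   # second batch
d.from_summaries()       # one update from the accumulated statistics
print(d.mu, d.sigma)
```

Because only three running totals are stored, the two batches give the same estimate as fitting on all of the data at once.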

to_json()¶

Serialize the distribution to a JSON string.

Parameters:

separators : tuple, optional

The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').

indent : int, optional

The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.

Returns:

json : str

A properly formatted JSON object.
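
To show what the separators and indent parameters control, here is a small sketch using only the standard library's json module. The dictionary mirrors the fields printed for a fitted distribution earlier on this page; it is written by hand for illustration and is not produced by pomegranate.

```python
import json

# Hand-written stand-in for a serialized NormalDistribution(5, 2).
state = {
    "class": "Distribution",
    "name": "NormalDistribution",
    "parameters": [5.0, 2.0],
    "frozen": False,
}

# separators is (item separator, key separator); indent sets spaces per nesting level.
text = json.dumps(state, separators=(',', ' : '), indent=4)
print(text)

# Parsing the string recovers the same fields, the kind of round trip
# that to_json and from_json provide for distributions.
restored = json.loads(text)
```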