Probability Distributions

IPython Notebook Tutorial

While probability distributions are frequently used as components of more complex models such as mixtures and hidden Markov models, they can also be used by themselves. Many data science tasks require fitting a distribution to data or generating samples under a distribution. pomegranate has a large library of both univariate and multivariate distributions which can be used with an intuitive interface.

Univariate Distributions

UniformDistribution A uniform distribution between two values.
BernoulliDistribution A Bernoulli distribution describing the probability of a binary variable.
NormalDistribution A normal distribution based on a mean and standard deviation.
LogNormalDistribution Represents a lognormal distribution over non-negative floats.
ExponentialDistribution Represents an exponential distribution on non-negative floats.
PoissonDistribution A discrete probability distribution which expresses the probability of a number of events occurring in a fixed time window.
BetaDistribution This distribution represents a beta distribution, parameterized using alpha/beta, which are both shape parameters.
GammaDistribution This distribution represents a gamma distribution, parameterized in the alpha/beta (shape/rate) parameterization.
DiscreteDistribution A discrete distribution, made up of characters and their probabilities, assuming that these probabilities will sum to 1.0.

Kernel Densities

GaussianKernelDensity A quick way of storing points to represent a Gaussian kernel density in one dimension.
UniformKernelDensity A quick way of storing points to represent a uniform kernel density in one dimension.
TriangleKernelDensity A quick way of storing points to represent a triangle kernel density in one dimension.
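As a rough illustration of what a one-dimensional kernel density stores and computes, the sketch below averages one kernel per stored point. It is plain Python rather than pomegranate's implementation, and the function names and the `bandwidth` parameter are assumptions made for the example.

```python
import math

def gaussian_kernel(u):
    # Standard normal density.
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def uniform_kernel(u):
    # Constant density on [-1, 1], zero elsewhere.
    return 0.5 if abs(u) <= 1 else 0.0

def triangle_kernel(u):
    # Linearly decreasing density on [-1, 1], zero elsewhere.
    return max(0.0, 1.0 - abs(u))

def kernel_density(x, points, kernel, bandwidth=1.0):
    # Average of one kernel per stored point, each scaled by the bandwidth.
    total = sum(kernel((x - p) / bandwidth) for p in points)
    return total / (len(points) * bandwidth)

points = [1.0, 2.0, 4.0]
for kernel in (gaussian_kernel, uniform_kernel, triangle_kernel):
    print(kernel.__name__, kernel_density(2.5, points, kernel))
```

Only the stored points and a bandwidth are needed to evaluate the density, which is why such an object can be described as a way of storing points.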

Multivariate Distributions

IndependentComponentsDistribution Allows you to create a multivariate distribution, where each distribution is independent of the others.
MultivariateGaussianDistribution A multivariate Gaussian distribution with a full covariance matrix.
DirichletDistribution A Dirichlet distribution, usually a prior for the multinomial distributions.
ConditionalProbabilityTable A conditional probability table, which is dependent on values from at least one previous distribution but up to as many as you want to encode for.
JointProbabilityTable A joint probability table.

While there is a large variety of univariate distributions, multivariate distributions can also be built from them by using `IndependentComponentsDistribution`, which assumes that each column of data is independent of the other columns (instead of being related by a covariance matrix, as in multivariate Gaussians). Here is an example:

d1 = NormalDistribution(5, 2)
d2 = LogNormalDistribution(1, 0.3)
d3 = ExponentialDistribution(4)
d = IndependentComponentsDistribution([d1, d2, d3])

Use MultivariateGaussianDistribution when you want a full covariance matrix over the feature vector. When you want a strictly diagonal covariance matrix (i.e., no correlation between features, or “independent” features), use IndependentComponentsDistribution with a NormalDistribution for each feature. There is no implementation of spherical or other restricted covariance structures.
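To make the relationship concrete, the sketch below checks numerically that a Gaussian with a diagonal covariance matrix has the same log density as a sum of independent univariate normal log densities. It uses only the standard library rather than pomegranate, and the helper names and example parameters are invented for the illustration.

```python
import math

def normal_logpdf(x, mu, sigma):
    # Log density of a univariate normal with mean mu and standard deviation sigma.
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def diagonal_gaussian_logpdf(x, mus, variances):
    # Log density of a multivariate Gaussian whose covariance matrix is diagonal:
    # the determinant is the product of the variances and the quadratic form
    # reduces to a sum of scaled squared differences.
    k = len(x)
    log_det = sum(math.log(v) for v in variances)
    quad = sum((xi - m) ** 2 / v for xi, m, v in zip(x, mus, variances))
    return -0.5 * (k * math.log(2 * math.pi) + log_det + quad)

x = [6.2, 0.4, 0.9]
mus = [5.0, 1.0, 0.0]
sigmas = [2.0, 0.3, 4.0]

independent = sum(normal_logpdf(xi, m, s) for xi, m, s in zip(x, mus, sigmas))
joint = diagonal_gaussian_logpdf(x, mus, [s ** 2 for s in sigmas])
print(independent, joint)  # the two values agree up to floating-point error
```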

Initialization

Initializing a distribution is simple and done just by passing in the distribution parameters. For example, the parameters of a normal distribution are the mean (mu) and the standard deviation (sigma). We can initialize it as follows:

from pomegranate import *
a = NormalDistribution(5, 2)

However, frequently we don’t know the parameters of the distribution beforehand or would like to directly fit this distribution to some data. We can do this through the from_samples class method.

b = NormalDistribution.from_samples([3, 4, 5, 6, 7])

If we want to fit the model to weighted samples, we can just pass in an array of the relative weights of each sample as well.

b = NormalDistribution.from_samples([3, 4, 5, 6, 7], weights=[0.5, 1, 1.5, 1, 0.5])

Probability

Distributions are typically used to calculate the probability of some sample. This can be done using either the probability or log_probability methods.

a = NormalDistribution(5, 2)
a.log_probability(8)   # -2.737085713764219
a.probability(8)       # 0.064758797832971712
b = NormalDistribution.from_samples([3, 4, 5, 6, 7], weights=[0.5, 1, 1.5, 1, 0.5])
b.log_probability(8)   # -4.437779569430167
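The first two values can be reproduced by hand from the normal density formula. The following is a minimal standard-library sketch, not pomegranate code; the function name is invented for the example.

```python
import math

def normal_log_probability(x, mu, sigma):
    # log N(x | mu, sigma) = -0.5*log(2*pi) - log(sigma) - (x - mu)^2 / (2*sigma^2)
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

log_p = normal_log_probability(8, 5, 2)
print(log_p)            # approximately -2.7370857
print(math.exp(log_p))  # approximately 0.0647588
```

Exponentiating the log probability recovers the probability, so the two methods carry the same information.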

These methods work for univariate distributions, kernel densities, and multivariate distributions all the same. For a multivariate distribution you’ll have to pass in an array for the full sample.

d1 = NormalDistribution(5, 2)
d2 = LogNormalDistribution(1, 0.3)
d3 = ExponentialDistribution(4)
d = IndependentComponentsDistribution([d1, d2, d3])

X = [6.2, 0.4, 0.9]
d.log_probability(X)   # -23.205411733352875
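Because the components are independent, this value is the sum of three univariate log densities. The sketch below recomputes it with the standard library. It assumes that LogNormalDistribution(1, 0.3) takes the mean and standard deviation of log(x) and that ExponentialDistribution(4) takes a rate; both assumptions are consistent with the output shown above. The helper names are invented for the example.

```python
import math

def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def lognormal_logpdf(x, mu, sigma):
    # Density of x when log(x) is normal with mean mu and standard deviation sigma.
    return normal_logpdf(math.log(x), mu, sigma) - math.log(x)

def exponential_logpdf(x, rate):
    return math.log(rate) - rate * x

X = [6.2, 0.4, 0.9]
total = (normal_logpdf(X[0], 5, 2)
         + lognormal_logpdf(X[1], 1, 0.3)
         + exponential_logpdf(X[2], 4))
print(total)  # approximately -23.2054117
```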

Fitting

We may wish to fit the distribution to new data, either overriding the previous parameters completely or moving the parameters to match the dataset more closely through inertia. Distributions are updated using maximum likelihood estimates (MLE). Kernel densities will either discard previous points or downweight them if inertia is used.

d = NormalDistribution(5, 2)
d.fit([1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1])
d
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        3.9047619047619047,
        2.13596776114341
    ],
    "name" :"NormalDistribution"
}
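The fitted parameters above are the sample mean and the population standard deviation (dividing by n rather than n - 1), which is what a maximum likelihood fit of a normal distribution gives. A quick check with the standard library, independent of pomegranate:

```python
import statistics

data = [1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)  # population standard deviation: divides by n
print(mu, sigma)                 # approximately 3.9047619 and 2.1359678
```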

Training can be done on weighted samples by passing an array of weights in along with the data for any of the training functions, like the following:

d = NormalDistribution(5, 2)
d.fit([1, 5, 7, 3, 2, 4], weights=[0.5, 0.75, 1, 1.25, 1.8, 0.33])
d
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        3.538188277087034,
        1.954149818564894
    ],
    "name" :"NormalDistribution"
}
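The weighted result can be checked the same way. The sketch below computes a weighted mean and a weighted population standard deviation in plain Python. The function name is invented for the example, and it illustrates the estimate rather than pomegranate's internal code.

```python
import math

def weighted_normal_mle(items, weights):
    # Weighted mean and weighted (population) standard deviation.
    total = sum(weights)
    mu = sum(w * x for x, w in zip(items, weights)) / total
    variance = sum(w * (x - mu) ** 2 for x, w in zip(items, weights)) / total
    return mu, math.sqrt(variance)

mu, sigma = weighted_normal_mle([1, 5, 7, 3, 2, 4], [0.5, 0.75, 1, 1.25, 1.8, 0.33])
print(mu, sigma)  # approximately 3.5381883 and 1.9541498
```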

Training can also be done with inertia, where the updated parameter is a weighted combination of the old value and the newly estimated value. For example, d.fit([5, 7, 8], inertia=0.5) indicates a 50-50 split between old and new values.
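A minimal sketch of such an update for a normal distribution follows. It assumes the inertia value is the weight given to the old parameters, so 0.5 yields the 50-50 split described above; the helper name is invented, and this is not pomegranate's own code.

```python
import statistics

def blend_with_inertia(old, new, inertia):
    # inertia=0 keeps only the new estimate; inertia=1 keeps only the old one.
    return [inertia * o + (1 - inertia) * n for o, n in zip(old, new)]

old_params = [5.0, 2.0]  # current mean and standard deviation
data = [5, 7, 8]
new_params = [statistics.fmean(data), statistics.pstdev(data)]

blended = blend_with_inertia(old_params, new_params, inertia=0.5)
print(blended)
```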

API Reference

For detailed documentation and examples, see the README.

class pomegranate.distributions.BernoulliDistribution

A Bernoulli distribution describing the probability of a binary variable.

from_summaries()

Update the parameters of the distribution from the summaries.

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

class pomegranate.distributions.BetaDistribution

This distribution represents a beta distribution, parameterized using alpha/beta, which are both shape parameters. ML estimation is done

clear_summaries()

Clear the summary statistics stored in the object.

from_summaries()

Use the summaries in order to update the distribution.

sample()

Return a random sample from the beta distribution.

class pomegranate.distributions.ConditionalProbabilityTable

A conditional probability table, which is dependent on values from at least one previous distribution but up to as many as you want to encode for.

bake()

Order the inputs according to some external global ordering.

clear_summaries()

Clear the summary statistics stored in the object.

fit()

Update the parameters of the table based on the data.

from_samples()

Learn the table from data.

from_summaries()

Update the parameters of the distribution using sufficient statistics.

joint()

Turn a conditional probability table into a joint probability table by scaling the conditional probabilities by the probabilities of the parent distributions. If the table is already a joint probability table, calling this will likely corrupt the data.

keys()

Return the keys of the child variable, that is, the variable in this table which has parents.

log_probability()

Return the log probability of a value, which is a tuple in proper ordering, like the training data.

marginal()

Calculate the marginal of the CPT. This involves normalizing to turn it into a joint probability table, and then summing over the desired value.
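The two steps, scaling by the parent probabilities to obtain a joint table and then summing over the unwanted variable, can be sketched with plain dictionaries. The table layout and variable names below are assumptions made for the illustration and do not reflect how pomegranate stores a table.

```python
# P(parent) and P(child | parent) for two binary variables.
parent = {'A': 0.3, 'B': 0.7}
conditional = {
    ('A', 'x'): 0.9, ('A', 'y'): 0.1,
    ('B', 'x'): 0.2, ('B', 'y'): 0.8,
}

# Joint: scale each conditional probability by the probability of its parent value.
joint = {(p, c): parent[p] * prob for (p, c), prob in conditional.items()}

# Marginal of the child: sum the joint over the parent values.
marginal = {}
for (p, c), prob in joint.items():
    marginal[c] = marginal.get(c, 0.0) + prob

print(joint)
print(marginal)
```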

sample()

Return a random sample from the conditional probability table.

summarize()

Summarize the data into sufficient statistics to store.

to_json()

Serialize the model to a JSON.

Parameters:
separators : tuple, optional

The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').

indent : int, optional

The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.

Returns:
json : str

A properly formatted JSON object.
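The effect of these two parameters can be seen by calling json.dumps directly. The dictionary below is a hypothetical stand-in for a serialized distribution, loosely modeled on the output shown in the Fitting section, and is not produced by pomegranate.

```python
import json

# Hypothetical state dictionary used only to show the formatting options.
state = {
    "class": "Distribution",
    "name": "NormalDistribution",
    "parameters": [5.0, 2.0],
    "frozen": False,
}

text = json.dumps(state, separators=(',', ' : '), indent=4)
print(text)

# The string round-trips back to the same dictionary.
restored = json.loads(text)
```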

class pomegranate.distributions.DirichletDistribution

A Dirichlet distribution, usually a prior for the multinomial distributions.

clear_summaries()

Clear the summary statistics stored in the object.

Parameters:
None

Returns:
None

fit()

Set the parameters of this Distribution to maximize the likelihood of the given sample. Items holds some sort of sequence. If weights is specified, it holds a sequence of values to weight each item by.

from_samples()

Fit a distribution to some data without pre-specifying it.

from_summaries()

Update the internal parameters of the distribution.

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

class pomegranate.distributions.DiscreteDistribution

A discrete distribution, made up of characters and their probabilities, assuming that these probabilities will sum to 1.0.

bake()

Encode the distribution into integers.

clamp()

Return a distribution clamped to a particular value.

clear_summaries()

Clear the summary statistics stored in the object.

equals()

Return whether the keys and values are equal.

fit()

Set the parameters of this Distribution to maximize the likelihood of the given sample. Items holds some sort of sequence. If weights is specified, it holds a sequence of values to weight each item by.

from_samples()

Fit a distribution to some data without pre-specifying it.

from_summaries()

Use the summaries in order to update the distribution.

items()

Return items of the underlying dictionary.

keys()

Return the keys of the underlying dictionary.

log_probability()

Return the log probability of X under this distribution.

mle()

Return the maximally likely key.
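For intuition, a discrete distribution fitted to data is just the relative frequency of each key, and the most likely key is the one with the largest probability. The sketch below shows this with the standard library; the function name is invented, and it does not use pomegranate.

```python
from collections import Counter

def discrete_from_samples(items):
    # Maximum likelihood estimate: relative frequency of each key.
    counts = Counter(items)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

dist = discrete_from_samples("ACGTAACA")
most_likely = max(dist, key=dist.get)
print(dist, most_likely)
```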

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

summarize()

Reduce a set of observations to sufficient statistics.

to_json()

Serialize the distribution to a JSON.

Parameters:
separators : tuple, optional

The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').

indent : int, optional

The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.

Returns:
json : str

A properly formatted JSON object.

values()

Return values of the underlying dictionary.

class pomegranate.distributions.ExponentialDistribution

Represents an exponential distribution on non-negative floats.

clear_summaries()

Clear the summary statistics stored in the object.

from_summaries()

Takes in a series of summaries, represented as a mean, a variance, and a weight, and updates the underlying distribution. Notes on how to do this for a Gaussian distribution were taken from here: http://math.stackexchange.com/questions/453113/how-to-merge-two-gaussians

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

class pomegranate.distributions.GammaDistribution

This distribution represents a gamma distribution, parameterized in the alpha/beta (shape/rate) parameterization. ML estimation for a gamma distribution, taking into account weights on the data, is nontrivial, and I was unable to find a good theoretical source for how to do it, so I have cobbled together a solution here from less-reputable sources.

clear_summaries()

Clear the summary statistics stored in the object.

fit()

Set the parameters of this Distribution to maximize the likelihood of the given sample. Items holds some sort of sequence. If weights is specified, it holds a sequence of values to weight each item by. In the Gamma case, likelihood maximization is necessarily numerical, and the extension to weighted values is not trivially obvious. The algorithm used here includes a Newton-Raphson step for shape parameter estimation, and analytical calculation of the rate parameter. The extension to weights is constructed using vital information found way down at the bottom of an Experts Exchange page. Newton-Raphson continues until the change in the parameter is less than epsilon, or until iteration_limit is reached.

See:
http://en.wikipedia.org/wiki/Gamma_distribution
http://www.experts-exchange.com/Other/Math_Science/Q_23943764.html
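The general technique can be sketched as follows. This is not pomegranate's code: it assumes that weights enter only through weighted means of x and log(x), it uses a common closed-form starting value for the shape, and it approximates the digamma and trigamma functions by hand so that the example needs only the standard library. The names fit_gamma, epsilon and iteration_limit mirror the description above but are otherwise hypothetical.

```python
import math

def digamma(x):
    # Shift x above 6 with the recurrence, then use an asymptotic series.
    result = 0.0
    while x < 6:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    # Derivative of digamma, computed with the same shift-then-series approach.
    result = 0.0
    while x < 6:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + 1.0 / x + 0.5 * inv2 + (inv2 / x) * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def fit_gamma(items, weights=None, epsilon=1e-9, iteration_limit=1000):
    if weights is None:
        weights = [1.0] * len(items)
    total = sum(weights)
    mean = sum(w * x for x, w in zip(items, weights)) / total
    mean_log = sum(w * math.log(x) for x, w in zip(items, weights)) / total
    s = math.log(mean) - mean_log  # positive unless all items are equal

    # Closed-form starting point, then Newton-Raphson on
    # f(shape) = log(shape) - digamma(shape) - s.
    shape = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    for _ in range(iteration_limit):
        step = (math.log(shape) - digamma(shape) - s) / (1.0 / shape - trigamma(shape))
        shape -= step
        if abs(step) < epsilon:
            break

    rate = shape / mean  # analytical once the shape is known
    return shape, rate

items = [1.2, 0.8, 2.5, 3.1, 0.6, 1.9]
weights = [1, 2, 1, 0.5, 1, 1.5]
shape, rate = fit_gamma(items, weights)
print(shape, rate)
```

The rate follows directly from the shape because the mean of a gamma distribution in the shape/rate parameterization is shape divided by rate.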

from_summaries()

Set the parameters of this Distribution to maximize the likelihood of the given sample given the summaries which have been stored. In the Gamma case, likelihood maximization is necessarily numerical, and the extension to weighted values is not trivially obvious. The algorithm used here includes a Newton-Raphson step for shape parameter estimation, and analytical calculation of the rate parameter. The extension to weights is constructed using vital information found way down at the bottom of an Experts Exchange page. Newton-Raphson continues until the change in the parameter is less than epsilon, or until iteration_limit is reached.

See:
http://en.wikipedia.org/wiki/Gamma_distribution
http://www.experts-exchange.com/Other/Math_Science/Q_23943764.html

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

summarize()

Take in a series of items and their weights and reduce it down to a summary statistic to be used in training later.

class pomegranate.distributions.IndependentComponentsDistribution

Allows you to create a multivariate distribution, where each distribution is independent of the others. Distributions can be any type, such as having an exponential represent the duration of an event, and a normal represent the mean of that event. Observations must now be tuples of a length equal to the number of distributions passed in.

s1 = IndependentComponentsDistribution([ExponentialDistribution(0.1),
NormalDistribution(5, 2)])

s1.log_probability((5, 2))

clear_summaries()

Clear the summary statistics stored in the object.

fit()

Set the parameters of this Distribution to maximize the likelihood of the given sample. Items holds some sort of sequence. If weights is specified, it holds a sequence of values to weight each item by.

from_samples()

Create a new independent components distribution from data.

from_summaries()

Use the collected summary statistics in order to update the distributions.

log_probability()

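Return the log probability of a given tuple under this distribution. Because the components are independent, the probability of the tuple is the product of the probabilities of each X in the tuple under its respective distribution, so the log probability is the sum of the individual log probabilities.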

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

summarize()

Take in an array of items and reduce it down to summary statistics. For a multivariate distribution, this involves just passing the appropriate data down to the appropriate distributions.

to_json()

Convert the distribution to JSON format.

class pomegranate.distributions.JointProbabilityTable

A joint probability table. The primary difference between this and the conditional table is that the final column sums to one here. The joint table can be thought of as the conditional probability table normalized by the marginals of each parent.

bake()

Order the inputs according to some external global ordering.

clear_summaries()

Clear the summary statistics stored in the object.

fit()

Update the parameters of the table based on the data.

from_samples()

Learn the table from data.

from_summaries()

Update the parameters of the distribution using sufficient statistics.

log_probability()

Return the log probability of a value, which is a tuple in proper ordering, like the training data.

marginal()

Determine the marginal of this table with respect to the index of one variable. The parents are index 0..n-1 for n parents, and the final variable is either the appropriate value or -1. For example:

table =
A B C p(C)
… data …

table.marginal(0) gives the marginal wrt A
table.marginal(1) gives the marginal wrt B
table.marginal(2) gives the marginal wrt C
table.marginal(-1) gives the marginal wrt C

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

summarize()

Summarize the data into sufficient statistics to store.

to_json()

Serialize the model to a JSON.

Parameters:
separators : tuple, optional

The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').

indent : int, optional

The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.

Returns:
json : str

A properly formatted JSON object.

class pomegranate.distributions.LogNormalDistribution

Represents a lognormal distribution over non-negative floats.

clear_summaries()

Clear the summary statistics stored in the object.

from_summaries()

Takes in a series of summaries, represented as a mean, a variance, and a weight, and updates the underlying distribution. Notes on how to do this for a Gaussian distribution were taken from here: http://math.stackexchange.com/questions/453113/how-to-merge-two-gaussians

sample()

Return a sample from this distribution.

class pomegranate.distributions.MultivariateGaussianDistribution

clear_summaries()

Clear the summary statistics stored in the object.

from_samples()

Fit a distribution to some data without pre-specifying it.

from_summaries()

Set the parameters of this Distribution to maximize the likelihood of the given sample. Items holds some sort of sequence. If weights is specified, it holds a sequence of values to weight each item by.

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

class pomegranate.distributions.NormalDistribution

A normal distribution based on a mean and standard deviation.

clear_summaries()

Clear the summary statistics stored in the object.

from_summaries()

Takes in a series of summaries, represented as a mean, a variance, and a weight, and updates the underlying distribution. Notes on how to do this for a Gaussian distribution were taken from here: http://math.stackexchange.com/questions/453113/how-to-merge-two-gaussians
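As a sketch of how (mean, variance, weight) summaries from separate batches can be combined, the code below merges them through their weighted first and second moments and checks the result against a direct computation on the pooled data. The function names are invented for the example, and this illustrates the idea rather than reproducing pomegranate's implementation.

```python
def summarize(batch):
    # Reduce a batch to (mean, population variance, weight).
    n = len(batch)
    m = sum(batch) / n
    v = sum((x - m) ** 2 for x in batch) / n
    return m, v, n

def merge_gaussian_summaries(summaries):
    # Each summary is (mean, variance, weight) for one batch of data.
    total = sum(w for _, _, w in summaries)
    mean = sum(w * m for m, _, w in summaries) / total
    # Combined second moment minus the square of the combined mean.
    second_moment = sum(w * (v + m * m) for m, v, w in summaries) / total
    variance = second_moment - mean * mean
    return mean, variance, total

first = [1, 5, 7, 3, 2]
second = [4, 3, 5, 7, 8, 2]
merged = merge_gaussian_summaries([summarize(first), summarize(second)])
direct = summarize(first + second)
print(merged, direct)  # the two tuples agree up to floating-point error
```

Working from summaries means the batches never need to be held in memory at the same time.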

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

class pomegranate.distributions.PoissonDistribution

A discrete probability distribution which expresses the probability of a number of events occurring in a fixed time window. It assumes these events occur at a known rate and independently of each other.

clear_summaries()

Clear the summary statistics stored in the object.

from_summaries()

Update the parameters of the distribution from the stored summaries.

sample()

Return a random item sampled from this distribution.

Parameters:
n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:
sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

class pomegranate.distributions.UniformDistribution

A uniform distribution between two values.

clear_summaries()

Clear the summary statistics stored in the object.

from_summaries()

Takes in a series of summaries, consisting of the minimum and maximum of a sample, and determines the global minimum and maximum.
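A minimal illustration of that reduction in plain Python follows. The function name and batch values are made up for the example.

```python
def merge_uniform_summaries(summaries):
    # Each summary is the (minimum, maximum) observed in one batch.
    lows, highs = zip(*summaries)
    return min(lows), max(highs)

batches = [[2.5, 3.1, 4.0], [1.8, 2.2], [3.3, 5.6, 4.9]]
summaries = [(min(b), max(b)) for b in batches]
print(merge_uniform_summaries(summaries))  # (1.8, 5.6)
```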

sample()

Sample from this uniform distribution and return the value sampled.