# Bayes Classifiers and Naive Bayes¶

IPython Notebook Tutorial

Bayes classifiers are simple probabilistic classification models based on Bayes' theorem. See the above tutorial for a full primer on how they work and on the distinction between a naive Bayes classifier and a Bayes classifier. Essentially, each class is modeled by a probability distribution and classifications are made according to which distribution fits the data best. They are a supervised version of general mixture models, in that the `predict`, `predict_proba`, and `predict_log_proba` methods return the same values for the same underlying distributions, but instead of using expectation-maximization to fit to new data they can use the provided labels directly.

## Initialization¶

Bayes classifiers and naive Bayes models can both be initialized in one of two ways, depending on whether you know the parameters of the model beforehand: (1) passing a list of pre-initialized distributions to the model, or (2) using the `from_samples` class method to initialize the model directly from data. For naive Bayes models on multivariate data, the pre-initialized distributions must be a list of `IndependentComponentsDistribution` objects, since each dimension is modeled independently of the others. For Bayes classifiers on multivariate data, a list of any type of multivariate distribution can be provided. For univariate data the two models produce identical results and can be initialized with a list of univariate distributions. For example:

```python
from pomegranate import *

d1 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(6, 1), NormalDistribution(9, 1)])
d2 = IndependentComponentsDistribution([NormalDistribution(2, 1), NormalDistribution(8, 1), NormalDistribution(5, 1)])
d3 = IndependentComponentsDistribution([NormalDistribution(3, 1), NormalDistribution(5, 3), NormalDistribution(4, 1)])
model = NaiveBayes([d1, d2, d3])
```

would create a three-class naive Bayes classifier that models three-dimensional data. Alternatively, we can initialize a Bayes classifier in the following manner:

```python
from pomegranate import *

d1 = MultivariateGaussianDistribution([5, 6, 9], [[2, 0, 0], [0, 1, 0], [0, 0, 1]])
d2 = MultivariateGaussianDistribution([2, 8, 5], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
d3 = MultivariateGaussianDistribution([3, 5, 4], [[1, 0, 0], [0, 3, 0], [0, 0, 1]])
model = BayesClassifier([d1, d2, d3])
```

The two examples above functionally create the same model, as the Bayes classifier uses multivariate Gaussian distributions with the same means and a diagonal covariance matrix containing only the variances. However, if we were to fit these models to data later on, the Bayes classifier would learn a full covariance matrix while the naive Bayes would only learn the diagonal.
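The difference in what gets learned can be illustrated with a small numpy sketch (illustrative only, not the library's internals; the data here is made up). Fitting a Gaussian to one class's samples by maximum likelihood yields a full covariance matrix, while a naive Bayes component keeps only the per-dimension variances, i.e. the diagonal:

```python
import numpy as np

# Hypothetical data: 100 samples from one three-dimensional class,
# with correlation between the first two dimensions.
rng = np.random.RandomState(0)
X = rng.multivariate_normal([5, 6, 9], [[2, 1, 0], [1, 1, 0], [0, 0, 1]], size=100)

# A Bayes classifier's Gaussian component learns the full covariance matrix...
full_cov = np.cov(X, rowvar=False)

# ...while a naive Bayes component keeps only the per-dimension variances,
# discarding the off-diagonal correlations.
diag_cov = np.diag(np.var(X, axis=0))

print(np.round(full_cov, 2))
print(np.round(diag_cov, 2))
```

The off-diagonal entries of `full_cov` capture the correlation between dimensions; `diag_cov` sets them to zero by construction.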

If we instead wish to initialize our model directly from data, we use the `from_samples` class method.

```python
from pomegranate import *
import numpy as np

model = NaiveBayes.from_samples(NormalDistribution, X, y)
```

This would create a naive Bayes model directly from the data, with a normal distribution modeling each of the dimensions and a number of components equal to the number of classes in `y`. Alternatively, if we wanted to use a different distribution for each dimension, we can do the following:

```python
model = NaiveBayes.from_samples([NormalDistribution, ExponentialDistribution], X, y)
```

This assumes that your data is two-dimensional and that you want to model the first dimension with a normal distribution and the second with an exponential distribution.

We can do much the same thing with Bayes classifiers, except that more complex models can be passed in.

```python
model = BayesClassifier.from_samples(MultivariateGaussianDistribution, X, y)
```

One can use much more complex models than just a multivariate Gaussian with a full covariance matrix when using a Bayes classifier. Specifically, you can also have your distributions be general mixture models, hidden Markov models, and Bayesian networks. For example:

```python
model = BayesClassifier.from_samples(BayesianNetwork, X, y)
```

Currently this requires that the data be discrete-valued, and the structure learning task may take too long if its parameters are not set appropriately. However, it is possible. One cannot simply pass in GeneralMixtureModel or HiddenMarkovModel, despite them having a `from_samples` method, because there is a great deal of flexibility in their structure and emission distributions. The easiest way to set up one of these more complex models is to build each component separately and then feed them into the Bayes classifier using the first initialization method.

```python
d1 = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=5, X=X[y == 0])
d2 = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=5, X=X[y == 1])
model = BayesClassifier([d1, d2])
```

## Prediction¶

Bayes classifiers and naive Bayes support the same three prediction methods that the other models support: `predict`, `predict_proba`, and `predict_log_proba`. These methods return the most likely class given the data (argmax_m P(M|D)), the probability of each class given the data (P(M|D)), and the log probability of each class given the data (log P(M|D)). It is best to always pass in a 2D matrix, even for univariate data, where it would have a shape of (n, 1).
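Under the hood, these methods are just applications of Bayes' rule: the posterior is the per-class likelihood times the prior, normalized by the evidence. A minimal numpy sketch of the posterior computation for two hypothetical univariate Gaussian classes (kept in log space for numerical stability; the parameters here are made up):

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    # Log density of a univariate normal distribution.
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Two hypothetical classes with uniform priors.
mus, sigmas = np.array([1.0, 0.0]), np.array([2.0, 1.0])
log_prior = np.log(np.array([0.5, 0.5]))

x = 0.0
log_joint = normal_logpdf(x, mus, sigmas) + log_prior  # log P(D|M) + log P(M)
log_evidence = np.logaddexp.reduce(log_joint)          # log P(D), via logsumexp
log_posterior = log_joint - log_evidence               # log P(M|D)

print(np.exp(log_posterior))  # the posteriors sum to 1
```

`predict_log_proba` corresponds to `log_posterior`, `predict_proba` to its exponential, and `predict` to its argmax.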

The `predict` method takes in samples and returns the most likely class given the data.

```python
from pomegranate import *
import numpy as np

model = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1.0)])
model.predict(np.array([[0], [1], [2], [3], [4]]))
[2, 2, 2, 0, 0]
```

Calling `predict_proba` on five samples for a naive Bayes model with univariate components would look like the following.

```python
from pomegranate import *
import numpy as np

model = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1)])
model.predict_proba(np.array([[0], [1], [2], [3], [4]]))
[[ 0.00790443  0.09019051  0.90190506]
 [ 0.05455011  0.20207126  0.74337863]
 [ 0.21579499  0.33322883  0.45097618]
 [ 0.44681566  0.36931382  0.18387052]
 [ 0.59804205  0.33973357  0.06222437]]
```

Multivariate models work the same way.

```python
from pomegranate import *
import numpy as np

d1 = MultivariateGaussianDistribution([5, 5], [[1, 0], [0, 1]])
d2 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(5, 2)])
model = BayesClassifier([d1, d2])
model.predict_proba(np.array([[0, 4],
                              [1, 3],
                              [2, 2],
                              [3, 1],
                              [4, 0]]))
array([[ 0.00023312,  0.99976688],
       [ 0.00220745,  0.99779255],
       [ 0.00466169,  0.99533831],
       [ 0.00220745,  0.99779255],
       [ 0.00023312,  0.99976688]])
```

`predict_log_proba` works the same way, returning the log probabilities instead of the probabilities.

## Fitting¶

Both naive Bayes and Bayes classifiers also have a `fit` method that updates the parameters of the model based on new data. The major difference between these methods and the unsupervised ones presented elsewhere is that these are supervised methods and so must be passed labels in addition to data. This change also propagates to the `summarize` method, which must be given labels as well.

```python
from pomegranate import *
import numpy as np

d1 = MultivariateGaussianDistribution([5, 5], [[1, 0], [0, 1]])
d2 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(5, 2)])
model = BayesClassifier([d1, d2])

X = np.array([[6.0, 5.0],
              [3.5, 4.0],
              [7.5, 1.5],
              [7.0, 7.0]])
y = np.array([0, 0, 1, 1])
model.fit(X, y)
```

As we can see, there are four samples, with the first two labeled as class 0 and the last two labeled as class 1. Keep in mind that the training samples must match the input requirements of the models used: if using a univariate distribution, each sample must contain one item; a bivariate distribution, two. For hidden Markov models, each sample can be a list of observations of any length. An example using hidden Markov models would be the following.

```python
import numpy as np

d1 = HiddenMarkovModel...
d2 = HiddenMarkovModel...
d3 = HiddenMarkovModel...

model = BayesClassifier([d1, d2, d3])

X = np.array([list('HHHHHTHTHTTTTH'),
              list('HHTHHTTHHHHHTH'),
              list('TH'),
              list('HHHHT')])
y = np.array([2, 2, 1, 0])
model.fit(X, y)
```
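The `summarize` method mentioned above enables out-of-core training: each batch is folded into stored sufficient statistics, and `from_summaries` then recovers the same parameters as a single-pass fit. For a Gaussian, the sufficient statistics are just the count, the sum, and the sum of squares. A plain-numpy sketch of the idea for one class's univariate component (illustrative, with made-up batches; not the library's internals):

```python
import numpy as np

# Accumulators for one Gaussian component: count, sum, and sum of squares.
n, s, ss = 0.0, 0.0, 0.0

# "summarize": fold each batch into the sufficient statistics.
for batch in (np.array([6.0, 3.5]), np.array([7.5, 7.0])):
    n += batch.size
    s += batch.sum()
    ss += (batch ** 2).sum()

# "from_summaries": recover the MLE parameters exactly, as if fit in one pass.
mu = s / n
var = ss / n - mu ** 2
print(mu, var)  # 6.0 2.375
```

Because the statistics are additive, the batches can come in any order and any size, which is what makes out-of-core and minibatch training possible.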

## API Reference¶

class pomegranate.NaiveBayes.NaiveBayes

A naive Bayes model, a supervised alternative to GMM.

A naive Bayes classifier treats each dimension independently of the others. It is a simpler version of the Bayes classifier, which can use any distribution with any covariance structure, including Bayesian networks and hidden Markov models.

Parameters
models : list

A list of initialized distributions.

weights : list or numpy.ndarray or None, default None

The prior probabilities of the components. If None is passed in then defaults to the uniformly distributed priors.

Examples

```python
>>> from pomegranate import *
>>> X = [[0], [2], [0], [1], [0], [5], [6], [5], [7], [6]]
>>> y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
>>> clf = NaiveBayes.from_samples(NormalDistribution, X, y)
>>> clf.predict_proba([[6]])
array([[0.01973451,  0.98026549]])
```
```python
>>> from pomegranate import *
>>> clf = NaiveBayes([NormalDistribution(1, 2), NormalDistribution(0, 1)])
>>> clf.predict_log_proba([[0], [1], [2], [-1]])
array([[-1.1836569 , -0.36550972],
       [-0.79437677, -0.60122959],
       [-0.26751248, -1.4493653 ],
       [-1.09861229, -0.40546511]])
```
Attributes
models : list

The model objects, either initialized by the user or fit to data.

weights : numpy.ndarray

The prior probability of each component of the model.

clear_summaries()

Remove the stored sufficient statistics.

Parameters
None
Returns
None
copy()

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters
None
Returns
distribution : Distribution

A copy of the distribution with the same parameters.

fit()

Fit the Bayes classifier to the data by passing data to its components.

The fit step for a Bayes classifier with purely labeled data is a simple MLE update on the underlying distributions, grouped by the labels. However, in the semi-supervised setting, the model is trained on a mixture of both labeled and unlabeled data, where the unlabeled data uses the label -1. In this setting, EM is used to train the model. The model is initialized using the labeled data, and then sufficient statistics are gathered for both the labeled and unlabeled data, combined, and used to update the parameters.

Parameters
X : numpy.ndarray or list

The dataset to operate on. For most models this is a numpy array with columns corresponding to features and rows corresponding to samples. For Markov chains and HMMs this will be a list of variable-length sequences.

y : numpy.ndarray or list or None

Data labels for supervised training algorithms.

weights : array-like or None, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to have the same weight. Default is None.

inertia : double, optional

Inertia used for training the distributions.

pseudocount : double, optional

A pseudocount to add to the emission of each distribution. This effectively smooths the distributions to prevent 0-probability symbols that don't happen to occur in the data. Default is 0.

stop_threshold : double, optional, positive

The threshold at which EM will terminate based on the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Only relevant for semi-supervised learning. Default is 0.1.

max_iterations : int, optional, positive

The maximum number of iterations to run EM for. If this limit is hit then training will terminate, regardless of how well the model is improving per iteration. Only relevant for semi-supervised learning. Default is 1e8.

callbacks : list, optional

A list of callback objects that describe functionality that should be undertaken over the course of training. Only used for semi-supervised learning.

return_history : bool, optional

Whether to return the history during training as well as the model. Only used for semi-supervised learning.

verbose : bool, optional

Whether or not to print out improvement information over iterations. Only relevant for semi-supervised learning. Default is False.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns
self : object

Returns the fitted model.

freeze()

Freeze the distribution, preventing updates from occurring.

from_dict()

Deserialize this object from a dictionary of parameters.

from_json()

Deserialize this object from its JSON representation.

Parameters
s : str

A JSON formatted string containing the model.

Returns
model : object

A properly initialized and baked model.

from_samples()

Create a naive Bayes classifier directly from the given dataset.

This will initialize the distributions using maximum likelihood estimates derived by partitioning the dataset using the label vector. If any labels are missing, the model will be trained using EM in a semi-supervised setting.

A homogeneous model can be defined by passing in a single distribution callable as the first parameter and specifying the number of components, while a heterogeneous model can be defined by passing in a list of callables of the appropriate type.

A naive Bayes classifier is a subset of the Bayes classifier in that the math is identical, but the distributions are independent for each feature. Simply put, one can create a multivariate Gaussian Bayes classifier with a full covariance matrix, but a Gaussian naive Bayes would require a diagonal covariance matrix.

Parameters
distributions : array-like, shape (n_components,) or callable

The components of the model. This should either be a single callable if all components will be the same distribution, or an array of callables, one for each feature.

X : array-like or generator, shape (n_samples, n_dimensions)

This is the data to train on. Each row is a sample, and each column is a dimension to train on.

y : array-like, shape (n_samples,)

The labels for each sample. The labels should be integers between 0 and k-1 for a problem with k classes, or -1 if the label is not known for that sample.

weights : array-like, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.

pseudocount : double, optional, positive

A pseudocount to add to the emission of each distribution. This effectively smooths the distributions to prevent 0-probability symbols that don't happen to occur in the data. Only affects mixture models defined over discrete distributions. Default is 0.

stop_threshold : double, optional, positive

The threshold at which EM will terminate based on the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Only relevant for semi-supervised learning. Default is 0.1.

max_iterations : int, optional, positive

The maximum number of iterations to run EM for. If this limit is hit then training will terminate, regardless of how well the model is improving per iteration. Only relevant for semi-supervised learning. Default is 1e8.

callbacks : list, optional

A list of callback objects that describe functionality that should be undertaken over the course of training.

return_history : bool, optional

Whether to return the history during training as well as the model.

verbose : bool, optional

Whether or not to print out improvement information over iterations. Only relevant for semi-supervised learning. Default is False.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. Default is 1.

Returns
model : NaiveBayes

The fit naive Bayes model.

from_summaries()

Fit the model to the collected sufficient statistics.

Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.

Parameters
inertia : double, optional

The weight of the previous parameters of the model. The new parameters will roughly be old_param * inertia + new_param * (1 - inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.

pseudocount : double, optional

A pseudocount to add to the emission of each distribution. This effectively smooths the distributions to prevent 0-probability symbols that don't happen to occur in the data. If the data is discrete, this will smooth both the prior probabilities of each component and the emissions of each component; otherwise it will only smooth the prior probabilities. Default is 0.

Returns
None
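The inertia blend described above is simply a convex combination of the old and new parameter values. A tiny numpy illustration with hypothetical means (not the library's internals):

```python
import numpy as np

inertia = 0.25
old_mu = np.array([5.0, 6.0])  # parameters before the update
new_mu = np.array([6.0, 8.0])  # parameters estimated from the new data

# old_param * inertia + new_param * (1 - inertia)
blended = old_mu * inertia + new_mu * (1 - inertia)
print(blended)
```

With `inertia=0.25`, each blended value sits three quarters of the way from the old parameter to the new one.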
from_yaml()

Deserialize this object from its YAML representation.

log_probability()

Calculate the log probability of a point under the distribution.

The probability of a point is the sum of the probabilities of each distribution multiplied by their prior weights. Thus, the log probability is the logsumexp of the per-component log probabilities plus the log priors.

This is the python interface.

Parameters
X : numpy.ndarray, shape=(n, d) or (n, m, d)

The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each observation, and X becomes an array of n (m, d)-shaped arrays.

n_jobs : int, optional

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
log_probability : double

The log probability of the point under the distribution.

predict()

Predict the most likely component which generated each sample.

Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.

This is a sklearn wrapper for the maximum_a_posteriori method.

Parameters
X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
y : array-like, shape (n_samples,)

The predicted component which fits the sample the best.

predict_log_proba()

Calculate the posterior log P(M|D) for data.

Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the probabilities sum to 1.

This is a sklearn wrapper for the original posterior function.

Parameters
X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
y : array-like, shape (n_samples, n_components)

The normalized log probability log P(M|D) for each sample. This is the probability that the sample was generated from each component.

predict_proba()

Calculate the posterior P(M|D) for data.

Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.

Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.

Parameters
X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
probability : array-like, shape (n_samples, n_components)

The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.

probability()

Return the probability of the given symbol under this distribution.

Parameters
symbol : object

The symbol to calculate the probability of.

Returns
probability : double

The probability of that point under the distribution.

sample()

Generate a sample from the model.

First, randomly select a component weighted by the prior probability. Then, use the sample method from that component to generate a sample.

Parameters
n : int, optional

The number of samples to generate. Defaults to 1.

random_state : int, numpy.random.RandomState, or None

The random state used for generating samples. If None, a random seed is used; if an integer or a RandomState object is passed in, the outputs are deterministic.

Returns
sample : array-like or object

A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.
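The two-step process this describes can be sketched in plain numpy (hypothetical prior weights and Gaussian components; not the library's internals):

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical priors and per-class univariate Gaussian parameters.
weights = np.array([0.25, 0.75])
mus, sigmas = np.array([0.0, 5.0]), np.array([1.0, 1.0])

# Step 1: pick a component according to the prior weights.
k = rng.choice(len(weights), p=weights)

# Step 2: draw the sample from that component's distribution.
x = rng.normal(mus[k], sigmas[k])
print(k, x)
```

This is ancestral sampling: the component index is drawn from the prior, then the observation from the chosen component.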

score()

Return the accuracy of the model on a data set.

Parameters
X : numpy.ndarray, shape=(n, d)

The values of the data set.

y : numpy.ndarray, shape=(n,)

The labels of each value.

summarize()

Summarize data into stored sufficient statistics for out-of-core training.

Parameters
X : array-like, shape (n_samples, variable)

Array of the samples, which can be either fixed size or variable depending on the underlying components.

y : array-like, shape (n_samples,)

Array of the known labels as integers.

weights : array-like, shape (n_samples,), optional

Array of the weight of each sample, a positive float.

Returns
None
thaw()

Thaw the distribution, re-allowing updates to occur.

to_dict()

Serialize this object to a dictionary of parameters.

to_json()

Serialize the model to JSON.

Parameters
separators : tuple, optional

The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').

indent : int, optional

The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.

Returns
json : str

A properly formatted JSON object.

to_yaml()

Serialize the model to YAML for compactness.

class pomegranate.BayesClassifier.BayesClassifier

A Bayes classifier, a more general form of a naive Bayes classifier.

A Bayes classifier, like a naive Bayes classifier, uses Bayes' rule to calculate the posterior probability of the classes, which is used for prediction. However, a naive Bayes classifier assumes that the features are independent of each other, and so they can be modeled as independent distributions. The Bayes classifier generalizes this by allowing an arbitrary covariance structure between the features. This allows more complicated components to be used, up to and including HMMs to form a classifier over sequences, or mixtures to form a classifier with complex emissions.

Parameters
models : list

A list of initialized distribution objects to use as the components in the model.

weights : list or numpy.ndarray or None, default None

The prior probabilities of the components. If None is passed in then defaults to the uniformly distributed priors.

Examples

```python
>>> from pomegranate import *
>>>
>>> d1 = NormalDistribution(3, 2)
>>> d2 = NormalDistribution(5, 1.5)
>>>
>>> clf = BayesClassifier([d1, d2])
>>> clf.predict_proba([[6]])
array([[ 0.2331767,  0.7668233]])
>>> X = [[0], [2], [0], [1], [0], [5], [6], [5], [7], [6]]
>>> y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
>>> clf.fit(X, y)
>>> clf.predict_proba([[6]])
array([[ 0.01973451,  0.98026549]])
```
Attributes
models : list

The model objects, either initialized by the user or fit to data.

weights : numpy.ndarray

The prior probability of each component of the model.

clear_summaries()

Remove the stored sufficient statistics.

Parameters
None
Returns
None
copy()

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters
None
Returns
distribution : Distribution

A copy of the distribution with the same parameters.

fit()

Fit the Bayes classifier to the data by passing data to its components.

The fit step for a Bayes classifier with purely labeled data is a simple MLE update on the underlying distributions, grouped by the labels. However, in the semi-supervised setting, the model is trained on a mixture of both labeled and unlabeled data, where the unlabeled data uses the label -1. In this setting, EM is used to train the model. The model is initialized using the labeled data, and then sufficient statistics are gathered for both the labeled and unlabeled data, combined, and used to update the parameters.

Parameters
X : numpy.ndarray or list

The dataset to operate on. For most models this is a numpy array with columns corresponding to features and rows corresponding to samples. For Markov chains and HMMs this will be a list of variable-length sequences.

y : numpy.ndarray or list or None

Data labels for supervised training algorithms.

weights : array-like or None, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to have the same weight. Default is None.

inertia : double, optional

Inertia used for training the distributions.

pseudocount : double, optional

A pseudocount to add to the emission of each distribution. This effectively smooths the distributions to prevent 0-probability symbols that don't happen to occur in the data. Default is 0.

stop_threshold : double, optional, positive

The threshold at which EM will terminate based on the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Only relevant for semi-supervised learning. Default is 0.1.

max_iterations : int, optional, positive

The maximum number of iterations to run EM for. If this limit is hit then training will terminate, regardless of how well the model is improving per iteration. Only relevant for semi-supervised learning. Default is 1e8.

callbacks : list, optional

A list of callback objects that describe functionality that should be undertaken over the course of training. Only used for semi-supervised learning.

return_history : bool, optional

Whether to return the history during training as well as the model. Only used for semi-supervised learning.

verbose : bool, optional

Whether or not to print out improvement information over iterations. Only relevant for semi-supervised learning. Default is False.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns
self : object

Returns the fitted model.

freeze()

Freeze the distribution, preventing updates from occurring.

from_dict()

Deserialize this object from a dictionary of parameters.

from_json()

Deserialize this object from its JSON representation.

Parameters
s : str

A JSON formatted string containing the model.

Returns
model : object

A properly initialized and baked model.

from_samples()

Create a Bayes classifier directly from the given dataset.

This will initialize the distributions using maximum likelihood estimates derived by partitioning the dataset using the label vector. If any labels are missing, the model will be trained using EM in a semi-supervised setting.

A homogeneous model can be defined by passing in a single distribution callable as the first parameter and specifying the number of components, while a heterogeneous model can be defined by passing in a list of callables of the appropriate type.

A Bayes classifier is a superset of the naive Bayes classifier in that the math is identical, but the distributions used do not have to be independent for each feature. Simply put, one can create a multivariate Gaussian Bayes classifier with a full covariance matrix, but a Gaussian naive Bayes would require a diagonal covariance matrix.

Parameters
distributions : array-like, shape (n_components,) or callable

The components of the model. This should either be a single callable if all components will be the same distribution, or an array of callables, one for each feature.

X : array-like, shape (n_samples, n_dimensions)

This is the data to train on. Each row is a sample, and each column is a dimension to train on.

y : array-like, shape (n_samples,)

The labels for each sample. The labels should be integers between 0 and k-1 for a problem with k classes, or -1 if the label is not known for that sample.

weights : array-like, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.

inertia : double, optional

Inertia used for training the distributions.

pseudocount : double, optional

A pseudocount to add to the emission of each distribution. This effectively smooths the distributions to prevent 0-probability symbols that don't happen to occur in the data. Default is 0.

stop_threshold : double, optional, positive

The threshold at which EM will terminate based on the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Only relevant for semi-supervised learning. Default is 0.1.

max_iterations : int, optional, positive

The maximum number of iterations to run EM for. If this limit is hit then training will terminate, regardless of how well the model is improving per iteration. Only relevant for semi-supervised learning. Default is 1e8.

callbacks : list, optional

A list of callback objects that describe functionality that should be undertaken over the course of training.

return_history : bool, optional

Whether to return the history during training as well as the model.

keys : list

A list of sets where each set is the keys present in that column. If there are d columns in the data set then this list should have d sets and each set should have at least two keys in it.

verbose : bool, optional

Whether or not to print out improvement information over iterations. Only relevant for semi-supervised learning. Default is False.

n_jobs : int, optional

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

**kwargs : dict, optional

Any arguments to pass into the from_samples methods of other objects that are being created, such as BayesianNetworks or HMMs.

Returns
model : BayesClassifier

The fit Bayes classifier model.

from_summaries()

Fit the model to the collected sufficient statistics.

Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.

Parameters

The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.

pseudocount : double, optional

A pseudocount to add to the emission of each distribution. This effectively smooths the distributions to prevent zero-probability symbols that don't happen to occur in the data. For discrete data, both the prior probabilities of each component and the emissions of each component are smoothed. Otherwise, only the prior probabilities of each component are smoothed. Default is 0.

Returns
None
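The inertia blending described above can be sketched in a few lines of numpy (an illustration of the formula, not pomegranate internals):

```python
import numpy as np

def blend(old_param, new_param, inertia=0.0):
    """Blend previous and newly estimated parameters.

    inertia=0.0 keeps only the new estimate; inertia=1.0 keeps
    only the old parameters, matching the update rule above.
    """
    old_param = np.asarray(old_param, dtype=float)
    new_param = np.asarray(new_param, dtype=float)
    return old_param * inertia + new_param * (1.0 - inertia)

# With inertia=0.25, the update keeps 25% of the old value.
print(blend([4.0], [8.0], inertia=0.25))  # [7.]
```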
from_yaml()

Deserialize this object from its YAML representation.

log_probability()

Calculate the log probability of a point under the distribution.

The probability of a point is the sum over components of each component's probability multiplied by its prior weight. The log probability is therefore the log-sum-exp over components of the component log probability plus the log prior.

This is the python interface.

Parameters
X : numpy.ndarray, shape=(n, d) or (n, m, d)

The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each observation, and X becomes an array of n (m, d)-shaped arrays.

n_jobs : int, optional

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
log_probability : double

The log probability of the point under the distribution.
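The log-sum-exp computation described above can be sketched in numpy for a hypothetical two-component univariate Gaussian model (an illustration of the formula, not pomegranate's implementation):

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    # Log density of a univariate normal distribution.
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def mixture_log_probability(x, mus, sigmas, priors):
    # log P(x) = logsumexp_k( log P(x|k) + log pi_k ), computed stably
    # by subtracting the row maximum before exponentiating.
    log_joint = np.array([normal_logpdf(x, m, s)
                          for m, s in zip(mus, sigmas)]) + np.log(priors)
    a = log_joint.max()
    return a + np.log(np.exp(log_joint - a).sum())

lp = mixture_log_probability(1.0, mus=[0.0, 5.0], sigmas=[1.0, 1.0],
                             priors=[0.5, 0.5])
```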

predict()

Predict the most likely component which generated each sample.

Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.

This is a sklearn wrapper for the maximum_a_posteriori method.

Parameters
X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
y : array-like, shape (n_samples,)

The predicted component which fits the sample the best.
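The argmax over the responsibility matrix mentioned above is simple to sketch (illustrative numpy, not pomegranate code); normalizing the posteriors does not change the argmax, so unnormalized log joints suffice:

```python
import numpy as np

def predict(log_joint):
    """Return the index of the most probable component per sample.

    log_joint is an (n_samples, n_components) matrix of
    log P(x|M) + log P(M).
    """
    return np.argmax(log_joint, axis=1)

log_joint = np.array([[-1.0, -4.0],
                      [-3.0, -0.5]])
print(predict(log_joint))  # [0 1]
```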

predict_log_proba()

Calculate the posterior log P(M|D) for data.

Calculate the log probability of each sample having been generated from each component in the model. This returns normalized log probabilities such that, once exponentiated, they sum to 1 across components.

This is a sklearn wrapper for the original posterior function.

Parameters
X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
y : array-like, shape (n_samples, n_components)

The normalized log probability log P(M|D) for each sample. This is the probability that the sample was generated from each component.
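The normalization can be done entirely in log space by subtracting the per-row log-sum-exp, which avoids underflow. A minimal numpy sketch of that step (illustrative, not pomegranate internals):

```python
import numpy as np

def predict_log_proba(log_joint):
    # log P(M|D) = log joint - logsumexp(log joint), per row,
    # with the row maximum subtracted first for numerical stability.
    a = log_joint.max(axis=1, keepdims=True)
    log_norm = a + np.log(np.exp(log_joint - a).sum(axis=1, keepdims=True))
    return log_joint - log_norm

log_joint = np.array([[-1.0, -4.0],
                      [-3.0, -0.5]])
log_posteriors = predict_log_proba(log_joint)
```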

predict_proba()

Calculate the posterior P(M|D) for data.

Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.

Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.

Parameters
X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

n_jobs : int

The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

batch_size : int or None, optional

The size of the batches to make predictions on. Passing in None means splitting the data set evenly among the number of jobs. Default is None.

Returns
probability : array-like, shape (n_samples, n_components)

The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.
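In probability space the normalization is just dividing each row of joint probabilities by its sum. A numpy sketch of the result `predict_proba` returns (illustrative, with made-up joint probabilities):

```python
import numpy as np

def predict_proba(joint):
    """Normalize per-row joint probabilities P(x, M) into posteriors P(M|x)."""
    return joint / joint.sum(axis=1, keepdims=True)

joint = np.array([[0.20, 0.10],
                  [0.03, 0.27]])
posteriors = predict_proba(joint)
print(posteriors)
```

Each row of the output sums to 1, so the entries can be read directly as the probability that a sample came from each component.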

probability()

Return the probability of the given symbol under this distribution.

Parameters
symbol : object

The symbol to calculate the probability of.

Returns
probability : double

The probability of that point under the distribution.

sample()

Generate a sample from the model.

First, randomly select a component weighted by the prior probability. Then, use the sample method of that component to generate a sample.

Parameters
n : int, optional

The number of samples to generate. Defaults to 1.

random_state : int, numpy.random.RandomState, or None

The random state used for generating samples. If None, a random seed is used. If an integer or a numpy.random.RandomState object, the outputs are deterministic.

Returns
sample : array-like or object

A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.
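The two-step procedure above — pick a component by its prior, then sample from it — can be sketched with numpy for hypothetical univariate Gaussian components (illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
priors = np.array([0.7, 0.3])          # component prior probabilities
mus, sigmas = [0.0, 10.0], [1.0, 1.0]  # toy univariate Gaussian components

# First pick a component weighted by the priors, then sample from it.
component = rng.choice(len(priors), p=priors)
sample = rng.normal(mus[component], sigmas[component])
```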

score()

Return the accuracy of the model on a data set.

Parameters
X : numpy.ndarray, shape=(n, d)

The values of the data set.

y : numpy.ndarray, shape=(n,)

The labels of each value.
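Accuracy here is simply the fraction of predicted labels that match the true labels, which is easy to sketch (illustrative numpy, not pomegranate internals):

```python
import numpy as np

def accuracy(y_pred, y_true):
    # Fraction of predictions that match the labels.
    return np.mean(np.asarray(y_pred) == np.asarray(y_true))

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```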

summarize()

Summarize data into stored sufficient statistics for out-of-core training.

Parameters
X : array-like, shape (n_samples, variable)

Array of the samples, which can be either fixed size or variable depending on the underlying components.

y : array-like, shape (n_samples,)

Array of the known labels as integers.

weights : array-like, shape (n_samples,), optional

Array of the weight of each sample, a positive float.

Returns
None
thaw()

Thaw the distribution, re-allowing updates to occur.

to_dict()

Serialize this object to a dictionary of parameters.

to_json()

Serialize the model to JSON.

Parameters
separators : tuple, optional

The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').

indent : int, optional

The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.

Returns
json : str

A properly formatted JSON object.
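The separators and indent arguments behave as in Python's standard json.dumps. A small standalone example with a hypothetical parameter dictionary shows the effect of the defaults:

```python
import json

# Hypothetical parameter dictionary standing in for a serialized model.
params = {"class": "BayesClassifier", "n_components": 3}

# Default formatting: keys and values separated by ' : ', 4-space indent.
s = json.dumps(params, separators=(',', ' : '), indent=4)
print(s)
```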

to_yaml()

Serialize the model to YAML for compactness.