Naive Bayes Classifiers

IPython Notebook Tutorial

The Naive Bayes classifier is a simple probabilistic classification model based on Bayes' theorem. Since a Naive Bayes classifier assigns each sample to the class with the highest conditional probability, it can use as a component any distribution or model that provides a probabilistic interpretation of the data. In short, if a model can output a log probability, it can be used in Naive Bayes.
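
For instance, the decision rule can be sketched by hand with a few pomegranate distributions. This is a minimal sketch, assuming uniform class priors: each candidate model scores the sample with its log_probability method, and the class with the highest score is the prediction.

>>> from pomegranate import *
>>> import numpy as np
>>> models = [NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1)]
>>> np.argmax([d.log_probability(1) for d in models])  # index of the winning class
2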

An IPython notebook example demonstrating a Naive Bayes classifier using multivariate distributions can be found here.

Initialization

Naive Bayes can be initialized in two ways: either by (1) passing in pre-initialized models as a list, or by (2) passing in the constructor of a simple distribution along with the number of components. For example, here is how you can create a Naive Bayes classifier which compares a normal distribution to a uniform distribution to an exponential distribution:

>>> from pomegranate import *
>>> clf = NaiveBayes([ NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1) ])

An advantage of initializing the classifier this way is that you can use pre-trained or known-beforehand models to make predictions. A disadvantage is that if we don't have any prior knowledge of what the distributions should be, then we have to make up distributions to start with. If all of the components of the classifier use the same type of model, then we can instead pass in the constructor for that model and the number of classes.

>>> from pomegranate import *
>>> clf = NaiveBayes(NormalDistribution, n_components=5)

Warning

If we initialize a naive Bayes classifier in this manner we must fit the model before we can use it to predict.

An advantage of doing it this way is that we don’t need to make dummy distributions just to train, but a disadvantage is that we have to train the model before we can use it.
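
For example, a classifier created from a constructor might be fit and then used roughly as follows. This is a minimal sketch; the data is made up purely for illustration.

>>> from pomegranate import *
>>> import numpy as np
>>> clf = NaiveBayes(NormalDistribution, n_components=2)
>>> X = np.array([1.0, 1.5, 2.0, 8.0, 8.5, 9.0])  # illustrative univariate samples
>>> y = np.array([0, 0, 0, 1, 1, 1])              # class label for each sample
>>> clf.fit(X, y)  # fitting is required before any prediction method can be called
>>> clf.predict(np.array([1.2, 8.7]))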

Since a Naive Bayes classifier simply compares the likelihood of a sample under each of its models, it can be initialized with any model in pomegranate, as long as all of the models take the same type of input.

>>> from pomegranate import *
>>> d1 = MultivariateGaussianDistribution([5, 5], [[1, 0], [0, 1]])
>>> d2 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(5, 2)])
>>> clf = NaiveBayes([d1, d2])

Note

This is no longer strictly a “naive” Bayes classifier if we are using more complicated models. However, much of the underlying math still holds.

Prediction

Naive Bayes supports the same three prediction methods that the other models support, namely predict, predict_proba, and predict_log_proba. These methods return the most likely class given the data, the probability of each class given the data, and the log probability of each class given the data.

The predict method takes in samples and returns the most likely class given the data.

>>> from pomegranate import *
>>> import numpy as np
>>> clf = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1.0)])
>>> clf.predict(np.array([0, 1, 2, 3, 4]))
[2, 2, 2, 0, 0]

Calling predict_proba on five samples for a Naive Bayes with univariate components would look like the following.

>>> from pomegranate import *
>>> import numpy as np
>>> clf = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1)])
>>> clf.predict_proba(np.array([0, 1, 2, 3, 4]))
[[ 0.00790443  0.09019051  0.90190506]
 [ 0.05455011  0.20207126  0.74337863]
 [ 0.21579499  0.33322883  0.45097618]
 [ 0.44681566  0.36931382  0.18387052]
 [ 0.59804205  0.33973357  0.06222437]]

Multivariate models work the same way except that the input has to have the same number of columns as are represented in the model, like the following.

>>> from pomegranate import *
>>> import numpy as np
>>> d1 = MultivariateGaussianDistribution([5, 5], [[1, 0], [0, 1]])
>>> d2 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(5, 2)])
>>> clf = NaiveBayes([d1, d2])
>>> clf.predict_proba(np.array([[0, 4],
                                [1, 3],
                                [2, 2],
                                [3, 1],
                                [4, 0]]))
array([[ 0.00023312,  0.99976688],
       [ 0.00220745,  0.99779255],
       [ 0.00466169,  0.99533831],
       [ 0.00220745,  0.99779255],
       [ 0.00023312,  0.99976688]])

predict_log_proba works in a similar way except that it returns the log probabilities instead of the actual probabilities.
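
As a quick sketch of that relationship, using the same univariate classifier as above, exponentiating the result of predict_log_proba should recover predict_proba:

>>> from pomegranate import *
>>> import numpy as np
>>> clf = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1)])
>>> X = np.array([0, 1, 2, 3, 4])
>>> np.allclose(np.exp(clf.predict_log_proba(X)), clf.predict_proba(X))
True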

Fitting

Naive Bayes has a fit method, in which the models in the classifier are trained to “fit” to a set of data. The method takes two numpy arrays as input, an array of samples and an array of correct classifications for each sample. Here is an example for a Naive Bayes made up of two bivariate distributions.

>>> from pomegranate import *
>>> import numpy as np
>>> d1 = MultivariateGaussianDistribution([5, 5], [[1, 0], [0, 1]])
>>> d2 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(5, 2)])
>>> clf = NaiveBayes([d1, d2])
>>> X = np.array([[6.0, 5.0],
                  [3.5, 4.0],
                  [7.5, 1.5],
                  [7.0, 7.0]])
>>> y = np.array([0, 0, 1, 1])
>>> clf.fit(X, y)

As we can see, there are four samples, with the first two labeled as class 0 and the last two labeled as class 1. Keep in mind that the training samples must match the input requirements of the models used: if using a univariate distribution, each sample must contain one item; a bivariate distribution, two. For hidden Markov models, each sample can be a sequence of observations of any length. An example using hidden Markov models would be the following.

>>> X = np.array([list('HHHHHTHTHTTTTH'),
                  list('HHTHHTTHHHHHTH'),
                  list('TH'),
                  list('HHHHT')])
>>> y = np.array([2, 2, 1, 0])
>>> clf.fit(X, y)
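
Note that for this example the classifier must be built from hidden Markov model components rather than the bivariate distributions used earlier. A minimal sketch of such a classifier, using illustrative single-state coin-flip HMMs (the coin_hmm helper and its probabilities are made up for illustration), might look like the following:

>>> from pomegranate import *
>>> def coin_hmm(p_heads):
...     # Single-state HMM emitting 'H'/'T'; an illustrative helper, not part of pomegranate.
...     s = State(DiscreteDistribution({'H': p_heads, 'T': 1 - p_heads}), name="coin")
...     model = HiddenMarkovModel()
...     model.add_state(s)
...     model.add_transition(model.start, s, 1.0)
...     model.add_transition(s, s, 1.0)
...     model.bake()
...     return model
>>> clf = NaiveBayes([coin_hmm(0.2), coin_hmm(0.5), coin_hmm(0.8)])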

API Reference

Naive Bayes estimator, for anything with a log_probability method.

class pomegranate.NaiveBayes.NaiveBayes

A Naive Bayes model, a supervised alternative to GMM.

Parameters:

models : list or constructor

Must either be a list of initialized distribution/model objects, or the constructor for a distribution object:

  • Initialized : NaiveBayes([NormalDistribution(1, 2), NormalDistribution(0, 1)])
  • Constructor : NaiveBayes(NormalDistribution)

weights : list or numpy.ndarray or None, default None

The prior probabilities of the components. If None is passed in, the priors default to a uniform distribution.
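
For instance, unequal class priors could be supplied at initialization, roughly like the following sketch (the specific weights are made up for illustration):

>>> from pomegranate import *
>>> # Prior of 0.8 on the first component and 0.2 on the second (illustrative values).
>>> clf = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10)], weights=[0.8, 0.2])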

Examples

>>> from pomegranate import *
>>> clf = NaiveBayes(NormalDistribution)
>>> X = [0, 2, 0, 1, 0, 5, 6, 5, 7, 6]
>>> y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
>>> clf.fit(X, y)
>>> clf.predict_proba([6])
array([[ 0.01973451,  0.98026549]])
>>> from pomegranate import *
>>> clf = NaiveBayes([NormalDistribution(1, 2), NormalDistribution(0, 1)])
>>> clf.predict_log_proba([[0], [1], [2], [-1]])
array([[-1.1836569 , -0.36550972],
       [-0.79437677, -0.60122959],
       [-0.26751248, -1.4493653 ],
       [-1.09861229, -0.40546511]])

Attributes

models : list

The model objects, either initialized by the user or fit to data.

weights : numpy.ndarray

The prior probability of each component of the model.

clear_summaries()

Clear the summary statistics stored in the object.

copy()

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters:

None

Returns:

distribution : Distribution

A copy of the distribution with the same parameters.

fit()

Fit the Naive Bayes model to the data by passing data to their components.

Parameters:

X : numpy.ndarray or list

The dataset to operate on. For most models this is a numpy array with columns corresponding to features and rows corresponding to samples. For Markov chains and HMMs this will be a list of variable-length sequences.

y : numpy.ndarray or list or None, optional

Data labels for supervised training algorithms. Default is None.

weights : array-like or None, shape (n_samples,), optional

The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.

n_jobs : int

The number of threads or processes to use for parallelization. Default is 1.

inertia : double, optional

Inertia used for training the distributions.

Returns:

self : object

Returns the fitted model

freeze()

Freeze the distribution, preventing updates from occurring.

from_summaries()

Fit the Naive Bayes model to the stored sufficient statistics.

Parameters:

inertia : double, optional

Inertia used for training the distributions.

Returns:

self : object

Returns the fitted model

log_probability()

Return the log probability of the given symbol under this distribution.

Parameters:

symbol : double

The symbol to calculate the log probability of (overridden for DiscreteDistributions).

Returns:

logp : double

The log probability of that point under the distribution.

predict()

Predict the most likely component which generated each sample.

Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.

This is a sklearn wrapper for the maximum_a_posteriori method.

Parameters:

X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

Returns:

y : array-like, shape (n_samples,)

The predicted component which fits the sample the best.

predict_log_proba()

Calculate the posterior log P(M|D) for data.

Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the probabilities sum to 1.

This is a sklearn wrapper for the original posterior function.

Parameters:

X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

Returns:

y : array-like, shape (n_samples, n_components)

The normalized log probability log P(M|D) for each sample. This is the probability that the sample was generated from each component.

predict_proba()

Calculate the posterior P(M|D) for data.

Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.

Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.

Parameters:

X : array-like, shape (n_samples, n_dimensions)

The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.

Returns:

probability : array-like, shape (n_samples, n_components)

The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.

probability()

Return the probability of the given symbol under this distribution.

Parameters:

symbol : object

The symbol to calculate the probability of

Returns:

probability : double

The probability of that point under the distribution.

sample()

Return a random item sampled from this distribution.

Parameters:

n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:

sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

summarize()

Summarize data into stored sufficient statistics for out-of-core training.

Parameters:

X : array-like, shape (n_samples, variable)

Array of the samples, which can be either fixed size or variable depending on the underlying components.

y : array-like, shape (n_samples,)

Array of the known labels as integers

weights : array-like, shape (n_samples,) optional

Array of the weight of each sample, a positive float

n_jobs : int

The number of threads or processes to use for parallelization. Default is 1.

Returns:

None
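
Together with from_summaries, this supports out-of-core training. A rough sketch of such a loop, assuming the data arrives in batches (the batch arrays below are purely illustrative):

>>> from pomegranate import *
>>> import numpy as np
>>> clf = NaiveBayes([NormalDistribution(1, 1), NormalDistribution(8, 1)])
>>> batches = [(np.array([0.5, 1.2, 7.9]), np.array([0, 0, 1])),
...            (np.array([1.1, 8.3, 8.0]), np.array([0, 1, 1]))]
>>> for X_batch, y_batch in batches:
...     clf.summarize(X_batch, y_batch)  # accumulate sufficient statistics for each batch
>>> clf.from_summaries()                 # update the components from the stored statistics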

thaw()

Thaw the distribution, re-allowing updates to occur.