Bayesian Networks

IPython Notebook Tutorial

Bayesian networks are a powerful inference tool in which nodes represent some random variable we care about, edges represent dependencies, and a lack of an edge between two nodes represents a conditional independence. A powerful algorithm called the sum-product (belief propagation) algorithm allows inference to be done on this network, calculating posteriors on unobserved ("hidden") variables when limited information is given. The more information is known, the better the inference will be, but there is no requirement on the number of nodes which must be observed. If no information is given, the marginal of the graph is trivially calculated. The hidden and observed variables do not need to be explicitly defined when the network is constructed; they simply exist based on what information is given.

Let's test out the Bayesian network framework on the Monty Hall problem. The Monty Hall problem arose from the gameshow Let's Make a Deal, where a guest had to choose which one of three doors had a prize behind it. The twist was that after the guest chose, the host, originally Monty Hall, would then open one of the doors the guest did not pick and ask if the guest wanted to switch which door they had picked. Initial inspection may lead you to believe that if there are only two doors left, there is a 50-50 chance of picking the right one, and so there is no advantage one way or the other. However, it has been proven both through simulations and analytically that there is in fact a 2/3 chance of getting the prize if the guest switches their door, regardless of the door they initially picked.

We can reproduce this result using Bayesian networks with three nodes, one for the guest, one for the prize, and one for the door Monty chooses to open. The door the guest initially chooses and the door the prize is behind are completely random processes across the three doors, but the door which Monty opens is dependent on both the door the guest chooses (it cannot be the door the guest chooses), and the door the prize is behind (it cannot be the door with the prize behind it).

import math
from pomegranate import *

# The guests initial door selection is completely random
guest = DiscreteDistribution( { 'A': 1./3, 'B': 1./3, 'C': 1./3 } )

# The door the prize is behind is also completely random
prize = DiscreteDistribution( { 'A': 1./3, 'B': 1./3, 'C': 1./3 } )

# Monty is dependent on both the guest and the prize.
monty = ConditionalProbabilityTable(
        [[ 'A', 'A', 'A', 0.0 ],
         [ 'A', 'A', 'B', 0.5 ],
         [ 'A', 'A', 'C', 0.5 ],
         [ 'A', 'B', 'A', 0.0 ],
         [ 'A', 'B', 'B', 0.0 ],
         [ 'A', 'B', 'C', 1.0 ],
         [ 'A', 'C', 'A', 0.0 ],
         [ 'A', 'C', 'B', 1.0 ],
         [ 'A', 'C', 'C', 0.0 ],
         [ 'B', 'A', 'A', 0.0 ],
         [ 'B', 'A', 'B', 0.0 ],
         [ 'B', 'A', 'C', 1.0 ],
         [ 'B', 'B', 'A', 0.5 ],
         [ 'B', 'B', 'B', 0.0 ],
         [ 'B', 'B', 'C', 0.5 ],
         [ 'B', 'C', 'A', 1.0 ],
         [ 'B', 'C', 'B', 0.0 ],
         [ 'B', 'C', 'C', 0.0 ],
         [ 'C', 'A', 'A', 0.0 ],
         [ 'C', 'A', 'B', 1.0 ],
         [ 'C', 'A', 'C', 0.0 ],
         [ 'C', 'B', 'A', 1.0 ],
         [ 'C', 'B', 'B', 0.0 ],
         [ 'C', 'B', 'C', 0.0 ],
         [ 'C', 'C', 'A', 0.5 ],
         [ 'C', 'C', 'B', 0.5 ],
         [ 'C', 'C', 'C', 0.0 ]], [guest, prize] )

s1 = State( guest, name="guest" )
s2 = State( prize, name="prize" )
s3 = State( monty, name="monty" )

network = BayesianNetwork( "Monty Hall Problem" )
network.add_states(s1, s2, s3)
network.add_edge(s1, s3)
network.add_edge(s2, s3)
network.bake()

Bayesian networks utilize ConditionalProbabilityTable objects to represent conditional distributions. This distribution is made up of a table where each column represents a parent (or self) value, except for the last column, which gives the probability of the variable taking on that value given its parent values. It also takes in a list of parent distribution objects in the same order that they are used in the table. In the Monty Hall example, the monty distribution is dependent on both the guest and the prize distributions, in that order, so the first column of the CPT is the value the guest takes and the second column is the value the prize takes.
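To make the column ordering concrete, here is how two rows of the table above read (annotation only, no new code):

# Columns are (guest, prize, monty, probability): parents first, in the order
# given by [guest, prize], then the variable itself, then the probability.
# [ 'A', 'B', 'C', 1.0 ]  ->  P(monty='C' | guest='A', prize='B') = 1.0
#     (Monty can open neither the guest's door nor the prize door.)
# [ 'A', 'A', 'B', 0.5 ]  ->  P(monty='B' | guest='A', prize='A') = 0.5
#     (Monty may open either of the two remaining doors.)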

The next step is to make predictions using this model. One of the strengths of Bayesian networks is their ability to infer the values of arbitrary ‘hidden variables’ given the values from ‘observed variables.’ These hidden and observed variables do not need to be specified beforehand, and the more variables which are observed the better the inference will be on the hidden variables.

Let's say that the guest chooses door 'A'. The guest then becomes an observed variable, while both prize and monty remain hidden variables.

>>> beliefs = network.predict_proba({ 'guest' : 'A' })
>>> beliefs = map(str, beliefs)
>>> print("\n".join( "{}\t{}".format( state.name, belief ) for state, belief in zip( network.states, beliefs ) ))
prize   DiscreteDistribution({'A': 0.3333333333333335, 'C': 0.3333333333333333, 'B': 0.3333333333333333})
guest   DiscreteDistribution({'A': 1.0, 'C': 0.0, 'B': 0.0})
monty   DiscreteDistribution({'A': 0.0, 'C': 0.5, 'B': 0.5})

Since we’ve observed the value that guest takes, we know there is a 100% chance it is that value. The prize distribution is unaffected because it is independent of the guest variable given that we don’t know the door that Monty opens.

Now the next step is for Monty to open a door. Let's say that Monty opens door 'B':

>>> beliefs = network.predict_proba({'guest' : 'A', 'monty' : 'B'})
>>> print("\n".join( "{}\t{}".format( state.name, str(belief) ) for state, belief in zip( network.states, beliefs ) ))
guest   DiscreteDistribution({'A': 1.0, 'C': 0.0, 'B': 0.0})
monty   DiscreteDistribution({'A': 0.0, 'C': 0.0, 'B': 1.0})
prize   DiscreteDistribution({'A': 0.3333333333333333, 'C': 0.6666666666666666, 'B': 0.0})

We've observed both guest and monty, so there is a 100% chance for those values. However, we see that the probability of the prize being behind door 'C' is now 2/3, reproducing the counterintuitive result behind the Monty Hall problem!

API Reference

class pomegranate.BayesianNetwork.BayesianNetwork

A Bayesian Network Model.

A Bayesian network is a directed graph where nodes represent variables, edges represent conditional dependencies of the children on their parents, and the lack of an edge represents a conditional independence.

Parameters:

name : str, optional

The name of the model. Default is None

Attributes

states (list, shape (n_states,)) A list of all the state objects in the model
graph (networkx.DiGraph) The underlying graph object.
add_edge()

Add an edge from state a to state b, indicating that b is dependent on a in ways specified by the distribution.

add_node()

Add a node to the graph.

add_nodes()

Add multiple nodes to the graph.

add_state()

Another name for add_node().

add_states()

Another name for add_nodes().

add_transition()

Transitions and edges are the same.

bake()

Finalize the topology of the model.

Assign a numerical index to every state and create the underlying arrays corresponding to the states and edges between the states. This method must be called before any of the probability-calculating methods. This includes converting conditional probability tables into joint probability tables and creating a list of both marginal and table nodes.

Parameters:None
Returns:None
clear_summaries()

Clear the summary statistics stored in the object.

copy()

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters:

None

Returns:

distribution : Distribution

A copy of the distribution with the same parameters.

dense_transition_matrix()

Returns the dense transition matrix. Useful if the transitions of somewhat small models need to be analyzed.

edge_count()

Returns the number of edges present in the model.

fit()

Fit the model to data using MLE estimates.

Fit the model to the data by updating each of the components of the model, which are univariate or multivariate distributions. This uses a simple MLE estimate to update the distributions according to their summarize or fit methods.

This is a wrapper for the summarize and from_summaries methods.

Parameters:

items : array-like, shape (n_samples, n_nodes)

The data to train on, where each row is a sample and each column corresponds to the associated variable.

weights : array-like, shape (n_samples,), optional

The weight of each sample as a positive double. Default is None.

inertia : double, optional

The inertia for updating the distributions, passed along to the distribution method. Default is 0.0.

Returns:

None
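A minimal sketch of fitting the Monty Hall network built above to data; the sample values here are made up purely for illustration:

# Hypothetical games: each row is (guest, prize, monty), in the same order
# the states were added to the network.
data = [['A', 'B', 'C'],
        ['A', 'C', 'B'],
        ['B', 'A', 'C'],
        ['C', 'A', 'B']]

# Re-estimate every distribution in the network from the data using MLE.
network.fit(data)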

freeze()

Freeze the distribution, preventing updates from occurring.

from_json()

Read in a serialized Bayesian Network and return the appropriate object.

Parameters:

s : str

A JSON formatted string containing the file.

Returns:

model : object

A properly initialized and baked model.

from_samples()

Learn the structure of the network from data.

Find the structure of the network from data using a Bayesian structure learning score. This currently enumerates the exponential number of possible structures and finds the best according to the score. Weights on the individual samples are supported as well.

Parameters:

X : array-like, shape (n_samples, n_nodes)

The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.

weights : array-like, shape (n_samples,), optional

The weight of each sample as a positive double. Default is None.

algorithm : str, one of 'chow-liu', 'exact', optional

The algorithm to use for learning the Bayesian network. Default is ‘chow-liu’ which returns a tree structure.

max_parents : int, optional

The maximum number of parents a node can have. If set, the k-learn procedure is used, which can drastically speed up the search. If -1, there is no limit on the number of parents. Default is -1.

root : int, optional

For algorithms which require a single root ('chow-liu'), this is the root from which all edges point away. The user may specify which column to use as the root. Default is the first column.

constraint_graph : networkx.DiGraph or None, optional

A directed graph showing valid parent sets for each variable. Each node is a set of variables, and edges represent which variables can be valid parents of those variables. The naive structure learning task is just all variables in a single node with a self edge, meaning that you know nothing in advance about which variables may be parents of which others. Default is None.

pseudocount : double, optional

A pseudocount to add to each possibility.

Returns:

model : BayesianNetwork

The learned BayesianNetwork.
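A rough sketch of structure learning on a small synthetic dataset; the variable values and sizes are arbitrary:

import numpy

# Hypothetical dataset: 1000 samples over three binary variables.
X = numpy.random.randint(2, size=(1000, 3))

# Chow-Liu (the default) learns a tree-shaped network quickly.
tree_model = BayesianNetwork.from_samples(X, algorithm='chow-liu')

# The exact search scores every structure, so it is only practical
# for a small number of variables.
exact_model = BayesianNetwork.from_samples(X, algorithm='exact', max_parents=2)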

from_structure()

Return a Bayesian network from a predefined structure.

Pass in the structure of the network as a tuple of tuples and get a fit network in return. The tuple should contain n tuples, one for each node in the graph. Each inner tuple should list the parents of that node. For example, a three node graph where both node 0 and node 1 have node 2 as a parent would be specified as ((2,), (2,), ()).

Parameters:

X : array-like, shape (n_samples, n_nodes)

The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.

structure : tuple of tuples

The parents for each node in the graph. If a node has no parents, use an empty tuple.

weights : array-like, shape (n_samples,), optional

The weight of each sample as a positive double. Default is None.

name : str, optional

The name of the model. Default is None.

Returns:

model : BayesianNetwork

A Bayesian network with the specified structure.
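A minimal sketch using the three-node structure described above, where nodes 0 and 1 each have node 2 as a parent; the data is synthetic:

import numpy

# Hypothetical data over three binary variables.
X = numpy.random.randint(2, size=(500, 3))

# Nodes 0 and 1 each have node 2 as their only parent; node 2 has no parents.
structure = ((2,), (2,), ())

model = BayesianNetwork.from_structure(X, structure)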

from_summaries()

Use MLE on the stored sufficient statistics to train the model.

This uses MLE estimates on the stored sufficient statistics to train the model.

Parameters:

inertia : double, optional

The inertia for updating the distributions, passed along to the distribution method. Default is 0.0.

Returns:

None

log_probability()

Return the log probability of a sample under the Bayesian network model.

The log probability is just the sum of the log probabilities under each of the components. For example, the probability of a sample under the graph A -> B is P(A)*P(B|A), so its log probability is log P(A) + log P(B|A).

Parameters:

sample : array-like, shape (n_nodes)

The sample is a vector of points where each dimension represents the same variable as added to the graph originally. It doesn’t matter what the connections between these variables are, just that they are all ordered the same.

Returns:

logp : double

The log probability of that sample.
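For the Monty Hall network above, the log probability of a fully observed game factors over the three nodes; a minimal sketch:

# P(guest='A') * P(prize='B') * P(monty='C' | guest='A', prize='B')
# = (1/3) * (1/3) * 1.0 = 1/9, so the log probability should be about
# log(1/9) = -2.197.
print(network.log_probability(['A', 'B', 'C']))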

marginal()

Return the marginal probabilities of each variable in the graph.

This is equivalent to a pass of belief propagation on a graph where no data has been given. This will calculate the probability of each variable taking each possible value when nothing is known.

Parameters:

None

Returns:

marginals : array-like, shape (n_nodes)

An array of univariate distribution objects showing the marginal probabilities of that variable.
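For the Monty Hall network above, with no evidence every variable should be uniform over the three doors; a minimal sketch:

# Each entry is a univariate distribution, aligned with network.states.
for state, dist in zip(network.states, network.marginal()):
    print(state.name, dist)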

node_count()

Returns the number of nodes/states in the model

plot()

Draw this model’s graph using NetworkX and matplotlib.

Note that this relies on networkx’s built-in graphing capabilities (and not Graphviz) and thus can’t draw self-loops.

See networkx.draw_networkx() for the keywords you can pass in.

Parameters:

**kwargs : any

The arguments to pass into networkx.draw_networkx()

Returns:

None

predict()

Predict missing values of a data matrix using MLE.

Impute the missing values of a data matrix using the maximally likely predictions from loopy belief propagation. Run each sample through the algorithm (predict_proba) and replace missing values with the maximally likely predicted emission.

Parameters:

items : array-like, shape (n_samples, n_nodes)

Data matrix to impute. Missing values must be either None (if lists) or np.nan (if numpy.ndarray). Will fill in these values with the maximally likely ones.

max_iterations : int, optional

Number of iterations to run loopy belief propagation for. Default is 100.

Returns:

items : array-like, shape (n_samples, n_nodes)

This is the data matrix with the missing values imputed.
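A minimal sketch of imputation on the Monty Hall network above, where the prize column is unknown:

# Guest picked 'A' and Monty opened 'B'; the missing prize entry (None) is
# imputed with its most likely value, which should be 'C'.
print(network.predict([['A', None, 'B']]))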

predict_proba()

Returns the probabilities of each variable in the graph given evidence.

This calculates the marginal probability distributions for each state given the evidence provided through loopy belief propagation. Loopy belief propagation is an approximate algorithm which is exact for certain graph structures.

Parameters:

data : dict or array-like, shape <= n_nodes, optional

The evidence supplied to the graph. This can either be a dictionary with keys being state names and values being the observed values (either the emissions or a distribution over the emissions), or an array with the values ordered according to the order in which the nodes were added to the graph (the order fed into .add_states/.add_nodes) and None for variables which are unknown. If nothing is passed in, the marginal of the graph is calculated. Default is {}.

max_iterations : int, optional

The number of iterations with which to do loopy belief propagation. Usually only 1 is required. Default is 100.

check_input : bool, optional

Check to make sure that the observed symbol is a valid symbol for that distribution to produce. Default is True.

Returns:

probabilities : array-like, shape (n_nodes)

An array of univariate distribution objects showing the probabilities of each variable.

probability()

Return the probability of the given symbol under this distribution.

Parameters:

symbol : object

The symbol to calculate the probability of

Returns:

probability : double

The probability of that point under the distribution.

sample()

Return a random item sampled from this distribution.

Parameters:

n : int or None, optional

The number of samples to return. Default is None, which is to generate a single sample.

Returns:

sample : double or object

Returns a sample from the distribution of a type in the support of the distribution.

state_count()

Returns the number of states present in the model.

summarize()

Summarize a batch of data and store the sufficient statistics.

This will partition the dataset into columns which belong to their appropriate distribution. If the distribution has parents, then multiple columns are sent to the distribution. This relies mostly on the summarize function of the underlying distribution.

Parameters:

items : array-like, shape (n_samples, n_nodes)

The data to train on, where each row is a sample and each column corresponds to the associated variable.

weights : array-like, shape (n_samples,), optional

The weight of each sample as a positive double. Default is None.

Returns:

None
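summarize pairs with from_summaries for out-of-core updates; a minimal sketch on the Monty Hall network above, using made-up samples:

# Accumulate sufficient statistics from two batches of (guest, prize, monty) rows...
network.summarize([['A', 'B', 'C'], ['A', 'C', 'B']])
network.summarize([['B', 'A', 'C']])

# ...then apply a single MLE update using everything seen so far.
network.from_summaries()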

thaw()

Thaw the distribution, re-allowing updates to occur.

to_json()

Serialize the model to a JSON.

Parameters:

separators : tuple, optional

The two separators to pass to the json.dumps function for formatting.

indent : int, optional

The indentation to use at each level. Passed to json.dumps for formatting.

Returns:

json : str

A properly formatted JSON object.
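A minimal serialization round trip with the Monty Hall network above:

# Serialize the model to a JSON string and rebuild an equivalent, baked model.
s = network.to_json()
network2 = BayesianNetwork.from_json(s)

# The reconstructed model supports the same queries as the original.
print(network2.predict_proba({'guest': 'A', 'monty': 'B'}))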