Bayesian Networks

author: Jacob Schreiber contact: jmschreiber91@gmail.com

Bayesian networks are general-purpose probabilistic models that are a superset of all the others presented in pomegranate. Specifically, a Bayesian network is a way of factorizing a joint probability distribution across a graph structure, where the presence of an edge represents a directed dependency between two variables and the lack of an edge represents a conditional independence. Importantly, the lack of an edge does not by itself represent true independence between two variables, only a conditional independence.
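
Concretely, if the variables are X_1, ..., X_n, the joint distribution factorizes as P(X_1, ..., X_n) = P(X_1 | parents(X_1)) * ... * P(X_n | parents(X_n)). For the simple parent/child network built below, this reduces to P(X_1, X_2) = P(X_1) P(X_2 | X_1).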

[1]:
%pylab inline
import seaborn; seaborn.set_style('whitegrid')
import torch

%load_ext watermark
%watermark -m -n -p torch,pomegranate
Populating the interactive namespace from numpy and matplotlib
torch      : 1.13.0
pomegranate: 1.0.0

Compiler    : GCC 11.2.0
OS          : Linux
Release     : 4.15.0-208-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 8
Architecture: 64bit

Initialization and Fitting

Similar to the hidden Markov model, the Bayesian network is composed of a set of distributions and a graph structure connecting them. In this case, the graph is just a set of directed, unweighted edges. Bayesian networks usually require that this graph be acyclic. However, because pomegranate uses a factor graph to do inference, there is no strict requirement that this is the case. See the inference sections below.

Likewise, similar to the other models in pomegranate, a Bayesian network can be learned in its entirety from data. However, exact structure learning is intractable and so the field has developed a variety of approximations. See the Bayesian network structure learning tutorial for more.
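
As a minimal sketch of what learning everything from data looks like (assuming the algorithm argument accepts 'chow-liu', one of the approximate methods covered in that tutorial):

import numpy
from pomegranate.bayesian_network import BayesianNetwork

X_sketch = numpy.random.randint(2, size=(100, 3))

# No distributions or structure are provided, so fit() runs structure learning
# before estimating the conditional probability tables.
model_learned = BayesianNetwork(algorithm='chow-liu')
model_learned.fit(X_sketch)
model_learned._parents   # the learned parent set for each variable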

For now, let’s implement the simplest Bayesian network: a child and a parent.

[2]:
from pomegranate.distributions import Categorical
from pomegranate.distributions import ConditionalCategorical
from pomegranate.bayesian_network import BayesianNetwork

d1 = Categorical([[0.1, 0.9]])
d2 = ConditionalCategorical([[[0.4, 0.6], [0.3, 0.7]]])

model = BayesianNetwork([d1, d2], [(d1, d2)])

Alternatively, one can use the add_distributions and add_edge methods to create the network programmatically.

[3]:
model2 = BayesianNetwork()
model2.add_distributions([d1, d2])
model2.add_edge(d1, d2)

Once these models are initialized with a structure, they can be fit to data.

[4]:
X = numpy.random.randint(2, size=(10, 2))
X
[4]:
array([[0, 0],
       [0, 0],
       [0, 1],
       [0, 0],
       [0, 1],
       [0, 1],
       [0, 0],
       [1, 1],
       [1, 0],
       [0, 1]])
[5]:
model.fit(X)
[5]:
BayesianNetwork(
  (distributions): ModuleList(
    (0): Categorical()
    (1): ConditionalCategorical(
      (probs): ParameterList(  (0): Parameter containing: [torch.float32 of size 2x2])
      (_w_sum): [tensor([0., 0.])]
      (_xw_sum): [tensor([[0., 0.],
              [0., 0.]])]
      (_log_probs): [tensor([[-0.6931, -0.6931],
              [-0.6931, -0.6931]])]
    )
  )
  (_factor_graph): FactorGraph(
    (factors): ModuleList(
      (0): Categorical()
      (1): JointCategorical()
    )
    (marginals): ModuleList(
      (0): Categorical()
      (1): Categorical()
    )
  )
)

Next, we can check that the learned parameters are correct.

[6]:
model.distributions[0].probs, X[:,0].mean()
[6]:
(Parameter containing:
 tensor([[0.8000, 0.2000]]),
 0.2)

It looks like the model learned a marginal distribution where the probability of category 1 is equal to the mean of the first column, which is correct.

We can also look at the conditional probability table and compare it against the probability that the second column is 1 (computed by taking its mean) when the first column is 0.

[7]:
model.distributions[1].probs[0], (X[X[:,0] == 0][:,1]).mean()
[7]:
(Parameter containing:
 tensor([[0.5000, 0.5000],
         [0.5000, 0.5000]]),
 0.5)

Also looks correct.

Finally, if we know the structure of the Bayesian network that we’d like to learn but don’t know the parameters of the tables, we can pass the structure into the initialization and call the fit function.

[8]:
model3 = BayesianNetwork(structure=[(), (0,)])
model3.fit(X)

model3.distributions[1].probs[0]
[8]:
Parameter containing:
tensor([[0.5000, 0.5000],
        [0.5000, 0.5000]])

Probabilities and Log Probabilities

Much like other generative models, Bayesian networks can calculate the probabilities of examples given the distributions and graph structure. Because the Bayesian network is just a factorization of the joint probability table along the graph structure, the probability of an example is just the product of the probability of each variable given its parents.

[9]:
model.probability(X)
[9]:
tensor([0.4000, 0.4000, 0.4000, 0.4000, 0.4000, 0.4000, 0.4000, 0.1000, 0.1000,
        0.4000])
[10]:
model.log_probability(X)
[10]:
tensor([-0.9163, -0.9163, -0.9163, -0.9163, -0.9163, -0.9163, -0.9163, -2.3026,
        -2.3026, -0.9163])
[11]:
# Manually computing the same values: log P(x1) + log P(x2 | x1)
model.distributions[0].log_probability(X[:,:1]) + model.distributions[1].log_probability(X[:, :, None])
[11]:
tensor([-0.9163, -0.9163, -0.9163, -0.9163, -0.9163, -0.9163, -0.9163, -2.3026,
        -2.3026, -0.9163])

As with other models in pomegranate, the inputs can be lists, numpy arrays, or torch tensors.
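
For example, the model fit above returns the same values whether the batch is a plain Python list or a torch tensor:

X_list = [[0, 0], [1, 1]]

model.log_probability(X_list)                # list input
model.log_probability(torch.tensor(X_list))  # equivalent tensor input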

Prediction

Perhaps the most useful application of a learned Bayesian network is the ability to do inference for missing values. Rather than a traditional prediction problem, which has a fixed set of inputs and one or more fixed outputs, Bayesian network inference uses whichever variables are observed to infer the variables that are not. The set of observed variables can change across examples and so does not need to be known in advance.

In pomegranate, this is done using the loopy belief propagation algorithm, sometimes also called the “sum-product” algorithm. This algorithm is run on a factor graph, which is constructed in the backend. Compared with standard junction-tree inference, loopy belief propagation is faster, easier to implement, exact for tree-like Bayesian networks, and can provide estimates even for cyclic networks; however, the inference is not guaranteed to be exact for other structures, or even to converge when the network is cyclic.

The prediction methods are implemented slightly differently than in other models in pomegranate. Unobserved variables are indicated using a torch.masked.MaskedTensor object, which holds the underlying data and a mask where True means the value is observed and False means that it is not. When the mask is False, it does not matter what the underlying value is.

[12]:
X_torch = torch.tensor(X[:4])
mask = torch.tensor([[True, False],
                     [False, True],
                     [True, True],
                     [False, False]])

X_masked = torch.masked.MaskedTensor(X_torch, mask=mask)
/home/jmschr/anaconda3/lib/python3.9/site-packages/torch/masked/maskedtensor/core.py:156: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  warnings.warn(("The PyTorch API of MaskedTensors is in prototype stage "

If you have already set missing values in your tensor to some sentinel value, such as -1, you can simply build the mask from the data, e.g. torch.masked.MaskedTensor(X, mask=(X != -1)).
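
As a short sketch of that pattern, with -1 as a hypothetical sentinel for missing entries:

# -1 marks the unobserved entries in this hypothetical encoding
X_missing = torch.tensor([[ 0, -1],
                          [-1,  1]])

X_missing_masked = torch.masked.MaskedTensor(X_missing, mask=(X_missing != -1))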

Once you’ve created your MaskedTensor, you can pass it into the predict, predict_proba, or predict_log_proba methods of your Bayesian network. Note that observed values appear in the output with a probability of 1; probability distributions are only returned for the unobserved variables.

[13]:
model.predict_proba(X_masked)
[13]:
[tensor([[1.0000, 0.0000],
         [0.8000, 0.2000],
         [1.0000, 0.0000],
         [0.8000, 0.2000]]),
 tensor([[0.5000, 0.5000],
         [1.0000, 0.0000],
         [0.0000, 1.0000],
         [0.5000, 0.5000]])]

You might notice that the output from these functions has a different shape than that of other methods. Because there is no guarantee that all variables have the same number of categories, pomegranate cannot return a single tensor where one of the dimensions is the number of categories. Instead, pomegranate returns a list of tensors, where each element in the list corresponds to one variable and has shape (n_examples, n_categories) for that variable's number of categories. In principle, one could return a single tensor of size (n_examples, n_dimensions, max_n_categories), where max_n_categories is the maximum number of categories across all dimensions, but the unnecessary categories would likely be sliced out anyway, and a single variable with a very large number of categories would massively increase the amount of memory needed.
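
For example, the per-variable shapes can be checked directly:

y_proba = model.predict_proba(X_masked)

[t.shape for t in y_proba]   # one (n_examples, n_categories) tensor per variable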

[14]:
model.predict_log_proba(X_masked)
[14]:
[tensor([[ 0.0000,    -inf],
         [-0.2231, -1.6094],
         [ 0.0000,    -inf],
         [-0.2231, -1.6094]]),
 tensor([[-0.6931, -0.6931],
         [ 0.0000,    -inf],
         [   -inf,  0.0000],
         [-0.6931, -0.6931]])]

Unlike the previous two methods, the predict method preserves the shape of the original data but replaces the missing values, according to the mask, with the most likely value from the predict_proba method.

[15]:
model.predict(X_masked)
[15]:
tensor([[0, 0],
        [0, 0],
        [0, 1],
        [0, 0]])

Summarization

Summarization for Bayesian networks works in exactly the same way as for other models. When given complete data, summary statistics that can be added together across batches are extracted from the data and used to derive maximum likelihood estimates. This means that Bayesian networks can be learned in an out-of-core manner.

Importantly, the Bayesian network must already have a structure, or it will not know what statistics to calculate per batch. This is because structure learning is not implemented in an out-of-core manner and would otherwise have to be done on the first batch. If this is what you want to do, then you should use fit on the first batch and then summarize on successive batches.
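
A minimal sketch of that pattern, assuming X_batches is a hypothetical list of data batches:

# structure learning happens only on the first batch, via fit()
model_ooc = BayesianNetwork()
model_ooc.fit(X_batches[0])

# successive batches only update the additive summary statistics
for X_batch in X_batches[1:]:
    model_ooc.summarize(X_batch)

model_ooc.from_summaries()   # derive the parameters from the accumulated statistics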

[16]:
X = torch.randint(2, size=(20, 4))

d1 = Categorical([[0.1, 0.9]])
d2 = Categorical([[0.3, 0.7]])
d3 = ConditionalCategorical([[[0.4, 0.6], [0.3, 0.7]]])
d4 = ConditionalCategorical([[[[0.8, 0.2], [0.1, 0.9]], [[0.1, 0.9], [0.5, 0.5]]]])

model = BayesianNetwork([d1, d2, d3, d4], [(d1, d3), (d2, d4), (d3, d4)])
model.summarize(X[:5])
model.summarize(X[5:])
model.from_summaries()

After fitting, we can check what the learned parameters are.

[17]:
model.distributions[0].probs, model.distributions[1].probs
[17]:
(Parameter containing:
 tensor([[0.6000, 0.4000]]),
 Parameter containing:
 tensor([[0.5500, 0.4500]]))

And we can compare them to the averages from the data; for the first two dimensions, which have no parents, the probability of category 1 should equal the corresponding column mean.

[18]:
torch.mean(X.type(torch.float32), dim=0)
[18]:
tensor([0.4000, 0.4500, 0.4500, 0.6000])

Sampling

Bayesian networks are generative models, so new examples can be sampled from them. Below, we draw one million samples from the fit model and check that the sampled frequencies match the learned probability tables, both for the root variables and for the conditional ones.

[19]:
X_sample = model.sample(1000000).type(torch.float32)
[20]:
model._parents
[20]:
[(), (), (0,), (1, 2)]
[21]:
model.distributions[0].probs
[21]:
Parameter containing:
tensor([[0.6000, 0.4000]])
[22]:
model.distributions[1].probs
[22]:
Parameter containing:
tensor([[0.5500, 0.4500]])
[23]:
X_sample[:, :2].mean(dim=0)
[23]:
tensor([0.4006, 0.4501])
[24]:
model.distributions[2].probs[0]
[24]:
Parameter containing:
tensor([[0.4167, 0.5833],
        [0.7500, 0.2500]])
[25]:
X_sample[X_sample[:, 0] == 0, 2].mean(), X_sample[X_sample[:, 0] == 1, 2].mean()
[25]:
(tensor(0.5836), tensor(0.2500))
[26]:
model.distributions[3].probs[0]
[26]:
Parameter containing:
tensor([[[0.5000, 0.5000],
         [0.2000, 0.8000]],

        [[0.4000, 0.6000],
         [0.5000, 0.5000]]])
[27]:
X_sample[(X_sample[:, 1] == 1) & (X_sample[:, 2] == 0), 3].mean()
[27]:
tensor(0.5998)