Markov Networks¶

Markov networks (sometimes called Markov random fields) are probabilistic models that are typically represented using an undirected graph. Each node in the graph represents a variable in the data and each edge represents an association between two variables. Unlike Bayesian networks, which have directed edges, Markov networks have undirected edges and encode only associations.

Currently, pomegranate only supports discrete Markov networks, meaning that the values must be categories, e.g. ‘apples’ and ‘oranges’, or 1 and 2, where 1 and 2 refer to categories rather than numbers, and so 2 is not ‘bigger’ than 1.

Initialization¶

Markov networks can be initialized in two ways, depending on whether the underlying graphical structure is known or not: (1) a list of the joint probabilities tables can be passed into the initialization, with one table per clique in the graph, or (2) both the graphical structure and distributions can be learned directly from data. This mirrors the other models that are implemented in pomegranate. However, because finding the optimal Markov network requires enumerating a number of potential graphs that is exponential with the number of dimensions in the data, it can be fairly time intensive to find the exact network.

Let’s see an example of creating a Markov network with three cliques in it.

from pomegranate import *

d1 = JointProbabilityTable([
[0, 0, 0.1],
[0, 1, 0.2],
[1, 0, 0.4],
[1, 1, 0.3]], [0, 1])

d2 = JointProbabilityTable([
[0, 0, 0, 0.05],
[0, 0, 1, 0.15],
[0, 1, 0, 0.07],
[0, 1, 1, 0.03],
[1, 0, 0, 0.12],
[1, 0, 1, 0.18],
[1, 1, 0, 0.10],
[1, 1, 1, 0.30]], [1, 2, 3])

d3 = JointProbabilityTable([
[0, 0, 0, 0.08],
[0, 0, 1, 0.12],
[0, 1, 0, 0.11],
[0, 1, 1, 0.19],
[1, 0, 0, 0.04],
[1, 0, 1, 0.06],
[1, 1, 0, 0.23],
[1, 1, 1, 0.17]], [2, 3, 4])

model = MarkovNetwork([d1, d2, d3])
model.bake()


That was fairly simple. Each JointProbabilityTable object just had to include the table of all values that the variables can take as well as a list of variable indexes that are included in the table, in the order from left to right that they appear. For example, in d1, the first column of the table corresponds to the first column of data in a data matrix and the second column in the table corresponds to the second column in a data matrix.
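To make this column mapping concrete, here is a plain-Python sketch (a dict standing in for a joint probability table, not the pomegranate API) using the values of d2 above, which covers columns 1, 2, and 3 of the data matrix:

```python
# Hypothetical sketch: a joint probability table as a dict, covering
# columns 1, 2 and 3 of a data matrix (the values mirror d2 above).
table = {
    (0, 0, 0): 0.05, (0, 0, 1): 0.15,
    (0, 1, 0): 0.07, (0, 1, 1): 0.03,
    (1, 0, 0): 0.12, (1, 0, 1): 0.18,
    (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}
columns = (1, 2, 3)  # which data columns this clique covers

sample = [1, 0, 1, 0, 1]            # one row of a data matrix
key = tuple(sample[c] for c in columns)  # pick out columns 1, 2, 3 -> (0, 1, 0)
print(table[key])                   # 0.07
```

The ordering of the index list is what ties each column of the table to a column of the data.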

One can also initialize a Markov network directly from data. Currently, the only algorithm that pomegranate supports for this is the Chow-Liu tree-building algorithm. This algorithm first calculates the mutual information between all pairs of variables and then finds a maximum spanning tree over those scores. This process generally captures the strongest dependencies in the data set. However, because it requires every variable to have at least one connection, it can produce instances where variables are incorrectly associated with each other. Overall, it generally performs well and is fairly fast to calculate.

from pomegranate import *
import numpy

X = numpy.random.randint(2, size=(100, 6))
model = MarkovNetwork.from_samples(X)


Probability¶

The probability of an example under a Markov network is more difficult to calculate than under a Bayesian network. With a Bayesian network, one can simply multiply the probabilities of each variable given its parents to get the probability of the entire example. However, repeating this process for a Markov network (by plugging in the values of each clique and multiplying across all cliques) yields what is called the “unnormalized” probability: summed across all combinations of values that the variables can take, these values do not add up to 1.

Normalizing an “unnormalized” probability requires calculating the partition function. This function (frequently abbreviated Z) is simply the sum of the unnormalized probability over all combinations of values that the variables can take. Once it is calculated, one can divide the unnormalized probability by Z to get the normalized probability. The only problem is that calculating the partition function requires summing over a number of combinations that grows exponentially with the number of dimensions. You can read more about this in the tutorial.

If you have a small number of variables (<30) it shouldn’t be a problem to calculate the partition function and then the normalized probabilities.

>>> print(model.log_probability([1, 0, 1, 0, 1]))
-4.429966143312331
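
To make the normalization concrete, here is a brute-force sketch over a tiny hypothetical model (the two factor functions are made up for illustration): it computes the partition function Z by enumerating every combination of values, then normalizes one example.

```python
import itertools

# Hypothetical unnormalized model over three binary variables, written as
# a product of two pairwise factors. Any positive functions would work.
def factor_a(x0, x1):
    return [[1.0, 2.0], [3.0, 4.0]][x0][x1]

def factor_b(x1, x2):
    return [[5.0, 1.0], [1.0, 5.0]][x1][x2]

def unnormalized(x):
    # Plug the example's values into each factor and multiply across factors.
    return factor_a(x[0], x[1]) * factor_b(x[1], x[2])

# Partition function Z: sum the unnormalized value over every combination.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

# Normalized probability of one example.
p = unnormalized((1, 0, 1)) / Z
print(Z, p)  # 60.0 0.05
```

The enumeration in the `Z` line is exactly the step that becomes intractable as the number of variables grows.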


Prediction¶

Markov networks can be used to predict the values of missing variables given the observed values, a process called “inference.” In other predictive models there is typically a single label, or a fixed set of labels, that needs to be predicted. In contrast, with Markov (or Bayesian) networks, the missing values can be any subset of the variables, and the inference process will use all of the available data to impute those missing values. For example:

>>> print(model.predict([[None, 0, None, 1, None]]))
[[1, 0, 0, 1, 1]]
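
For intuition, this prediction can be reproduced by brute force using the three tables from the initialization example: enumerate every completion of the missing values and keep the one with the largest unnormalized probability. This is only feasible because the example is tiny; the model itself uses loopy belief propagation.

```python
import itertools

# The three clique tables from the example above, as plain dicts.
d1 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.3}  # columns (0, 1)
d2 = {(0, 0, 0): 0.05, (0, 0, 1): 0.15, (0, 1, 0): 0.07,
      (0, 1, 1): 0.03, (1, 0, 0): 0.12, (1, 0, 1): 0.18,
      (1, 1, 0): 0.10, (1, 1, 1): 0.30}                     # columns (1, 2, 3)
d3 = {(0, 0, 0): 0.08, (0, 0, 1): 0.12, (0, 1, 0): 0.11,
      (0, 1, 1): 0.19, (1, 0, 0): 0.04, (1, 0, 1): 0.06,
      (1, 1, 0): 0.23, (1, 1, 1): 0.17}                     # columns (2, 3, 4)

def unnormalized(x):
    # Multiply the clique values for this complete assignment.
    return d1[(x[0], x[1])] * d2[(x[1], x[2], x[3])] * d3[(x[2], x[3], x[4])]

evidence = [None, 0, None, 1, None]

# Enumerate every completion consistent with the evidence; keep the best.
best = max(
    (x for x in itertools.product([0, 1], repeat=5)
     if all(e is None or e == v for e, v in zip(evidence, x))),
    key=unnormalized,
)
print(list(best))  # [1, 0, 0, 1, 1]
```

The brute-force answer matches the model's prediction above.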


API Reference¶

class pomegranate.MarkovNetwork.MarkovNetwork

A Markov Network Model.

A Markov network is an undirected graph where nodes represent variables, edges represent associations between the variables, and the lack of an edge represents a conditional independence.

Parameters:
    distributions : list, tuple, or numpy.ndarray
        A collection of joint probability distributions that represent the cliques in the network.
    name : str, optional
        The name of the model. Default is None.
bake()

Finalize the topology of the underlying factor graph model.

Assign a numerical index to every clique and create the underlying factor graph model. This method must be called before any of the probability-calculating or inference methods because the probability calculating methods rely on the partition function and the inference methods rely on the factor graph.

Parameters:
    calculate_partition : bool, optional
        Whether to calculate the partition function. This is not necessary if the goal is simply to perform inference, but is required if the goal is to calculate the probability of examples under the model.

Returns:
    None
clear_summaries()

Clear the summary statistics stored in the object.

copy()

Return a deep copy of this distribution object.

This object will not be tied to any other distribution or connected in any form.

Parameters:
    None

Returns:
    distribution : Distribution
        A copy of the distribution with the same parameters.
fit()

Fit the model to data using MLE estimates.

Fit the model to the data by updating each of the components of the model, which are univariate or multivariate distributions. This uses a simple MLE estimate to update the distributions according to their summarize or fit methods.

This is a wrapper for the summarize and from_summaries methods.

Parameters:
    X : array-like, shape (n_samples, n_nodes)
        The data to train on, where each row is a sample and each column corresponds to the associated variable.
    weights : array-like, shape (n_samples), optional
        The weight of each sample as a positive double. Default is None.
    inertia : double, optional
        The inertia for updating the distributions, passed along to the distribution method. Default is 0.0.
    pseudocount : double, optional
        A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent 0-probability symbols if they don’t happen to occur in the data. Only affects models defined over discrete distributions. Default is 0.
    verbose : bool, optional
        Whether or not to print out improvement information over iterations. Only required if doing semisupervised learning. Default is False.
    calculate_partition : bool, optional
        Whether to calculate the partition function. This is not necessary if the goal is simply to perform inference, but is required if the goal is to calculate the probability of examples under the model.
    n_jobs : int
        The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns:
    self : MarkovNetwork
        The fit Markov network object with updated model parameters.
freeze()

Freeze the distribution, preventing updates from occurring.

from_json()

Read in a serialized Markov Network and return the appropriate object.

Parameters:
    s : str
        A JSON formatted string containing the model.

Returns:
    model : object
        A properly initialized and baked model.
from_samples()

Learn the structure of the network from data.

Find the structure of the network from data using a Markov structure learning score. This currently enumerates the exponential number of candidate structures and selects the best according to the score. Sample weights are supported. The score that is optimized is the minimum description length (MDL).

If not all states for a variable appear in the supplied data, this function cannot guarantee that the returned Markov network is optimal when ‘exact’ or ‘exact-dp’ is used. This is because the number of states for each node is derived only from the data provided, and the scoring function depends on the number of states of a variable.

Parameters:
    X : array-like, shape (n_samples, n_nodes)
        The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
    weights : array-like, shape (n_samples), optional
        The weight of each sample as a positive double. Default is None.
    algorithm : str, one of ‘chow-liu’, ‘greedy’, ‘exact’, ‘exact-dp’, optional
        The algorithm to use for learning the network. Default is ‘greedy’, which greedily attempts to find the best structure and frequently can identify the optimal structure. ‘exact’ uses DP/A* to find the optimal network, and ‘exact-dp’ tries to find the shortest path on the entire order lattice, which is more memory- and computationally expensive. ‘exact’ and ‘exact-dp’ should give identical results, with ‘exact-dp’ remaining an option mostly for debugging reasons. ‘chow-liu’ will return the optimal tree-like structure, which is a very fast approximation but not always the best network.
    max_parents : int, optional
        The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, there is no maximum on parents. Default is -1.
    root : int, optional
        For algorithms which require a single root (‘chow-liu’), this is the root that all edges point away from. The user may specify which column to use as the root. Default is the first column.
    constraint_graph : networkx.DiGraph or None, optional
        A directed graph showing valid parent sets for each variable. Each node is a set of variables, and edges represent which variables can be valid parents of those variables. The naive structure learning task is just all variables in a single node with a self edge, meaning that you know nothing about the structure ahead of time.
    pseudocount : double, optional
        A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent 0-probability symbols if they don’t happen to occur in the data. Default is 0.
    name : str, optional
        The name of the model. Default is None.
    reduce_dataset : bool, optional
        Given the discrete nature of these datasets, frequently a user will pass in a dataset that has many identical samples. It is time consuming to go through these redundant samples, and a far more efficient use of time to simply calculate a new dataset comprised of the subset of unique observed samples weighted by the number of times they occur in the dataset. This typically will speed up all algorithms, including when using a constraint graph. Default is True.
    n_jobs : int, optional
        The number of threads to use when learning the structure of the network. If a constraint graph is provided, this will parallelize the tasks as directed by the constraint graph. If one is not provided, it will parallelize the building of the parent graphs. Both cases provide large speed gains. Default is 1.

Returns:
    model : MarkovNetwork
        The learned Markov network.
from_structure()

Return a Markov network from a predefined structure.

Pass in the structure of the network as a tuple of tuples and get a fit network in return. The tuple should contain n tuples, with one for each node in the graph. Each inner tuple should be of the parents for that node. For example, a three node graph where both node 0 and 1 have node 2 as a parent would be specified as ((2,), (2,), ()).
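As a sketch of how such a structure tuple encodes an undirected graph, the following plain-Python snippet (not part of the pomegranate API) expands the example into an edge list:

```python
# Hypothetical sketch: expand a structure tuple into an undirected edge list.
structure = ((2,), (2,), ())  # nodes 0 and 1 each list node 2; node 2 lists none

edges = sorted(
    {tuple(sorted((child, parent)))        # undirected, so sort each pair
     for child, parents in enumerate(structure)
     for parent in parents}
)
print(edges)  # [(0, 2), (1, 2)]
```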

Parameters:
    X : array-like, shape (n_samples, n_nodes)
        The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
    structure : tuple of tuples
        The parents for each node in the graph. If a node has no parents, specify an empty tuple.
    weights : array-like, shape (n_samples), optional
        The weight of each sample as a positive double. Default is None.
    pseudocount : double, optional
        A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent 0-probability symbols if they don’t happen to occur in the data. Default is 0.

Returns:
    model : MarkovNetwork
        A Markov network with the specified structure.
from_summaries()

Use MLE on the stored sufficient statistics to train the model.

Parameters:
    inertia : double, optional
        The inertia for updating the distributions, passed along to the distribution method. Default is 0.0.
    pseudocount : double, optional
        A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent 0-probability symbols if they don’t happen to occur in the data. Default is 0.
    calculate_partition : bool, optional
        Whether to calculate the partition function. This is not necessary if the goal is simply to perform inference, but is required if the goal is to calculate the probability of examples under the model.

Returns:
    None
from_yaml()

Deserialize this object from its YAML representation.

log_probability()

Return the log probability of samples under the Markov network.

The log probability is just the sum of the log probabilities under each of the components minus the log partition function. This method returns a vector of log probabilities, one for each sample.

Parameters:
    X : array-like, shape (n_samples, n_dim)
        Each sample is a vector of points where each dimension represents the same variable as added to the graph originally. It doesn’t matter what the connections between these variables are, just that they are all ordered the same.
    n_jobs : int, optional
        The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.
    unnormalized : bool, optional
        Whether to return the unnormalized or normalized probabilities. The normalized probabilities require the partition function to have been calculated.

Returns:
    logp : numpy.ndarray or double
        The log probabilities of the samples if many, or the single log probability.
marginal()

Return the marginal probabilities of each variable in the graph.

This is equivalent to a pass of belief propagation on a graph where no data has been given. This will calculate the probability of each variable being in each possible emission when nothing is known.

Parameters:
    None

Returns:
    marginals : array-like, shape (n_nodes)
        An array of univariate distribution objects showing the marginal probability of each variable.
predict()

Predict missing values of a data matrix using MLE.

Impute the missing values of a data matrix using the maximally likely predictions according to the loopy belief propagation algorithm. Each example is run through the algorithm (via predict_proba) and the missing values are replaced with the maximally likely predicted emission.

Parameters:
    X : array-like, shape (n_samples, n_nodes)
        Data matrix to impute. Missing values must be either None (if lists) or np.nan (if numpy.ndarray). These values will be filled in with the maximally likely ones.
    max_iterations : int, optional
        Number of iterations to run loopy belief propagation for. Default is 100.
    n_jobs : int
        The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.

Returns:
    y_hat : numpy.ndarray, shape (n_samples, n_nodes)
        The data matrix with the missing values imputed.
predict_proba()

Returns the probabilities of each variable in the graph given evidence.

This calculates the marginal probability distributions for each state given the evidence provided through loopy belief propagation. Loopy belief propagation is an approximate algorithm which is exact for certain graph structures.

Parameters:
    X : dict or array-like, shape <= n_nodes
        The evidence supplied to the graph. This can either be a dictionary with keys being state names and values being the observed values (either the emissions or a distribution over the emissions), or an array with the values ordered according to the nodes’ incorporation in the graph and None for variables which are unknown. It can also be vectorized: a list of dictionaries can be passed in, where each dictionary is a single sample, or a list of lists, where each list is a single sample, both formatted as mentioned before. The preferred method is a numpy array.
    max_iterations : int, optional
        The number of iterations with which to do loopy belief propagation. Usually requires only 1. Default is 100.
    check_input : bool, optional
        Check to make sure that the observed symbol is a valid symbol for that distribution to produce. Default is True.
    n_jobs : int, optional
        The number of threads to use when parallelizing the job. This parameter is passed directly into joblib. Default is 1, indicating no parallelism.

Returns:
    y_hat : array-like, shape (n_samples, n_nodes)
        An array of univariate distribution objects showing the probabilities of each variable.
probability()

Return the probability of samples under the Markov network.

This is just a wrapper that exponentiates the result from the log probability method.

Parameters:
    X : array-like, shape (n_samples, n_dim)
        Each sample is a vector of points where each dimension represents the same variable as added to the graph originally. It doesn’t matter what the connections between these variables are, just that they are all ordered the same.
    n_jobs : int, optional
        The number of jobs to use to parallelize, either the number of threads or the number of processes to use. -1 means use all available resources. Default is 1.
    unnormalized : bool, optional
        Whether to return the unnormalized or normalized probabilities. The normalized probabilities require the partition function to have been calculated.

Returns:
    prob : numpy.ndarray or double
        The probabilities of the samples if many, or the single probability.
sample()

Return a random item sampled from this distribution.

Parameters:
    n : int or None, optional
        The number of samples to return. Default is None, which generates a single sample.

Returns:
    sample : double or object
        A sample from the distribution, of a type in the support of the distribution.
score()

Return the accuracy of the model on a data set.

Parameters:
    X : numpy.ndarray, shape (n, d)
        The values of the data set.
    y : numpy.ndarray, shape (n,)
        The labels of each value.
summarize()

Summarize a batch of data and store the sufficient statistics.

This will partition the dataset into columns which belong to their appropriate distribution. If the distribution has parents, then multiple columns are sent to the distribution. This relies mostly on the summarize function of the underlying distribution.

Parameters:
    X : array-like, shape (n_samples, n_nodes)
        The data to train on, where each row is a sample and each column corresponds to the associated variable.
    weights : array-like, shape (n_samples), optional
        The weight of each sample as a positive double. Default is None.

Returns:
    None
thaw()

Thaw the distribution, re-allowing updates to occur.

to_json()

Serialize the model to JSON.

Parameters:
    separators : tuple, optional
        The two separators to pass to the json.dumps function for formatting.
    indent : int, optional
        The indentation to use at each level. Passed to json.dumps for formatting.

Returns:
    json : str
        A properly formatted JSON object.
to_yaml()

Serialize the model to YAML for compactness.

pomegranate.MarkovNetwork.discrete_chow_liu_tree()

Find the Chow-Liu tree that spans a data set.

The Chow-Liu algorithm first calculates the mutual information between each pair of variables and then constructs a maximum spanning tree from those scores. This algorithm differs slightly from the one implemented for Bayesian networks because Bayesian networks are directed and need a node to be the root. In contrast, the structure here is undirected and so is a simple maximum spanning tree.
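
A minimal plain-Python sketch of this procedure (empirical mutual information followed by Kruskal's maximum spanning tree); this is an illustration, not pomegranate's implementation:

```python
import itertools
import math
from collections import Counter

def mutual_information(X, i, j):
    """Empirical mutual information between columns i and j of X (in nats)."""
    n = len(X)
    pij = Counter((row[i], row[j]) for row in X)
    pi = Counter(row[i] for row in X)
    pj = Counter(row[j] for row in X)
    return sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_edges(X):
    """Maximum spanning tree over pairwise mutual information (Kruskal)."""
    d = len(X[0])
    # Score every pair of columns, highest mutual information first.
    scored = sorted(((mutual_information(X, i, j), i, j)
                     for i, j in itertools.combinations(range(d), 2)),
                    reverse=True)
    parent = list(range(d))          # union-find forest

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = []
    for _, i, j in scored:           # greedily keep edges that join components
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Columns 0 and 1 are identical (high mutual information); column 2 is unrelated.
X = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
print(chow_liu_edges(X))  # [(0, 1), (1, 2)]
```

The strongly dependent pair (0, 1) is kept first; column 2 is then attached by some zero-information edge, illustrating how the tree constraint can force associations that are not in the data.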