Out of Core Learning

Sometimes the dataset we’d like to train on can’t fit in memory, but we’d still like to get an exact update. pomegranate supports out of core training for this case by allowing models to summarize batches of data into sufficient statistics, and then later use these sufficient statistics to get an exact update of the model parameters. This is done through the methods `model.summarize` and `model.from_summaries`. Let’s see an example of using them to update a normal distribution.

>>> from pomegranate import *
>>> import numpy
>>>
>>> a = NormalDistribution(1, 1)
>>> b = NormalDistribution(1, 1)
>>> X = numpy.random.normal(3, 5, size=(5000,))
>>>
>>> a.fit(X)
>>> a
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        3.012692830297519,
        4.972082359070984
    ],
    "name" :"NormalDistribution"
}
>>> for i in range(5):
...     b.summarize(X[i*1000:(i+1)*1000])
>>> b.from_summaries()
>>> b
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        3.01269283029752,
        4.972082359070983
    ],
    "name" :"NormalDistribution"
}

This is a simple example with a simple distribution, but all models and model stacks support this type of learning. Let’s next look at a simple hidden Markov model.

We can see that, before fitting to any data, the distributions in the corresponding states of the two models are equal. After fitting the first model to the data, they differ, as would be expected. After fitting the second model through `summarize`, the distributions become equal again, showing that summarization recovers an exact update.

It’s easy to see how one could use this to update models which don’t train using Expectation Maximization (EM), since those need only a single pass over the data. For models which do use EM, an iterative algorithm, there is a `fit` wrapper which will automatically load batches of data from a numpy memory map to train on at each iteration.
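The memory-map pattern can be sketched with plain numpy (rather than pomegranate's `fit` wrapper): open a `.npy` file with `mmap_mode='r'` so slices are read lazily from disk, then stream it batch by batch. Here the loop just accumulates a running mean to keep the example self-contained; in practice each slice would be passed to `model.summarize` instead.

```python
import numpy
import os
import tempfile

# Create a .npy file on disk, standing in for a dataset too large for memory.
path = os.path.join(tempfile.mkdtemp(), "data.npy")
numpy.save(path, numpy.random.normal(3, 5, size=(5000,)))

# Open it as a memory map: slices are read from disk only when accessed.
X = numpy.load(path, mmap_mode="r")

# Stream the data in batches; each slice could instead be passed to
# model.summarize(batch) to accumulate sufficient statistics.
n, total = 0.0, 0.0
for start in range(0, X.shape[0], 1000):
    batch = numpy.asarray(X[start:start + 1000])
    n += batch.shape[0]
    total += batch.sum()

# The streamed estimate matches the full-data mean.
print(numpy.isclose(total / n, X.mean()))
```

Only one batch of the array is materialized in memory at a time, which is what makes this work for datasets larger than RAM.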

FAQ

  1. What data storage types are able to be used with out of core training?

     Currently, only stored numpy arrays (`.npy` files) that can be read as memory maps using `numpy.load('data.npy', mmap_mode='r')` are supported for data that truly can’t be loaded into memory.

  2. Does out of core learning give exact or approximate updates?

     It gives exact updates. Sufficient statistics are collected for each of the batches, and together they are equal to the sufficient statistics that one would get from the full dataset.