Out-of-Core Learning

author: Jacob Schreiber
contact: jmschreiber91@gmail.com

Out-of-core learning refers to training a model on an amount of data that cannot fit in memory. Several approaches can be described as out-of-core, but here we mean the ability to derive exact updates to a model from a massive data set despite never holding the entire data set in memory at once.

This out-of-core learning approach is implemented for all of pomegranate’s models using two methods. The first is the summarize method, which takes in a batch of data and reduces it down to additive sufficient statistics. Because these statistics are additive, each subsequent call adds the new statistics to those already stored; once the entire data set has been seen, the stored sufficient statistics are identical to those that would have been derived from the entire data set at once. The second is the from_summaries method, which uses the stored sufficient statistics to derive a parameter update for the model.
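As a rough sketch of this pattern (the data_batches generator here is hypothetical and stands in for whatever chunked loading your data requires):

from pomegranate.distributions import Normal

dist = Normal()

for X_batch in data_batches():  # hypothetical generator yielding one chunk of the data at a time
    dist.summarize(X_batch)     # reduce the chunk to additive sufficient statistics

dist.from_summaries()           # one exact parameter update from the accumulated statistics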

A common solution to having too much data is to randomly select a subset that does fit in memory and use it in place of the full data set. While simple to implement, this approach is likely to yield a lower-performing model because it is exposed to less data. By using out-of-core learning, one can instead train a model on a massive amount of data without being limited by the amount of memory the computer has.

[1]:
%pylab inline
import torch

numpy.random.seed(0)
numpy.set_printoptions(suppress=True)

%load_ext watermark
%watermark -m -n -p torch,pomegranate
Populating the interactive namespace from numpy and matplotlib
torch      : 1.13.0
pomegranate: 1.0.0

Compiler    : GCC 11.2.0
OS          : Linux
Release     : 4.15.0-208-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 8
Architecture: 64bit

summarize and from_summaries

Let’s start off simple by training a normal distribution in an out-of-core manner. First, we’ll generate some random data.

[2]:
X = torch.randn(1000, 5)

Then, we can initialize a distribution.

[3]:
from pomegranate.distributions import Normal

dist = Normal()

Now let’s summarize a few batches of data using the summarize method.

[4]:
dist.summarize(X[:200])
dist.summarize(X[200:])

Importantly, summarizing data doesn’t update parameters by itself. Rather, it extracts additive sufficient statistics from the data. Each time summarize is called, these statistics are added to the previously aggregated statistics.

In order to update the parameters of the model, you need to call the from_summaries method. This method updates the parameters of the model given the stored sufficient statistics.

[5]:
dist.from_summaries()
dist.means, dist.covs
[5]:
(Parameter containing:
 tensor([ 0.0175,  0.0096,  0.0228,  0.0592, -0.0089]),
 Parameter containing:
 tensor([[ 0.9786, -0.0106,  0.0344,  0.0571,  0.0330],
         [-0.0106,  0.9970,  0.0165, -0.0330,  0.0021],
         [ 0.0344,  0.0165,  0.9405, -0.0075, -0.0374],
         [ 0.0571, -0.0330, -0.0075,  1.0399,  0.0333],
         [ 0.0330,  0.0021, -0.0374,  0.0333,  0.9978]]))

This update is exactly the same as the one we would have gotten by training on the entire data set at once.

[6]:
dist = Normal()
dist.summarize(X)
dist.from_summaries()
dist.means, dist.covs
[6]:
(Parameter containing:
 tensor([ 0.0175,  0.0096,  0.0228,  0.0592, -0.0089]),
 Parameter containing:
 tensor([[ 0.9786, -0.0106,  0.0344,  0.0571,  0.0330],
         [-0.0106,  0.9970,  0.0165, -0.0330,  0.0021],
         [ 0.0344,  0.0165,  0.9405, -0.0075, -0.0374],
         [ 0.0571, -0.0330, -0.0075,  1.0399,  0.0333],
         [ 0.0330,  0.0021, -0.0374,  0.0333,  0.9978]]))

Batched Training

Sometimes your data is so large that it cannot fit in memory (either CPU or GPU). In these cases, we can use the out-of-core API to train on one batch at a time. This is similar to how neural networks are trained, except that rather than updating after each batch (or aggregating gradients over a small number of batches), we can summarize over a much larger number of batches, potentially even the entire data set, to get an exact update. Let’s see an example of how that might work.

[7]:
dist = Normal()

for i in range(10):
    X_batch = torch.randn(1000, 20) # This is meant to mimic loading a batch of data
    dist.summarize(X_batch)
    del X_batch # Now we can discard the batch

dist.from_summaries()
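
In a real workflow the batches would usually be read from disk rather than generated with torch.randn. A minimal sketch of that, assuming the data is stored as a raw float32 binary file (the filename, shape, and batch size below are hypothetical):

import numpy
import torch

from pomegranate.distributions import Normal

# Hypothetical raw float32 file containing 1,000,000 rows of 20 columns
X_disk = numpy.memmap("massive_dataset.dat", dtype="float32", mode="r", shape=(1000000, 20))

dist = Normal()
for start in range(0, X_disk.shape[0], 10000):
    # Copy one 10,000-row chunk into memory and hand it to summarize
    X_batch = torch.from_numpy(numpy.array(X_disk[start:start+10000]))
    dist.summarize(X_batch)

dist.from_summaries()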

Batched training is easy to implement for simple probability distributions, but it can also be done with more complicated models if you are willing to write your own expectation-maximization loop. For instance, let’s try training a mixture model using a modified version of the training code.

[8]:
import time

from pomegranate.gmm import GeneralMixtureModel

X = torch.randn(10000, 20)

model = GeneralMixtureModel([Normal(), Normal()])

logp = None
for i in range(5):
    start_time = time.time()

    last_logp = logp

    logp = 0
    for j in range(0, X.shape[0], 1000): # Train on batches of size 1000
        logp += model.summarize(X[j:j+1000])

    if i > 0:
        improvement = logp - last_logp
        duration = time.time() - start_time
        print("[{}] Improvement: {}, Time: {:4.4}s".format(i, improvement, duration))

    model.from_summaries()
[1] Improvement: 1945.53125, Time: 0.01443s
[2] Improvement: 99.875, Time: 0.01562s
[3] Improvement: 34.1875, Time: 0.01019s
[4] Improvement: 17.65625, Time: 0.00994s
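
In practice you would usually run this loop until the improvement falls below a tolerance rather than for a fixed number of iterations. A minimal sketch of that stopping rule, built on the same batched loop as above (the tolerance value here is arbitrary):

tol = 0.1  # arbitrary stopping threshold on the improvement in log probability

model = GeneralMixtureModel([Normal(), Normal()])

logp = None
while True:
    last_logp = logp

    logp = 0
    for j in range(0, X.shape[0], 1000):  # summarize batches of size 1000
        logp += model.summarize(X[j:j+1000])

    model.from_summaries()

    if last_logp is not None and logp - last_logp < tol:
        break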