GPU Usage

author: Jacob Schreiber contact: jmschreiber91@gmail.com

Because pomegranate models are all instances of torch.nn.Module, you can do anything with them that you could do with other PyTorch models. This includes moving them to GPUs, or to any other device that PyTorch supports, using exactly the same method calls. Here, we will see how to use GPUs to speed up training and inference.

[1]:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

%pylab inline
import seaborn; seaborn.set_style('whitegrid')

import torch

numpy.random.seed(0)
numpy.set_printoptions(suppress=True)

%load_ext watermark
%watermark -m -n -p numpy,scipy,torch,pomegranate

Populating the interactive namespace from numpy and matplotlib
numpy      : 1.23.4
scipy      : 1.9.3
torch      : 1.13.0
pomegranate: 1.0.0

Compiler    : GCC 11.2.0
OS          : Linux
Release     : 4.15.0-208-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 8
Architecture: 64bit

Overview

All models and methods in pomegranate can be sped up with GPUs using exactly the same method calls as other PyTorch models. Let’s see that in action.

[2]:

from pomegranate.distributions import Normal

X = torch.randn(5000, 5)
dist = Normal().fit(X)

dist.means, dist.covs

[2]:

(Parameter containing:
 tensor([-0.0236, -0.0205, -0.0137,  0.0102,  0.0062]),
 Parameter containing:
 tensor([[ 1.0031e+00, -7.6397e-04,  1.2268e-03,  1.8175e-02, -1.2175e-02],
         [-7.6397e-04,  1.0070e+00,  1.0800e-02,  1.4855e-02,  2.5852e-02],
         [ 1.2268e-03,  1.0800e-02,  1.0058e+00,  7.3037e-03, -1.7321e-03],
         [ 1.8175e-02,  1.4855e-02,  7.3037e-03,  1.0094e+00, -7.5648e-03],
         [-1.2175e-02,  2.5852e-02, -1.7321e-03, -7.5648e-03,  9.8117e-01]]))

All we need to do is call the .cuda() method or the .to(device) method on both the data and the model. As in PyTorch, the model will not automatically move data to the GPU for you; you have to do this yourself.

[3]:

dist = Normal().cuda().fit(X.cuda())

dist.means, dist.covs

[3]:

(Parameter containing:
 tensor([-0.0236, -0.0205, -0.0137,  0.0102,  0.0062], device='cuda:0'),
 Parameter containing:
 tensor([[ 1.0031e+00, -7.6398e-04,  1.2268e-03,  1.8175e-02, -1.2175e-02],
         [-7.6398e-04,  1.0070e+00,  1.0800e-02,  1.4855e-02,  2.5852e-02],
         [ 1.2268e-03,  1.0800e-02,  1.0058e+00,  7.3037e-03, -1.7321e-03],
         [ 1.8175e-02,  1.4855e-02,  7.3037e-03,  1.0094e+00, -7.5648e-03],
         [-1.2175e-02,  2.5852e-02, -1.7321e-03, -7.5648e-03,  9.8117e-01]],
        device='cuda:0'))
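The .to(device) form mentioned above can be written so that the same code runs with or without a GPU. The following is a minimal sketch, not taken from the original notebook: it uses torch.nn.Linear as a stand-in for a pomegranate model (an assumption made only so the example needs nothing beyond PyTorch), and the names device, same_model, X_device, and y_cpu are illustrative.

```python
import torch

# Use a GPU when one is visible to PyTorch; otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in for a pomegranate model: any torch.nn.Module moves the same way.
model = torch.nn.Linear(5, 2)
X = torch.randn(5000, 5)

# Module.to() moves the parameters in place and returns the module itself.
same_model = model.to(device)

# Tensor.to() returns a tensor on the target device, so keep the result.
X_device = X.to(device)

# Model and data now live on the same device, so the call succeeds.
y = model(X_device)

# Results stay on that device until they are explicitly moved back.
y_cpu = y.detach().cpu()

print(same_model is model, y.device.type, y_cpu.device.type)
```

Under the same assumption, the pomegranate equivalent of the cell above would presumably read Normal().to(device).fit(X.to(device)).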

All models operate in the same way.

[4]:

from pomegranate.gmm import GeneralMixtureModel

model = GeneralMixtureModel([Normal(), Normal()], max_iter=5, verbose=True).cuda()
model.fit(X.cuda())

[1] Improvement: 134.63671875, Time: 0.001251s
[2] Improvement: 38.76953125, Time: 0.001229s
[3] Improvement: 17.02734375, Time: 0.001173s
[4] Improvement: 9.13671875, Time: 0.001179s

[4]:

GeneralMixtureModel(
  (distributions): ModuleList(
    (0-1): 2 x Normal()
  )
)

Timing Examples

Using a GPU helps the most when the workload is complex, so we will see only minimal gains on small, simple workloads like basic probability distributions. For these evaluations we will time three settings: on the CPU, on the GPU including the time to transfer everything there, and on the GPU once everything is already there. All evaluations are done on an A100.

[5]:

n, d = 100, 5

X = torch.randn(n, d)
X_cuda = X.cuda()

mu = torch.randn(d)
cov = torch.exp(torch.randn(d))

d = Normal(mu, cov, covariance_type='diag')
d_cuda = Normal(mu, cov, covariance_type='diag').cuda()

%timeit d.log_probability(X)
%timeit d.cuda().log_probability(X.cuda())
%timeit d_cuda.log_probability(X_cuda)

28.4 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
91.3 µs ± 361 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
54.5 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For this extremely small workload, using a GPU is roughly two to three times slower than just doing everything on the CPU. Let’s try a larger example.

[6]:

n, d = 10000, 50

X = torch.randn(n, d)
X_cuda = X.cuda()

mu = torch.randn(d)
cov = torch.exp(torch.randn(d))

d = Normal(mu, cov, covariance_type='diag')
d_cuda = Normal(mu, cov, covariance_type='diag').cuda()

%timeit d.log_probability(X)
%timeit d.cuda().log_probability(X.cuda())
%timeit d_cuda.log_probability(X_cuda)

131 µs ± 7.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
54.2 µs ± 68.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Looks like the CPU time increased several times over but the GPU times didn’t change as much. This is because using a GPU has a relatively large fixed cost, while the variable cost associated with increasing the size of the data grows much more slowly. Let’s try a huge example.

[7]:

n, d = 100000, 5000

X = torch.randn(n, d)
X_cuda = X.cuda()

mu = torch.randn(d)
cov = torch.exp(torch.randn(d))

d = Normal(mu, cov, covariance_type='diag')
d_cuda = Normal(mu, cov, covariance_type='diag').cuda()

%timeit d.log_probability(X)
%timeit d.cuda().log_probability(X.cuda())
%timeit d_cuda.log_probability(X_cuda)

844 ms ± 90 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
213 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.7 ms ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can see the expected results here: despite having to transfer the data to the GPU, doing so is about four times faster than using the CPU for large data sets, and if you’re in a setting where everything is already on the GPU you can get huge speed boosts (roughly 70 times here).
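The comparison across the three workload sizes can be made explicit by dividing the CPU time by each GPU time. The sketch below only re-uses the timings printed in the cells above, converted to seconds; the dictionary layout and the names timings and speedups are illustrative, not part of pomegranate.

```python
# Timings reported above, in seconds, keyed by (n, d):
# (CPU, GPU including transfer, GPU with everything already resident).
timings = {
    (100, 5): (28.4e-6, 91.3e-6, 54.5e-6),
    (10_000, 50): (131e-6, 256e-6, 54.2e-6),
    (100_000, 5_000): (844e-3, 213e-3, 11.7e-3),
}

speedups = {}
for shape, (cpu, gpu_transfer, gpu_resident) in timings.items():
    # A ratio above 1 means the GPU variant is faster than the CPU.
    speedups[shape] = (cpu / gpu_transfer, cpu / gpu_resident)
    print(shape, round(cpu / gpu_transfer, 2), round(cpu / gpu_resident, 2))
```

Ratios below 1 for the smallest workload reflect the fixed cost of using the GPU; only at the largest size does the transfer-inclusive variant overtake the CPU.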

Now, if you have a more complicated model, you can unlock even larger speed boosts.

[8]:

from pomegranate.kmeans import KMeans

model1 = KMeans(512)
model2 = KMeans(512)

%timeit -n 1 -r 1 model1.fit(X)
%timeit -n 1 -r 1 model2.cuda().fit(X.cuda())

10.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
643 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

[9]:

del X, model1, model2

The GPU fit is roughly 16 times faster (643 ms versus 10.1 s), even though that time includes transferring the model and data.

Now, let’s try with an even more complex model: the dense hidden Markov model.

[10]:

from pomegranate.hmm import DenseHMM

n, l, d = 1000, 25, 15
X = torch.randn(n, l, d)

k = 256

dists1, dists2 = [], []
for i in range(k):
    mu = torch.randn(d)
    covs = torch.exp(torch.randn(d))

    dist1 = Normal(mu, covs, covariance_type='diag')
    dist2 = Normal(mu, covs, covariance_type='diag').cuda()

    dists1.append(dist1)
    dists2.append(dist2)


model1 = DenseHMM(dists1, max_iter=3)
model2 = DenseHMM(dists2, max_iter=3).cuda()

X_cuda = X.cuda()

%timeit -n 1 -r 1 model1.fit(X)
%timeit -n 1 -r 1 model2.fit(X_cuda)

8.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
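When reproducing timings like these outside of %timeit, it may be worth keeping in mind that PyTorch launches CUDA kernels asynchronously, so a call can return before the GPU has finished its work. The sketch below is one way to account for that by calling torch.cuda.synchronize() around the timed region. The helper timed, its n_repeats parameter, and the toy workload are all hypothetical and not part of pomegranate; the notebook itself does not state whether synchronization would change the numbers shown above.

```python
import time

import torch


def timed(fn, n_repeats=10):
    """Return mean wall-clock seconds per call of fn, waiting for GPU work to finish."""
    use_cuda = torch.cuda.is_available()

    fn()  # warm-up call so one-time setup costs are not measured
    if use_cuda:
        torch.cuda.synchronize()  # make sure the warm-up has completed

    start = time.perf_counter()
    for _ in range(n_repeats):
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock

    return (time.perf_counter() - start) / n_repeats


# Toy workload on whichever device is available (falls back to the CPU).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
X = torch.randn(10000, 50, device=device)

elapsed = timed(lambda: X.square().sum(dim=1))
print(f"{elapsed * 1e6:.1f} microseconds per call on {device}")
```

Any callable can be passed in place of the lambda, for example one that calls a model's fit or log_probability method on data already placed on the device.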