Advanced data handling

We have already seen how to use batch learning in the Tutorial. Using itertools.repeat for simple batch learning may have seemed overly verbose; it makes more sense once we consider more advanced use cases.

Mini batches

Consider the case where we do not want to calculate the full gradient in each iteration but instead estimate it from a mini batch. The contract of climin is that each item of the args iterator is consumed in a single iteration and thus used for one parameter update. The way to enable mini batch learning is therefore to provide an args argument that is an infinite stream of mini batches.

Let’s revisit our example of logistic regression. Here, we created the args iterator using itertools.repeat on the same array again and again:

import itertools
args = itertools.repeat(([X, Z], {}))
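To see what this contract amounts to, here is a minimal hand-rolled sketch (NumPy only; the array shapes and the helper name minibatches are illustrative, not part of climin) of an infinite stream of mini batches, each item again being an (args, kwargs) pair:

```python
import itertools
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)   # inputs: 1000 samples, 10 features
Z = rng.randn(1000, 1)    # targets

def minibatches(X, Z, batch_size):
    """Yield ([x_batch, z_batch], {}) pairs forever, cycling over the data."""
    n_samples = X.shape[0]
    for start in itertools.cycle(range(0, n_samples, batch_size)):
        stop = start + batch_size
        yield ([X[start:stop], Z[start:stop]], {})

args = minibatches(X, Z, 100)
first_args, first_kwargs = next(args)
```

Each call to next consumes one item, which climin would use for exactly one parameter update.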

What we want to do now is to have an infinite stream of slices of X and Z. How do we access the n’th batch of X and Z? We offer you a convenience function that will give you random (with or without replacement) slices from a container:

batch_size = 100
args = ((i, {}) for i in climin.util.iter_minibatches([train_set_x, train_set_y], batch_size, [0, 0]))

The last argument, [0, 0], gives the axes along which to slice train_set_x and train_set_y, respectively. In some cases, samples might be aligned along axis 0 in the input, but along axis 1 in the target data.
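To illustrate what slicing along different axes means, here is a simplified NumPy-only sketch (the helper slice_along is hypothetical; the real climin.util.iter_minibatches additionally draws random slices):

```python
import numpy as np

X = np.arange(12).reshape(6, 2)   # samples along axis 0
Z = np.arange(6).reshape(1, 6)    # samples along axis 1

def slice_along(arr, axis, start, stop):
    """Take a slice from `arr` along the given axis only."""
    idx = [slice(None)] * arr.ndim
    idx[axis] = slice(start, stop)
    return arr[tuple(idx)]

# The equivalent of passing [0, 1]: slice X along axis 0, Z along axis 1.
x_batch = slice_along(X, 0, 0, 3)
z_batch = slice_along(Z, 1, 0, 3)
```

Both batches then refer to the same three samples, even though the samples run along different axes in the two arrays.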

External memory

What is nice about climin.util.iter_minibatches is that the only requirement it places on its arguments is that they support slicing. We therefore only need to pass it a data structure that reads data from disk when it is needed and disposes of it as soon as it is no longer required.
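As a sketch of such a data structure, numpy.memmap already fulfils the requirement: the array lives on disk and only the slice that is actually accessed is read into memory (the file path here is created just for demonstration):

```python
import os
import tempfile
import numpy as np

# Create a file-backed array for demonstration purposes.
path = os.path.join(tempfile.mkdtemp(), 'data.dat')
data = np.memmap(path, dtype='float64', mode='w+', shape=(1000, 10))
data[:] = 1.0
data.flush()

# Reopen lazily: nothing is loaded until we slice.
lazy = np.memmap(path, dtype='float64', mode='r', shape=(1000, 10))
batch = np.asarray(lazy[0:100])   # only this slice is read from disk
```

Any object that behaves like this under slicing can be handed to climin.util.iter_minibatches.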

HDF5 and its Python package h5py are a perfect match for this. Using this simple recipe, we have managed to train on image data sets of more than 6 GB on GPUs with less than 2 GB of RAM:

import climin.util
import gnumpy
import h5py

f = h5py.File('data.h5', 'r')            # open read-only; data stays on disk
ds = f['inpts']
args = climin.util.iter_minibatches([ds], 100, [0])
args = (gnumpy.garray(i) for i in args)  # move each batch to the GPU lazily

# ...

This is in general not restricted by the size of the data set; it just shows that going beyond the GPU's RAM limit is achieved very naturally in climin.

Further usages

This architecture can be exploited in many different ways. For example, a stream over a network connection can be used directly, or a single pass over a file can be performed without keeping more data in memory than necessary.
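As a sketch of the single-pass case, a generator can read a whitespace-separated text file line by line and yield one mini batch at a time, so that only the current batch resides in memory (the file name, format, and helper file_minibatches are hypothetical):

```python
import os
import tempfile
import numpy as np

def file_minibatches(path, batch_size):
    """Single pass over a whitespace-separated file, one mini batch at a time."""
    rows = []
    with open(path) as f:
        for line in f:
            rows.append([float(x) for x in line.split()])
            if len(rows) == batch_size:
                yield (np.array(rows), {})
                rows = []
    if rows:                       # final, possibly smaller batch
        yield (np.array(rows), {})

# Demonstration with a small temporary file of 7 samples.
path = os.path.join(tempfile.mkdtemp(), 'samples.txt')
with open(path, 'w') as f:
    for i in range(7):
        f.write('%d %d\n' % (i, i * 2))

batches = list(file_minibatches(path, 3))
```

Note that such a stream is finite, so the optimizer will stop once the file is exhausted; wrapping the generator in an outer loop that reopens the file restores the infinite-stream contract.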