Advanced data handling
======================

We have already seen how to use batch learning in the :doc:`tutorial`. The use
of ``itertools.repeat`` to handle simple batch learning may have seemed overly
elaborate; it makes more sense once we consider more advanced use cases.


Mini batches
------------

Consider the case where we do not want to calculate the full gradient in each
iteration, but instead estimate it from a mini batch. The contract of climin is
that each item of the ``args`` iterator is used in a single iteration and thus
for one parameter update. The way to enable mini batch learning is therefore to
provide an ``args`` argument that is an infinite stream of mini batches. (A
hand-rolled version of such a stream is sketched at the end of this section.)

Let's revisit our example of logistic regression. There, we created the args
iterator by using ``itertools.repeat`` on the same arrays again and again::

    import itertools
    args = itertools.repeat(([X, Z], {}))

What we want to do now is to have an infinite stream of slices of ``X`` and
``Z``. How do we access the n'th batch of ``X`` and ``Z``? We offer a
convenience function that yields random (with or without replacement) slices
from a container::

    batch_size = 100
    args = ((i, {}) for i in climin.util.iter_minibatches(
        [train_set_x, train_set_y], batch_size, [0, 0]))

The last argument, ``[0, 0]``, gives the axes along which to slice ``[X, Z]``.
In some cases, samples might be aligned along axis ``0`` in the input, but
along axis ``1`` in the target data; see the second sketch at the end of this
section.


External memory
---------------

What is nice about ``climin.util.iter_minibatches`` is that the only
requirement on its arguments is that they support slicing. We therefore only
need to pass it a data structure which reads data from disk as soon as it is
needed and disposes of it as soon as it is no longer. HDF5 and its Python
package `h5py <http://www.h5py.org>`_ are a perfect match for this. We have
managed to use image data sets of more than 6 GB on GPUs with less than 2 GB
of RAM with this simple recipe::

    import climin.util
    import gnumpy
    import h5py

    f = h5py.File('data.h5')
    ds = f['inpts']
    minibatches = climin.util.iter_minibatches([ds], 100, [0])
    # Each item is a list holding a single slice of the data set; move it
    # to the GPU lazily and wrap it in the (args, kwargs) form climin expects.
    args = (([gnumpy.garray(i[0])], {}) for i in minibatches)
    # ...

This recipe is not restricted to data sets of this size; it merely shows that
going beyond the GPU's RAM limit is achieved very naturally in climin.


Further use cases
-----------------

This architecture can be exploited in many different ways. For example, a
stream of data arriving over a network can be used directly. A single pass over
a file, keeping no more data in memory than necessary, is another option; the
last sketch below shows one way to do this.
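First, to make the ``args`` contract from above concrete, here is a minimal
hand-rolled equivalent of what ``climin.util.iter_minibatches`` provides. It
is a sketch under assumptions, not climin's actual implementation: ``X`` and
``Z`` are taken to be numpy arrays with samples along axis ``0``, and batches
are drawn with replacement::

    import numpy as np

    def random_minibatches(X, Z, batch_size):
        # Yield (args, kwargs) pairs forever -- climin consumes exactly
        # one item per parameter update.
        n_samples = X.shape[0]
        while True:
            idxs = np.random.randint(0, n_samples, batch_size)
            yield ([X[idxs], Z[idxs]], {})

    args = random_minibatches(X, Z, 100)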
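Second, if the inputs carry samples along axis ``0`` but the targets along
axis ``1``, only the axes argument changes. The shapes here are hypothetical,
chosen purely for illustration::

    # X has shape (n_samples, n_inputs); Z has shape (n_outputs, n_samples).
    args = ((i, {}) for i in climin.util.iter_minibatches(
        [X, Z], batch_size, [0, 1]))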
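Finally, a single pass over a file. This sketch assumes a plain text file with
one whitespace-separated sample per line (the file name is hypothetical); the
stream simply ends after one epoch, and no more than one batch is ever held in
memory::

    import numpy as np

    def file_minibatches(filename, batch_size):
        # Accumulate lines until a full batch is ready, yield it, and
        # discard it before reading further.
        batch = []
        with open(filename) as fp:
            for line in fp:
                batch.append([float(f) for f in line.split()])
                if len(batch) == batch_size:
                    yield ([np.array(batch)], {})
                    batch = []
        if batch:
            # Yield the possibly smaller final batch as well.
            yield ([np.array(batch)], {})

    args = file_minibatches('data.txt', 100)

The same pattern applies to a stream over a network: replace the iteration
over file lines with reads from a socket.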