climin: optimization, straight-forward¶
introduction¶
Optimization is a key ingredient in modern machine learning. While many models can be optimized with specialized methods, others, such as neural networks, require very general gradient based techniques. What is worse, when training fails you never know whether your model is at fault or you just have not found the right optimizer.
This is where climin comes in. We offer you many different off-the-shelf optimizers, such as LBFGS, stochastic gradient descent with Nesterov momentum, nonlinear conjugate gradients, resilient propagation, rmsprop and more.
But best of all, we know that optimization is a process that needs to be analyzed. This is why climin does not offer you a black box function call that takes ages to run and might give you a good minimum of your training loss. Since that alone is rarely what you care about, climin takes you along on its travel through the error landscape... in a classic for loop:
import itertools

import climin

network = make_network()                  # your neural network
training_data = load_train_data()         # your training data
validation_data = load_validation_data()  # your validation data
test_data = load_test_data()              # your testing data

opt = climin.Lbfgs(network.parameters,
                   network.loss,
                   network.d_loss_d_parameters,
                   args=itertools.repeat((training_data, {})))

for info in opt:
    validation_loss = network.loss(network.parameters, validation_data)
    print info['loss'], validation_loss
    if info['n_iter'] >= 100:             # stopping is up to you
        break

print network.loss(network.parameters, test_data)
Climin works on the CPU (via numpy and scipy) and in parts on the GPU (via gnumpy).
Starting points¶
If you want to see how climin works and use climin asap, check out the Tutorial. Details on Installation are available. A list of the optimizers implemented can be found in the overview below. If you want to understand the design decisions of climin, read the Manifest.
Contact & Community¶
Climin was started by people from the Biomimetic robotics and machine learning group at the Technische Universitaet Muenchen. If you have any questions, there is a mailing list at climin@librelist.com. Just send an email to subscribe, or check out the archive. The software is hosted in our github repository, where you can also find our issue tracker.
Meta¶
Manifest¶
Climin started with the observation that plain stochastic gradient descent was not able to find good solutions for sparse filtering [sparsefiltering]. The original article mentions LBFGS as the optimization method, yet the implementation offered by scipy did not solve the problem for us immediately. Since Matlab has a very powerful optimization library, we decided that it was time for Python to catch up in this respect.
We found several requirements for a good optimization library.
- Optimization in machine learning is mostly accompanied by online evaluation code: live plotting of error curves or parameters, or sending you an email once your model has beaten the current state of the art. You might also have your own stopping criteria. We call this the “side logic” of an optimization. Every user has their own way of dealing with this side logic, and a lot of throwaway code is written for it. We wanted to make this part as easy as possible for the user.
- Do one thing, and do it right: climin is independent of your models and the way you work with optimization. We have a simple protocol: loss functions (and their derivatives) and a parameter array. We do not force any framework on you, nor do we try to come up with something that solves everything.
- Most of the optimizers, i.e. those that do not rely on too much linear algebra such as matrix inversions, should not only work on the CPU via numpy but also on the GPU via gnumpy.
- Optimizers should be easily switchable; given a model and a loss, we want to be able to quickly experiment with different methods.
- Optimizers should be reasonably fast. Most of the computational work in machine learning is done within the models anyway. Still, we want a clean Python code base without C extensions. Speeding up code with Cython would be a good way to go where necessary; since numba is around the corner, we postponed this decision to a later version.
- Make development of new optimizers straightforward. The implementation of every optimizer has very little overhead, most of it being the assignment of hyper parameters to attributes.
The main idea of climin is to treat optimizers as iterators. This lets the side logic live in the same scope and code block as the optimization itself. Also, callbacks are really ugly! Python has better tools for that.
[sparsefiltering] Ngiam, Jiquan, et al. “Sparse filtering.” Advances in Neural Information Processing Systems. 2011.
Installation¶
Climin has been tested on Python 2.7 with numpy 1.8 and scipy 0.13. Currently, it is only available via git from github:
$ git clone https://github.com/BRML/climin.git
After that, pip can be used to install it on the Python path:
$ cd climin
$ pip install .
If you want to check that everything works as expected, run the test suite:
$ nosetests tests/
for which nose is required.
Basics¶
Tutorial¶
In the following, we will explain the basic features of climin with a simple example: multinomial logistic regression, which suffices to show much of climin's functionality.
Although we will use numpy in this example, we have spent quite some effort to make most optimizers work with gnumpy as well, which makes the use of GPUs possible. Check the reference documentation of a specific optimizer to see whether GPU usage is supported.
Defining a Loss Function¶
At the heart of optimization lies the objective function we wish to optimize. In climin, we will always be concerned with minimization, even though algorithms are sometimes defined with respect to maximization in the literature (e.g. Bayesian optimization or evolution strategies). Thus, we will also be talking about loss functions.
A loss function in climin follows a simple protocol: a callable (e.g. a function) which takes an array with a single dimension as input and returns a scalar. An example would be a simple polynomial of degree 2:
def loss(x):
    return (x ** 2).sum()
In machine learning, the input array will mostly be the parameters of our model. Additionally, we will often pass further arguments to the loss, the most important being the data that our model works on.
Logistic Regression¶
Optimizing a scalar quadratic with an iterative technique is all nice, but we want to do more complicated things. We will thus move on to use logistic regression.
In climin, the parameter vector will always be one dimensional. Your loss function will have to unpack that vector into the various parameters of different shapes. While this might seem tedious at first, it makes some calculations much easier and also more efficient.
Logistic regression has commonly two different parameter sets, the weight matrix (or coefficients) and the bias (or intercept). To unpack the parameters we define the following function:
import numpy as np

n_inpt, n_output = 784, 10    # MNIST: 784 input dimensions, 10 classes

def unpack_parameters(pars):
    w = pars[:n_inpt * n_output].reshape((n_inpt, n_output))
    b = pars[n_inpt * n_output:].reshape((1, n_output))
    return w, b
We assume that inputs to our model will be an array of size (n, d), where n refers to the number of samples and d to the dimensionality. Given some input we can then make predictions with the following function:
def predict(parameters, inpt):
    w, b = unpack_parameters(parameters)
    before_softmax = np.dot(inpt, w) + b
    softmaxed = np.exp(before_softmax - before_softmax.max(axis=1)[:, np.newaxis])
    return softmaxed / softmaxed.sum(axis=1)[:, np.newaxis]
For multiclass classification, we use the cross entropy loss:
def loss(parameters, inpt, targets):
    predictions = predict(parameters, inpt)
    loss = -np.log(predictions) * targets
    return loss.sum(axis=1).mean()
Gradient-based optimization requires not only the loss but also its first derivative with respect to the parameters. The gradient function has to return the gradients aligned with the parameters, which is why we concatenate them into one big array after flattening the weight matrix:
def d_loss_wrt_pars(parameters, inpt, targets):
    p = predict(parameters, inpt)
    g_w = np.dot(inpt.T, p - targets) / inpt.shape[0]
    g_b = (p - targets).mean(axis=0)
    return np.concatenate([g_w.flatten(), g_b])
Although this implementation could no doubt be optimized, it suffices for this tutorial.
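Though not strictly part of the tutorial, it can be reassuring to verify the analytical gradient against a numerical approximation. A minimal sketch using scipy.optimize.check_grad, with made-up random parameters and a tiny fake data set:

import numpy as np
from scipy.optimize import check_grad

pars = np.random.normal(0, 0.01, n_inpt * n_output + n_output)
inpt = np.random.random((5, n_inpt))
targets = np.eye(n_output)[np.random.randint(n_output, size=5)]

# check_grad returns the norm of the difference between the analytical
# and a finite difference gradient; it should be close to zero.
print check_grad(loss, d_loss_wrt_pars, pars, inpt, targets)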
An Array for the Parameters¶
First we will need to allocate a region of memory where our parameters live. Climin tries to allocate as little additional memory as possible and will thus work in place most of the time. After each optimization iteration, the current solution will always be in the array we created. This leaves as much control as possible to the user. We create an empty array for our solution:
import numpy as np
wrt = np.empty(7850)
where 7850 refers to the dimensionality of our problem (784 × 10 weights plus 10 bias terms). We picked this number because we will be tackling the MNIST data set. It makes sense to initialize the parameters randomly (depending on the problem), even though the convexity of logistic regression guarantees that we will always find the minimum. Climin offers convenience functions in its initialize module:
import climin.initialize
climin.initialize.randomize_normal(wrt, 0, 1)
This will populate the parameters with values drawn from a standard normal distribution.
Using data¶
Now that we have set up our model and loss and initialized the parameters, we need to manage the data.
In climin, we will always look at streams of data. Even if we do batch learning (as we do here), the recommended way of doing so is a repeating stream of the same data. What does that stream look like? In Python, we have a convenient data structure for this: the iterator. It can be thought of as a lazy list, possibly of infinite length.
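As a quick illustration of the concept (plain Python, nothing climin specific), itertools can build such lazy, possibly infinite streams, of which we only ever look at finite prefixes:

import itertools

stream = itertools.repeat('batch')          # an infinite stream of one item
print list(itertools.islice(stream, 3))     # ['batch', 'batch', 'batch']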
The climin API expects that the loss function (and the gradient function) will accept the parameter array as the first argument. All further arguments are up to the user. When we initialize an optimizer, a keyword argument args can be specified. This is expected to be an iterator which yields pairs (a, kw); these are then passed to the loss as f(parameters, *a, **kw) and, in the case of the derivative, as fprime(parameters, *a, **kw).
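To make that contract concrete, here is a sketch of what one conceptual optimization step does with an element of args; the names are illustrative, not climin internals:

a, kw = next(args)                  # fetch the next data element
gradient = fprime(wrt, *a, **kw)    # call the derivative with it
wrt -= 0.1 * gradient               # e.g. a plain gradient step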
We will be using the MNIST data set, which can be downloaded from the URL given in the snippet below. We will first load it and convert the target variables to a one-of-k representation, which is what our loss functions expect:
import gzip
import cPickle

import numpy as np

# You can get this at http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz
datafile = 'mnist.pkl.gz'

# Load data.
with gzip.open(datafile, 'rb') as f:
    train_set, val_set, test_set = cPickle.load(f)

X, Z = train_set
VX, VZ = val_set
TX, TZ = test_set

def one_hot(arr):
    result = np.zeros((arr.shape[0], 10))
    result[np.arange(arr.shape[0]), arr] = 1.
    return result

Z = one_hot(Z)
VZ = one_hot(VZ)
TZ = one_hot(TZ)
To create our data stream, we will just repeat the training data (X, Z):
import itertools
args = itertools.repeat(([X, Z], {}))
This certainly seems like overkill for logistic regression. Yet, even this simple model can often be sped up by estimating the gradients on “mini batches”. Going even further, you might want to have a continuous stream that is read from the network, a data set that does not fit into RAM or which you want to transform on the fly. All these things can be elegantly implemented with iterators.
Creating an Optimizer¶
Now that we have set everything up, we are ready to create our first optimizer, a GradientDescent object:
import climin
opt = climin.GradientDescent(wrt, d_loss_wrt_pars, step_rate=0.1, momentum=.95, args=args)
We created a new object called opt. For initialization, we passed it several parameters:

- The parameters wrt. This will always be the first argument to any optimizer in climin.
- The derivative d_loss_wrt_pars; we do not need loss itself for gradient descent.
- A scalar step_rate to multiply the negative gradient with for the next search step. This parameter is often referred to as learning rate in the literature.
- A momentum term momentum to speed up learning.
- Our data stream args.

The parameters wrt and args are consistent over optimizers. All others may vary wildly, according to what an optimizer expects.
Optimization as Iteration¶
Many optimization algorithms are iterative and so are all in climin. To transfer this metaphor into programming code, optimization with climin is as simple as iterating over our optimizer object:
for i in opt:   # Infinite loop!
    pass
This will result in an infinite loop. Climin does not handle stopping from within optimizer objects; instead, you have to do it manually, since you know your use case best. Let's iterate for a fixed number of iterations, say 100:
print loss(wrt, VX, VZ)   # prints something like 2.49771627484

for info in opt:
    if info['n_iter'] >= 100:
        break

print loss(wrt, VX, VZ)   # prints something like 0.324243334583
When we iterate over the optimizer, we iterate over dictionaries. Each of these contains various information about the current state of the optimizer. The exact contents depend on the optimizer, but might include the last step, the gradient, etc. Here, we check the number of iterations that have already been performed.
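Since stopping is plain Python, criteria can be combined freely. A sketch of a hand-rolled early stopping rule on the validation loss (the patience and iteration limits are made up for illustration):

best, patience = float('inf'), 0

for info in opt:
    val_loss = loss(wrt, VX, VZ)
    if val_loss < best:
        best, patience = val_loss, 0
    else:
        patience += 1
    # Stop after 1000 iterations or 10 checks without improvement.
    if info['n_iter'] >= 1000 or patience >= 10:
        break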
Using different optimizers¶
The whole point of climin is to make it easy to use different optimizers. We will now explain how. We have already seen gradient descent. Furthermore, there are Quasi-Newton methods (BFGS, L-BFGS), nonlinear conjugate gradients, resilient propagation and rmsprop. Let's see how we can use each of them.
L-BFGS, Rprop and nonlinear conjugate gradients all have the benefit that they work reasonably well without much tuning of their hyper parameters. We can thus construct optimizers like this:
ncg = climin.NonlinearConjugateGradient(wrt, loss, d_loss_wrt_pars, args=args)
lbfgs = climin.Lbfgs(wrt, loss, d_loss_wrt_pars, args=args)
rprop = climin.Rprop(wrt, d_loss_wrt_pars, args=args)
rmsprop = climin.RmsProp(wrt, d_loss_wrt_pars, step_rate=1e-4, decay=0.9, args=args)
As you can see, we now need to specify the loss function itself in the case of Lbfgs and NonlinearConjugateGradient. That is because both utilize a line search after finding a search direction. climin.RmsProp has more hyper parameters and needs more fine grained tuning. Yet, climin.GradientDescent and climin.RmsProp work naturally with a stochastic estimate of the objective and work very well with mini batches. For more details, see Advanced data handling.
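Because all optimizers share the same iterator interface, trying them side by side is a short loop. A sketch, re-initializing the parameters before each run so the comparison starts from a fresh solution:

for name, opt in [('ncg', ncg), ('lbfgs', lbfgs), ('rprop', rprop), ('rmsprop', rmsprop)]:
    climin.initialize.randomize_normal(wrt, 0, 1)   # fresh start for each run
    for info in opt:
        if info['n_iter'] >= 100:
            break
    print name, loss(wrt, VX, VZ)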
Conclusion and Next Steps¶
This tutorial explained the basic functionality of climin. There is a lot more to explore to fully leverage the functionality of this library. Check the table of contents and the examples directory of your climin checkout.
Dealing with parameters¶
Parameters in climin are stored in a single long, one dimensional array. This might seem like a restriction at first, yet it makes things easier in other places. Consider a model involving complicated array dimensionalities; now consider what higher order derivatives of those might look like. Yes, that's right, a pretty messy thing. Furthermore, letting parameters occupy a consecutive region of memory has advantages from an implementation point of view: we can more easily write it to disk, randomize its contents, and similar things.
Creating parameter sets¶
Creating parameter arrays need not be tedious, though. Climin comes with a nice convenience function, climin.util.empty_with_views, which does most of the work. You just need to feed it the shapes of the parameters you are interested in.
Let us use logistic regression from the Tutorial to see where it comes in handy. First, we will create a parameter array and the various views according to a template:
import numpy as np
import climin.util
tmpl = [(784, 10), 10] # w is matrix and b a vector
flat, (w, b) = climin.util.empty_with_views(tmpl)
Now, flat is a one dimensional array, while w and b are a two dimensional and a one dimensional array respectively. They share memory with flat, so any change we make to w or b will be reflected in flat and vice versa. In order for a predict function to get the parameters out of the flat array, there is climin.util.shaped_from_flat, which does the same job as empty_with_views except that it receives flat instead of creating it. In fact, the latter uses the former internally.
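A quick sketch to convince yourself of the memory sharing:

flat[:] = 0.
w[0, 0] = 1.       # writing through the view...
print flat[0]      # ...shows up in the flat array: 1.0
b[-1] = 2.
print flat[-1]     # 2.0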
Let’s adapt the predict
function to use w
and b
instead:
def predict(parameters, inpt):
    w, b = climin.util.shaped_from_flat(parameters, tmpl)
    before_softmax = np.dot(inpt, w) + b
    softmaxed = np.exp(before_softmax - before_softmax.max(axis=1)[:, np.newaxis])
    return softmaxed / softmaxed.sum(axis=1)[:, np.newaxis]
This might seem like overkill for logistic regression, but becomes invaluable when complicated models with many different parameters are used.
Calculating derivatives in place¶
When calculating derivatives, you can make use of this as well, which is important because climin expects derivatives to be flat, nicely aligned with the parameter array:
def f_d_loss_wrt_pars(parameters, inpt, targets):
    p = predict(parameters, inpt)
    d_flat, (d_w, d_b) = climin.util.empty_with_views(tmpl)
    d_w[...] = np.dot(inpt.T, p - targets) / inpt.shape[0]
    d_b[...] = (p - targets).mean(axis=0)
    return d_flat
What are we doing here? First, we get ourselves a new array and preshaped views
on it in the same way as the parameters. Then we overwrite the views in place
with the derivatives and finally return the flat array as a result.
The in place assignment is important. If we did it using d_w = ..., Python would just reassign the name and the changes would not turn up in d_flat.
As a further note, np.dot supports an extra argument out which specifies where to write the result. To save memory, we could perform the following instead:
np.dot(inpt.T, p - targets, out=d_w)
d_w /= inpt.shape[0]
Initializing parameters¶
Initializing parameters with empty values is asking for trouble. You probably want to populate an array with random numbers or zeros. Of course it is easy to do so by hand:
flat[...] = np.random.normal(0, 0.1, flat.shape)
We found this quite tedious to write, though, especially once flat becomes the field of a nested object. Thus, we have a shorthand in the initialize module which does exactly that:
import climin.initialize
climin.initialize.randomize_normal(flat, 0, 0.1)
There are more functions to do similar things. Check out Initialization of Parameters.
Advanced data handling¶
We have already seen how to use batch learning in the Tutorial. The use of itertools.repeat to handle simple batch learning may have seemed rather overly expressive; it makes more sense once we consider more advanced use cases.
Mini batches¶
Consider the case where we do not want to calculate the full gradient in each iteration, but instead estimate it given a mini batch. The contract of climin is that each item of the args iterator will be used in a single iteration and thus for one parameter update. The way to enable mini batch learning is thus to provide an args argument which is an infinite stream of mini batches.
Let’s revisit our example of logistic regression. Here, we created the args
iterator using itertools.repeat
on the same array again and again:
import itertools
args = itertools.repeat(([X, Z], {}))
What we want to do now is to have an infinite stream of slices of X and Z. How do we access the n'th batch of X and Z? We offer you a convenience function that will give you random (with or without replacement) slices from a container:
batch_size = 100
args = ((i, {}) for i in climin.util.iter_minibatches([X, Z], batch_size, [0, 0]))
The last argument, [0, 0], gives the axes along which to slice [X, Z]. In some cases, samples might be aligned along axis 0 for the input, but along axis 1 in the target data.
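For illustration, a hypothetical target array ZT whose samples run along axis 1 would be sliced like this:

ZT = Z.T    # targets of shape (10, n): samples along axis 1
args = ((i, {}) for i in climin.util.iter_minibatches([X, ZT], batch_size, [0, 1]))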
External memory¶
What is nice about climin.util.iter_minibatches is that it only requires slicing from its arguments. We therefore only need to pass it a data structure which reads data from disk as soon as it is needed and disposes of it as soon as it no longer is.
HDF5 and its python package h5py are a perfect match for this. We have managed to use 6+ GB sized image data sets on GPUs with less than 2 GB of RAM with this simple recipe:
import climin.util
import gnumpy
import h5py

f = h5py.File('data.h5')
ds = f['inpts']

batches = climin.util.iter_minibatches([ds], 100, [0])
# Move each batch to the GPU and wrap it in the (args, kwargs) protocol.
args = (([gnumpy.garray(i) for i in batch], {}) for batch in batches)
# ...
This recipe is not restricted to data sets of this size in general; it just shows that going beyond the GPU RAM limit is achieved very naturally in climin.
Further usages¶
This architecture can be exploited in many different ways. E.g., a stream over a network can be used directly. A single pass over a file, without keeping more in memory than necessary, is another option (see the sketch below).
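As a sketch of the single-pass case, here is a generator that lazily reads mini batches from a CSV file; the file name and line format (features followed by one-of-k targets) are made up for illustration:

import numpy as np

def csv_minibatches(filename, batch_size, n_inpt):
    # Yield (args, kwargs) pairs from a CSV file in one lazy pass.
    batch = []
    with open(filename) as fp:
        for line in fp:
            batch.append([float(field) for field in line.split(',')])
            if len(batch) == batch_size:
                arr = np.array(batch)
                yield ((arr[:, :n_inpt], arr[:, n_inpt:]), {})
                batch = []

args = csv_minibatches('data.csv', 100, 784)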
Interrupting and Resuming via Checkpoints¶
It is important to be able to interrupt optimization and continue right from where you left off. Reasons include scheduling on shared resources, branching the optimization with different settings or securing yourself against crashes in long-running processes.
Note
Currently, this is not supported by all optimizers. It is the case for gradient descent, rmsprop, adadelta and rprop.
Climin makes this possible in parts and leaves the responsibility to the user in other parts. More specifically, the user has to take over the serialization of the parameter vector (i.e. wrt), the objective function and its derivatives (e.g. fprime) and the data (i.e. args). The reason for this is that one cannot build a generic procedure for it: the data might depend on an open file descriptor, and only a subset of Python functions can be serialized, namely those defined at the top level of a module.
Saving the state to disk (or somewhere else)¶
The idea is that the info dictionary, which is the result of each optimization step, carries all the information necessary to resume. Thus, a recipe to write your state to disk is as follows.
import numpy as np
import cPickle

from climin.gd import GradientDescent

pars = make_pars()
fprime = make_fprime()
data = make_data()

opt = GradientDescent(pars, fprime, args=data)

for info in opt:
    with open('state.pkl', 'w') as fp:
        cPickle.dump(info, fp)
    np.savetxt('parameters.csv', pars)
This snippet first generates the necessary quantities from library functions which we assume given. We then create a GradientDescent object over which we iterate to optimize. In each iteration, we pickle the info dictionary to disk.
Note
Pickling an info dictionary directly to disk might be a bad idea in many cases. E.g. it will contain the current data element, or a gnumpy array, which is not picklable. It is the user's responsibility to take care of that.
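One hedged way to deal with this is to filter the dictionary down to plainly picklable values before dumping it; a sketch:

import cPickle

import numpy as np

def picklable_subset(info):
    # Keep only entries with plainly picklable values (a conservative choice;
    # note that dropped entries cannot be restored by set_from_info later).
    ok = (bool, int, float, str, np.ndarray)
    return dict((k, v) for k, v in info.items() if isinstance(v, ok))

with open('state.pkl', 'w') as fp:
    cPickle.dump(picklable_subset(info), fp)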
Loading the state from disk¶
We will now load the info dictionary from file, create an optimizer object and initialize it with values from the info dictionary.
import numpy as np
import cPickle

from climin.gd import GradientDescent

pars = np.loadtxt('parameters.csv')
fprime = make_fprime()
data = make_data()

with open('state.pkl') as fp:
    info = cPickle.load(fp)

opt = GradientDescent(pars, fprime, args=data)
opt.set_from_info(info)
We can continue optimization right from where we left off.
Reference¶
Optimizer overview¶
Gradient Descent¶
This module provides an implementation of gradient descent.
class climin.gd.GradientDescent(wrt, fprime, step_rate=0.1, momentum=0.0, momentum_type='standard', args=None)¶

Classic gradient descent optimizer.
Gradient descent works by iteratively performing updates solely based on the first derivative of a problem. The gradient is calculated and multiplied with a scalar (or component wise with a vector) to do a step in the problem space. For speed ups, a technique called “momentum” is often used, which averages search steps over iterations.
Even though gradient descent is pretty simple, it can be very effective if well tuned (in terms of its hyper parameters step rate and momentum). Sometimes the use of schedules for both parameters is necessary; see climin.schedule for basic schedules.

Gradient descent is also very robust to stochasticity in the objective function. This might result from noise injected into it (e.g. in the case of denoising auto encoders) or because it is based on data samples (e.g. in the case of stochastic mini batches).

Given a step rate \(\alpha\) and a function \(f'\) to evaluate the search direction at the current parameters \(\theta_t\), the following update is performed:

\[\begin{split}v_{t+1} &= \alpha f'(\theta_t) \\ \theta_{t+1} &= \theta_t - v_{t+1}.\end{split}\]

If we also have a momentum \(\beta\) and are using standard momentum, we update the parameters according to:

\[\begin{split}v_{t+1} &= \alpha f'(\theta_t) + \beta v_{t} \\ \theta_{t+1} &= \theta_t - v_{t+1}\end{split}\]

In some cases (e.g. learning the parameters of deep networks), using Nesterov momentum can be beneficial. In this case, we first make a momentum step and then evaluate the gradient at the location in between; this comes at the additional cost of one more addition of parameter-sized arrays:

\[\begin{split}\theta_{t+{1 \over 2}} &= \theta_t - \beta v_t \\ v_{t+1} &= \alpha f'(\theta_{t + {1 \over 2}}) \\ \theta_{t+1} &= \theta_t - v_{t+1}\end{split}\]

The momentum type is selected with the initialization argument momentum_type.
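For concreteness, here is a minimal numpy sketch of the two momentum variants above. It is illustrative only, not climin's implementation; quad_grad is a made-up gradient function.

import numpy as np

def quad_grad(theta):
    # Gradient of the made-up objective f(theta) = 0.5 * ||theta||^2.
    return theta

theta, v = np.ones(3), np.zeros(3)
alpha, beta = 0.1, 0.9

for _ in range(100):
    # Standard momentum: evaluate the gradient at the current parameters.
    v = alpha * quad_grad(theta) + beta * v
    # Nesterov momentum would evaluate it at the intermediate point instead:
    # v = alpha * quad_grad(theta - beta * v) + beta * v
    theta -= v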
.Note
Works with gnumpy.
Attributes

- wrt (array_like): Current solution to the problem. Can be given as a first argument to .fprime.
- fprime (callable): First derivative of the objective function. Returns an array of the same shape as .wrt.
- step_rate (float or array_like): Step rate to multiply the gradients with.
- momentum (float or array_like): Momentum to multiply previous steps with.
- momentum_type (string, either “standard” or “nesterov”): When to add the momentum term to the parameter vector; in the first case it will be done after the calculation of the gradient, in the latter before.

Methods

__init__(wrt, fprime, step_rate=0.1, momentum=0.0, momentum_type='standard', args=None)¶

Create a GradientDescent object.

Parameters:

- wrt (array_like): Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.
- fprime (callable): First derivative of the objective function.
- step_rate (float or array_like, or iterable of that): Step rate to use during optimization. Can be given as a single scalar value or as an array for a different step rate for each parameter of the problem. Can also be given as an iterator; in that case, every iteration of the optimization takes a new element as a step rate from that iterator.
- momentum (float or array_like, or iterable of that): Momentum to use during optimization. Can be specified analogously to (but independently of) step rate.
- momentum_type (string, either “standard” or “nesterov”): When to add the momentum term to the parameter vector; in the first case it will be done after the calculation of the gradient, in the latter before.
- args (iterable): Iterator of arguments which fprime will be called with.
rmsprop¶
This module provides an implementation of rmsprop.
class climin.rmsprop.RmsProp(wrt, fprime, step_rate, decay=0.9, momentum=0, step_adapt=False, step_rate_min=0, step_rate_max=inf, args=None)¶

RmsProp optimizer.
RmsProp [tieleman2012rmsprop] is an optimizer that utilizes the magnitude of recent gradients to normalize the gradients. We always keep a moving average over the root mean squared (hence Rms) gradients, by which we divide the current gradient. Let \(f'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\) and a decay term \(\gamma\) we perform the following updates:
\[\begin{split}r_t &= (1 - \gamma)~f'(\theta_t)^2 + \gamma r_{t-1}, \\ v_{t+1} &= {\alpha \over \sqrt{r_t}} f'(\theta_t), \\ \theta_{t+1} &= \theta_t - v_{t+1}.\end{split}\]

In some cases, adding a momentum term \(\beta\) is beneficial. Here, Nesterov momentum is used:

\[\begin{split}\theta_{t+{1 \over 2}} &= \theta_t - \beta v_t, \\ r_t &= (1 - \gamma)~f'(\theta_{t + {1 \over 2}})^2 + \gamma r_{t-1}, \\ v_{t+1} &= \beta v_t + {\alpha \over \sqrt{r_t}} f'(\theta_{t + {1 \over 2}}), \\ \theta_{t+1} &= \theta_t - v_{t+1}\end{split}\]

Additionally, this implementation has adaptable step rates. As soon as the components of the step and the momentum point in the same direction (thus have the same sign), the step rate for that parameter is multiplied with 1 + step_adapt. Otherwise, it is multiplied with 1 - step_adapt. In any case, the minimum and maximum step rates step_rate_min and step_rate_max are respected and exceeding values are truncated to them.

RmsProp has several advantages; for one, it is a very robust optimizer which has pseudo curvature information. Additionally, it can deal with stochastic objectives very nicely, making it applicable to mini batch learning.
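For concreteness, a minimal numpy sketch of the basic, momentum-free update above; illustrative only, not climin's implementation (r should start as an array of zeros):

import numpy as np

def rmsprop_step(theta, grad, r, step_rate=1e-4, decay=0.9):
    # Moving average of squared gradients, then a normalized step.
    r = (1 - decay) * grad ** 2 + decay * r
    step = step_rate / np.sqrt(r) * grad
    return theta - step, r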
Note
Works with gnumpy.
[tieleman2012rmsprop] Tieleman, T. and Hinton, G. (2012), Lecture 6.5 - rmsprop, COURSERA: Neural Networks for Machine Learning.

Attributes

- wrt (array_like): Current solution to the problem. Can be given as a first argument to .fprime.
- fprime (callable): First derivative of the objective function. Returns an array of the same shape as .wrt.
- step_rate (float or array_like): Step rate of the optimizer. If an array, per parameter step rates are used.
- momentum (float or array_like): Momentum of the optimizer. If an array, per parameter momentums are used.
- step_adapt (float or bool): Constant to adapt step rates. If False, step rate adaption is not done.
- step_rate_min (float, optional, default 0): When adapting step rates, do not move below this value.
- step_rate_max (float, optional, default inf): When adapting step rates, do not move above this value.

Methods
__init__(wrt, fprime, step_rate, decay=0.9, momentum=0, step_adapt=False, step_rate_min=0, step_rate_max=inf, args=None)¶

Create an RmsProp object.

Parameters:

- wrt (array_like): Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.
- fprime (callable): First derivative of the objective function.
- step_rate (float or array_like): Step rate to use during optimization. Can be given as a single scalar value or as an array for a different step rate for each parameter of the problem.
- decay (float): Decay parameter for the moving average. Must lie in [0, 1) where lower numbers mean a shorter “memory”.
- momentum (float or array_like): Momentum to use during optimization. Can be specified analogously to (but independently of) step rate.
- step_adapt (float or bool): Constant to adapt step rates. If False, step rate adaption is not done.
- step_rate_min (float, optional, default 0): When adapting step rates, do not move below this value.
- step_rate_max (float, optional, default inf): When adapting step rates, do not move above this value.
- args (iterable): Iterator over arguments which fprime will be called with.
Adadelta¶
This module provides an implementation of adadelta.
class climin.adadelta.Adadelta(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)¶

Adadelta optimizer.

Adadelta [zeiler2013adadelta] is a method that uses the magnitude of recent gradients and steps to obtain an adaptive step rate. An exponential moving average over the gradients and steps is kept; a scale of the learning rate is then obtained from their ratio.
Let \(f'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\), a decay term \(\gamma\) and an offset \(\epsilon\) we perform the following updates:
\[g_t = (1 - \gamma)~f'(\theta_t)^2 + \gamma g_{t-1}\]

where \(g_0 = 0\). Let \(s_0 = 0\) for updating the parameters:

\[\begin{split}\Delta \theta_t &= \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_t), \\ \theta_{t+1} &= \theta_t - \Delta \theta_t.\end{split}\]

Subsequently we adapt the moving average of the steps:

\[s_t = (1 - \gamma)~\Delta\theta_t^2 + \gamma s_{t-1}.\]

To extend this with Nesterov's accelerated gradient, we need a momentum coefficient \(\beta\) and incorporate it by using slightly different formulas:

\[\begin{split}\theta_{t + {1 \over 2}} &= \theta_t - \beta \Delta \theta_{t-1}, \\ g_t &= (1 - \gamma)~f'(\theta_{t + {1 \over 2}})^2 + \gamma g_{t-1}, \\ \Delta \theta_t &= \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_{t + {1 \over 2}}).\end{split}\]

In its original formulation, only the case \(\alpha = 1, \beta = 0\) was considered.

[zeiler2013adadelta] Zeiler, Matthew D. “ADADELTA: An adaptive learning rate method.” arXiv preprint arXiv:1212.5701 (2012).

Methods
__init__(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)¶

Create an Adadelta object.

Parameters:

- wrt (array_like): Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.
- fprime (callable): First derivative of the objective function.
- step_rate (scalar or array_like, optional [default: 1]): Value to multiply steps with before they are applied to the parameter vector.
- decay (float, optional [default: 0.9]): Decay parameter for the moving average. Must lie in [0, 1) where lower numbers mean a shorter “memory”.
- momentum (float or array_like, optional [default: 0]): Momentum to use during optimization. Can be specified analogously to (but independently of) step rate.
- offset (float, optional [default: 1e-4]): Before taking the square root of the running averages, this offset is added.
- args (iterable): Iterator over arguments which fprime will be called with.
Adam¶
This module provides an implementation of Adam.
class climin.adam.Adam(wrt, fprime, step_rate=0.0002, decay=None, decay_mom1=0.1, decay_mom2=0.001, momentum=0, offset=1e-08, args=None)¶

Adaptive moment estimation optimizer (Adam).
Adam is a method for the optimization of stochastic objective functions.
The idea is to estimate the first two moments with exponentially decaying running averages. Additionally, these estimates are bias corrected which improves over the initial learning steps since both estimates are initialized with zeros.
The rest of the documentation follows the original paper [adam2014] and is only meant as a quick primer. We refer to the original source for more details, such as results on convergence and discussion of the various hyper parameters.
Let \(f_t'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\), decay terms \(\beta_1\) and \(\beta_2\) for the first and second moment estimates respectively and an offset \(\epsilon\) we initialise the following quantities
\[\begin{split}m_0 &\leftarrow 0 \\ v_0 &\leftarrow 0 \\ t &\leftarrow 0\end{split}\]

and perform the following updates:

\[\begin{split}t &\leftarrow t + 1 \\ g_t &\leftarrow f_t'(\theta_{t-1}) \\ m_t &\leftarrow \beta_1 \cdot g_t + (1 - \beta_1) \cdot m_{t-1} \\ v_t &\leftarrow \beta_2 \cdot g_t^2 + (1 - \beta_2) \cdot v_{t-1}\end{split}\]

\[\begin{split}\hat{m}_t &\leftarrow {m_t \over 1 - (1 - \beta_1)^t} \\ \hat{v}_t &\leftarrow {v_t \over 1 - (1 - \beta_2)^t} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha {\hat{m}_t \over \sqrt{\hat{v}_t} + \epsilon}\end{split}\]

As suggested in the original paper, the last three steps are optimized for efficiency by using:

\[\begin{split}\alpha_t &\leftarrow \alpha {\sqrt{1 - (1 - \beta_2)^t} \over 1 - (1 - \beta_1)^t} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha_t {m_t \over \sqrt{v_t} + \epsilon}\end{split}\]

The quantities in the algorithm and their corresponding attributes in the optimizer object are as follows.
- \(t\), attribute n_iter: Number of iterations, starting at 0.
- \(m_t\), attribute est_mom_1_b: Biased estimate of the first moment.
- \(v_t\), attribute est_mom_2_b: Biased estimate of the second moment.
- \(\hat{m}_t\), attribute est_mom_1: Unbiased estimate of the first moment.
- \(\hat{v}_t\), attribute est_mom_2: Unbiased estimate of the second moment.
- \(\alpha\), attribute step_rate: Step rate parameter.
- \(\beta_1\), attribute decay_mom1: Exponential decay parameter for the first moment estimate.
- \(\beta_2\), attribute decay_mom2: Exponential decay parameter for the second moment estimate.
- \(\epsilon\), attribute offset: Safety offset for the division by the estimate of the second moment.

Additionally, using Nesterov momentum is possible by setting the momentum attribute of the optimizer to a value other than 0. We apply the momentum step before computing the gradient, resulting in an incorporation of Nesterov momentum into Adam similar to the one presented in [nadam2015].
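For concreteness, a minimal numpy sketch of the bias-corrected update in its efficient form, using this documentation's decay convention (\(\beta_i = 1 - \beta^{\ast}_i\)); illustrative only, not climin's implementation:

import numpy as np

def adam_step(theta, grad, m, v, t, step_rate=0.0002,
              decay_mom1=0.1, decay_mom2=0.001, offset=1e-8):
    # Biased moment estimates with climin's decay convention.
    t += 1
    m = decay_mom1 * grad + (1 - decay_mom1) * m
    v = decay_mom2 * grad ** 2 + (1 - decay_mom2) * v
    # Bias correction folded into the step rate, as in the paper.
    step_rate_t = step_rate * np.sqrt(1 - (1 - decay_mom2) ** t) \
        / (1 - (1 - decay_mom1) ** t)
    theta = theta - step_rate_t * m / (np.sqrt(v) + offset)
    return theta, m, v, t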
Note
The use of decay parameters \(\beta_1\) and \(\beta_2\) differs from the definition in the original paper [adam2014]: With \(\beta^{\ast}_i\) referring to the parameters as defined in the paper, we use \(\beta_i\) with \(\beta_i = 1 - \beta^{\ast}_i\)
[adam2014] Kingma, Diederik, and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980 (2014).

[nadam2015] Dozat, Timothy. “Incorporating Nesterov Momentum into Adam.” Stanford University, Tech. Rep. (2015).

Methods

__init__(wrt, fprime, step_rate=0.0002, decay=None, decay_mom1=0.1, decay_mom2=0.001, momentum=0, offset=1e-08, args=None)¶

Create an Adam object.

Parameters:

- wrt (array_like): Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.
- fprime (callable): First derivative of the objective function.
- step_rate (scalar or array_like, optional [default: 0.0002]): Value to multiply steps with before they are applied to the parameter vector.
- decay_mom1 (float, optional [default: 0.1]): Decay parameter for the exponential moving average estimate of the first moment.
- decay_mom2 (float, optional [default: 0.001]): Decay parameter for the exponential moving average estimate of the second moment.
- momentum (float or array_like, optional [default: 0]): Momentum to use during optimization. Can be specified analogously to (but independently of) step rate.
- offset (float, optional [default: 1e-8]): Before taking the square root of the running averages, this offset is added.
- args (iterable): Iterator over arguments which fprime will be called with.
Resilient Propagation¶
This module contains the Resilient propagation optimizer.
class climin.rprop.Rprop(wrt, fprime, step_shrink=0.5, step_grow=1.2, min_step=1e-06, max_step=1, changes_max=0.1, args=None)¶

Rprop optimizer.

Resilient propagation is an optimizer that was originally tailored towards neural networks. It can, however, be safely applied to all kinds of optimization problems. The idea is to have a parameter specific step rate which is determined by the sign changes of the derivative of the objective function.

To be more precise, given the derivative of the loss with respect to the parameters \(f'(\theta_t)\) at time step \(t\), the \(i\)th component of the vector of step rates \(\alpha\) is determined as
\[\begin{split}\alpha_i \leftarrow \begin{cases} \alpha_i \cdot \eta_{\text{grow}} ~\text{if}~ f'(\theta_t)_i \cdot f'(\theta_{t-1})_i > 0 \\ \alpha_i \cdot \eta_{\text{shrink}} ~\text{if}~ f'(\theta_t)_i \cdot f'(\theta_{t-1})_i < 0 \\ \alpha_i \end{cases}\end{split}\]where \(0 < \eta_{\text{shrink}} < 1 < \eta_{\text{grow}}\) specifies the shrink and growth rates of the step rates. Typically, we will threshold the step rates at minimum and maximum values.
The parameters are then adapted according to the sign of the error gradient:
\[\theta_{t+1} = \theta_t - \alpha~\text{sgn}(f'(\theta_t)).\]

This results in a method which is quite robust. On the other hand, it is more sensitive towards stochastic objectives, since that stochasticity might lead to bad estimates of the sign of the gradient.
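For concreteness, a minimal numpy sketch of the update rule above (illustrative only; the actual class also has a changes_max argument not modeled here):

import numpy as np

def rprop_step(theta, grad, prev_grad, alpha,
               step_shrink=0.5, step_grow=1.2, min_step=1e-6, max_step=1.):
    # Grow step rates where the gradient sign is stable, shrink where it flips.
    agreement = grad * prev_grad
    alpha = np.where(agreement > 0, alpha * step_grow, alpha)
    alpha = np.where(agreement < 0, alpha * step_shrink, alpha)
    alpha = np.clip(alpha, min_step, max_step)
    return theta - alpha * np.sign(grad), alpha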
Note
Works with gnumpy.
[riedmiller1992rprop] M. Riedmiller and H. Braun: Rprop - A Fast Adaptive Learning Algorithm. Proceedings of the International Symposium on Computer and Information Science VII, 1992.

Attributes

- wrt (array_like): Current solution to the problem. Can be given as a first argument to .fprime.
- fprime (callable): First derivative of the objective function. Returns an array of the same shape as .wrt.
- step_shrink (float): Constant to shrink step rates by if the gradients of the error do not agree over time.
- step_grow (float): Constant to grow step rates by if the gradients of the error do agree over time.
- min_step (float): Minimum step rate.
- max_step (float): Maximum step rate.

Methods

__init__(wrt, fprime, step_shrink=0.5, step_grow=1.2, min_step=1e-06, max_step=1, changes_max=0.1, args=None)¶

Create an Rprop object.

Parameters:

- wrt (array_like): Current solution to the problem. Can be given as a first argument to .fprime.
- fprime (callable): First derivative of the objective function. Returns an array of the same shape as .wrt.
- step_shrink (float): Constant to shrink step rates by if the gradients of the error do not agree over time.
- step_grow (float): Constant to grow step rates by if the gradients of the error do agree over time.
- min_step (float): Minimum step rate.
- max_step (float): Maximum step rate.
- args (iterable): Iterator over arguments which fprime will be called with.
Conjugate Gradients¶
Module containing functionality for conjugate gradients.
Conjugate gradients is motivated by a first order Taylor expansion of the objective:

\[f(\theta_t + \alpha_t d_t) \approx f(\theta_t) + \alpha_t d_t^T f'(\theta_t).\]
To locally decrease the objective, it is optimal to set \(d_t \propto -f'(\theta_t)\) and find \(\alpha_t\) with a line search algorithm, which is known as steepest descent. Yet, a well known disadvantage of this approach is that directions found at \(t\) will often interfere with directions found for \(t' < t\).
The solution to this problem is to choose \(d_t\) in a way that it does not interfere with previous updates. If the dimensions of our problem were independent, we could just move along these dimensions. If they were independent up to rotation, we would have to choose directions which are orthogonal to each other. This is exactly the case when the Hessian of the problem, \(A\), is diagonal. If it is not diagonal, we have to move along directions which are called conjugate to each other with respect to the matrix \(A\).
The conjugate gradients algorithms provide methods to do so efficiently. The linear conjugate gradients algorithm assumes that the objective is a quadratic and can thus determine \(\alpha\) exactly. Nonlinear conjugate gradients works on arbitrary functions (yet, the Taylor expansion assumption above has to be reasonable). Since the Hessian \(A\) is not constant in this case, the previous directions (to which a new direction has to be conjugate) have to be reset from time to time. Additionally, we need to perform a line search to solve for \(\alpha_t\).
class climin.cg.ConjugateGradient(wrt, H=None, b=None, f_Hp=None, min_grad=1e-14, precond=None)¶

ConjugateGradient class.

Minimize a quadratic objective of the form

\[f(\theta) = {1 \over 2} \theta^TA\theta + \theta^Tb + c.\]

The minimization will take place by moving along conjugate directions of steepest decrease in the error. This will take at most as many steps as the dimensionality of the problem.
Note
In most cases it is better to use scipy.linalg.solve. Only use this class if you want to monitor intermediate quantities, or if you are not primarily interested in the minimum of the quadratic objective but in a different error measure, e.g. as in Hessian free optimization.
Attributes

- wrt (array_like): Parameters of the problem.
- H (array_like, 2 dimensional, square): Curvature term of the quadratic, the Hessian.
- b (array_like): Linear term of the quadratic.
- f_Hp (callable): Function to calculate the dot product of the Hessian with an arbitrary vector.
- min_grad (float, optional, default: 1e-14): If all components of the gradient fall below this threshold, stop optimization.
- precond (array_like): Matrix to precondition the problem. If a vector, it is taken to represent a diagonal matrix.

Methods
__init__(wrt, H=None, b=None, f_Hp=None, min_grad=1e-14, precond=None)¶

Create a ConjugateGradient object.

Parameters:

- wrt (array_like): Parameters of the problem.
- H (array_like, 2 dimensional, square): Curvature term of the quadratic, the Hessian.
- b (array_like): Linear term of the quadratic.
- f_Hp (callable): Function to calculate the dot product of the Hessian with an arbitrary vector.
- min_grad (float, optional, default: 1e-14): If all components of the gradient fall below this threshold, stop optimization.
- precond (array_like): Matrix to precondition the problem. If a vector, it is taken to represent a diagonal matrix.

class climin.cg.NonlinearConjugateGradient(wrt, f, fprime, min_grad=1e-06, args=None)¶

NonlinearConjugateGradient optimizer.
NCG minimizes functions by following directions which are conjugate to each other with respect to the Hessian. Since the curvature changes if the objective is nonquadratic, the Hessian will not be accurate, and thus neither will be the conjugacy of successive search directions. Furthermore, the optimal step length cannot be found in closed form and has to be obtained with a line search.
During optimization, we sometimes perform a restart. That means we give up on trying to find conjugate directions and use the gradient as a new search direction. This is done whenever two successive directions are far from orthogonal, an indicator that the quadratic assumption is either inaccurate or the Hessian has changed too much lately.
Attributes

- wrt (array_like): Array of parameters of the problem.
- f (callable): Objective function.
- fprime (callable): First derivative of the objective function.
- min_grad (float): If all components of the gradient fall below this value, stop minimization.
- line_search (LineSearch object): Line search object to perform line searches with.
- args (iterable): Iterable of arguments passed on to the objective function and its derivatives.

Methods

__init__(wrt, f, fprime, min_grad=1e-06, args=None)¶

Create a NonlinearConjugateGradient object.

Parameters:

- wrt (array_like): Array of parameters of the problem.
- f (callable): Objective function.
- fprime (callable): First derivative of the objective function.
- min_grad (float): If all components of the gradient fall below this value, stop minimization.
- args (iterable, optional): Iterable of arguments passed on to the objective function and its derivatives.
Quasi-Newton (BFGS, L-BFGS)¶
This module provides an implementation of Quasi-Newton methods (BFGS, sBFGS and l-BFGS).
The Taylor expansion up to second order of a function \(f(\theta_t)\) allows a local quadratic approximation of \(f(\theta_t + d_t)\):
\[f(\theta_t + d_t) \approx f(\theta_t) + d_t^Tf'(\theta_t) + \frac{1}{2}d_t^TH_td_t\]
where the symmetric positive definite matrix \(H_t\) is the Hessian at \(\theta_t\). The minimizer \(d_t\) of this convex quadratic model is:
\[d_t = -H_t^{-1}f'(\theta_t).\]
For large scale problems both computing/storing the Hessian and solving the above linear system is computationally demanding. Instead of recomputing the Hessian from scratch at every iteration, quasi-Newton methods utilize successive measurements of the gradient to build a sufficiently good quadratic model of the objective function. The above formula is then applied to yield a direction \(d_t\). The update done is then of the form
\[\theta_{t+1} = \alpha_t d_t + \theta_t\]
where \(\alpha_t\) is obtained with a line search.
Note
The classes presented here do not work with gnumpy.
class climin.bfgs.Bfgs(wrt, f, fprime, initial_inv_hessian=None, line_search=None, args=None)¶

BFGS (Broyden-Fletcher-Goldfarb-Shanno) is one of the most well-known quasi-Newton methods. The main idea is to iteratively construct an approximate inverse Hessian \(B^{-1}_t\) by a rank-2 update:

\[B^{-1}_{t+1} = B^{-1}_t + (1 + \frac{y_t^TB^{-1}_ty_t}{y_t^Ts_t})\frac{s_ts_t^T}{s_t^Ty_t} - \frac{s_ty_t^TB^{-1}_t + B^{-1}_ty_ts_t^T}{s_t^Ty_t},\]

where \(y_t = f'(\theta_{t+1}) - f'(\theta_{t})\) and \(s_t = \theta_{t+1} - \theta_t\).
The storage requirements for BFGS scale quadratically with the number of variables. For detailed derivations, see [nocedal2006a], chapter 6.
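For concreteness, a direct numpy transcription of the rank-2 update above (a sketch, not climin's implementation; assumes a symmetric B_inv):

import numpy as np

def bfgs_inv_update(B_inv, s, y):
    # s: parameter step, y: gradient change between two iterations.
    sy = s.dot(y)
    By = B_inv.dot(y)
    term1 = (1 + y.dot(By) / sy) * np.outer(s, s) / sy
    term2 = (np.outer(s, By) + np.outer(By, s)) / sy
    return B_inv + term1 - term2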
[nocedal2006a] Nocedal, J. and Wright, S. (2006), Numerical Optimization, 2nd edition, Springer.

Attributes

- wrt (array_like): Current solution to the problem. Can be given as a first argument to .f and .fprime.
- f (callable): The objective function.
- fprime (callable): First derivative of the objective function. Returns an array of the same shape as .wrt.
- initial_inv_hessian (array_like): The initial estimate of the approximate inverse Hessian.
- line_search (LineSearch object): Line search object to perform line searches with.
- args (iterable): Iterator over arguments which fprime will be called with.

Methods

__init__(wrt, f, fprime, initial_inv_hessian=None, line_search=None, args=None)¶

Create a BFGS object.

Parameters:

- wrt (array_like): Array that represents the solution. Will be operated upon in place. f and fprime should accept this array as a first argument.
- f (callable): The objective function.
- fprime (callable): First derivative of the objective function.
- initial_inv_hessian (array_like): The initial estimate of the approximate inverse Hessian.
- line_search (LineSearch object): Line search object to perform line searches with.
- args (iterable): Iterator over arguments which fprime will be called with.
class climin.bfgs.Lbfgs(wrt, f, fprime, initial_hessian_diag=1, n_factors=10, line_search=None, args=None)¶

l-BFGS (limited-memory BFGS) is a limited memory variation of the well-known BFGS algorithm. The storage requirements of BFGS scale quadratically with the number of variables, and thus it tends to be used only for smaller problems. Limited-memory BFGS reduces the storage by only using the \(l\) latest updates (factors) for computing the approximate Hessian inverse and by representing this approximation only implicitly. More specifically, it stores the last \(l\) BFGS update vectors \(y_t\) and \(s_t\) and uses these to implicitly perform the matrix operations of BFGS (see [nocedal2006a]).
Note
In order to handle simple box constraints, consider scipy.optimize.fmin_l_bfgs_b.

Attributes

- wrt (array_like): Current solution to the problem. Can be given as a first argument to .f and .fprime.
- f (callable): The objective function.
- fprime (callable): First derivative of the objective function. Returns an array of the same shape as .wrt.
- initial_hessian_diag (array_like): The initial estimate of the diagonal of the Hessian.
- n_factors (int): The number of factors that should be used to implicitly represent the inverse Hessian.
- line_search (LineSearch object): Line search object to perform line searches with.
- args (iterable): Iterator over arguments which fprime will be called with.

Methods

__init__(wrt, f, fprime, initial_hessian_diag=1, n_factors=10, line_search=None, args=None)¶

Create an Lbfgs object.

Parameters:

- wrt (array_like): Current solution to the problem. Can be given as a first argument to .f and .fprime.
- f (callable): The objective function.
- fprime (callable): First derivative of the objective function. Returns an array of the same shape as .wrt.
- initial_hessian_diag (array_like): The initial estimate of the diagonal of the Hessian.
- n_factors (int): The number of factors that should be used to implicitly represent the inverse Hessian.
- line_search (LineSearch object): Line search object to perform line searches with.
- args (iterable): Iterator over arguments which fprime will be called with.
Convenience, Utilities¶
Schedules¶
This module holds various schedules for parameters such as the step rate or momentum for gradient descent.
A schedule is implemented as an iterator. This allows schedules to be of infinite length. It also makes it possible to manipulate schedules with the itertools python module, e.g. for chaining iterators.
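For example, a constant warm-up phase followed by exponential decay can be glued together with itertools.chain; a sketch:

import itertools

from climin.schedule import decaying, repeater

# 100 iterations at step rate 0.1, then exponential decay from 0.1.
steps = itertools.chain(repeater([0.1], 100), decaying(0.1, 0.99))

Such an iterator can then be passed directly as the step_rate argument of, e.g., GradientDescent.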
climin.schedule.decaying(start, decay)¶

Return an iterator of exponentially decaying values.

The first value is start. Every further value is obtained by multiplying the last one by a factor of decay.

Examples
>>> from climin.schedule import decaying
>>> s = decaying(10, .9)
>>> [next(s) for i in range(5)]
[10.0, 9.0, 8.100000000000001, 7.290000000000001, 6.561]
climin.schedule.linear_annealing(start, stop, n_steps)¶

Return an iterator that anneals linearly from start to stop.

The first value is start, the last value is stop. The annealing will be linear over n_steps iterations. After that, stop is yielded.

Examples
>>> from climin.schedule import linear_annealing
>>> s = linear_annealing(1, 0, 4)
>>> [next(s) for i in range(10)]
[1.0, 0.75, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
climin.schedule.repeater(iter, n)¶

Return an iterator that repeats each element of iter exactly n times before moving on to the next element.

Examples
>>> from climin.schedule import repeater
>>> s = repeater([1, 2, 3], 2)
>>> [next(s) for i in range(6)]
[1, 1, 2, 2, 3, 3]
class climin.schedule.SutskeverBlend(max_momentum, stretch=250)¶

Class representing a schedule that step-wise increases from zero to a maximum value, as described in [sutskever2013importance].

Examples
>>> from climin.schedule import SutskeverBlend
>>> s = iter(SutskeverBlend(0.9, 2))
>>> [next(s) for i in range(10)]
[0.5, 0.75, 0.75, 0.8333333333333333, 0.8333333333333333, 0.875, 0.875, 0.9, 0.9, 0.9]
[sutskever2013importance] On the importance of initialization and momentum in deep learning, Sutskever et al (ICML 2013)
Initialization of Parameters¶
Module that contains functionality to initialize parameters to starting values.
climin.initialize.sparsify_columns(arr, n_non_zero, keep_diagonal=False, random_state=None)¶

Set all but n_non_zero entries to zero for each column of arr.

This is a common technique to find better starting points for learning deep and/or recurrent networks.

Parameters:

- arr (array_like, two dimensional): Array to work upon in place.
- n_non_zero (integer): Amount of non zero entries to keep.
- keep_diagonal (boolean, optional [default: False]): If set to True and arr is square, do keep the diagonal.
- random_state (numpy.random.RandomState object, optional [default: None]): If set, random number generator that will generate the indices of the entries that are set to zero.
Examples
>>> import numpy as np
>>> from climin.initialize import sparsify_columns
>>> arr = np.arange(9).reshape((3, 3))
>>> sparsify_columns(arr, 1)
>>> arr
array([[0, 0, 0],
       [0, 4, 5],
       [6, 0, 0]])
climin.initialize.bound_spectral_radius(arr, bound=1.2)¶

Set the spectral radius of the square matrix arr to bound.

This is performed by scaling the eigenvalues of arr.

Parameters:

- arr (array_like, two dimensional): Array to work upon in place.
- bound (float, optional [default: 1.2]): Value to set the spectral radius to.
Examples
>>> import numpy as np
>>> from climin.initialize import bound_spectral_radius
>>> arr = np.arange(9).reshape((3, 3)).astype('float64')
>>> bound_spectral_radius(arr, 1.1)
>>> arr
array([[ -7.86816957e-17,   8.98979486e-02,   1.79795897e-01],
       [  2.69693846e-01,   3.59591794e-01,   4.49489743e-01],
       [  5.39387691e-01,   6.29285640e-01,   7.19183588e-01]])
climin.initialize.randomize_normal(arr, loc=0, scale=1, random_state=None)¶

Populate an array with random numbers from a normal distribution with mean loc and standard deviation scale.

Parameters:

- arr (array_like): Array to work upon in place.
- loc (float): Mean of the random numbers.
- scale (float): Standard deviation of the random numbers.
- random_state (np.random.RandomState object, optional [default: None]): Random number generator that shall generate the random numbers.
Examples
>>> import numpy as np
>>> from climin.initialize import randomize_normal
>>> arr = np.empty((3, 3))
>>> randomize_normal(arr)
>>> arr
array([[ 0.18076413,  0.60880657,  1.20855691],
       [ 1.7799948 , -0.82565481,  0.53875307],
       [-0.67056028, -1.46257419,  1.17033425]])
>>> randomize_normal(arr, 10, 0.1)
>>> arr
array([[ 10.02221481,  10.0982449 ,  10.02495358],
       [  9.99867829,   9.99410111,   9.8242318 ],
       [  9.9383779 ,   9.94880091,  10.03179085]])
Line searches¶
Module containing various line searches.
Line searches are at the heart of many optimizers. After finding a suitable search direction (e.g. the steepest descent direction) we are left with a one-dimensional optimization problem, which can then be solved by a line search.
class climin.linesearch.BackTrack(wrt, f, decay=0.9, max_iter=inf, tolerance=1e-20)¶

Class implementing a back tracking line search.
The idea is to jump to a starting step length \(t\) and then shrink that step length by multiplying it with \(\gamma\) until we improve upon the loss.
At most max_iter attempts will be done. If the largest absolute value of a component of the step falls below tolerance, we stop as well. In both cases, a step length of 0 is returned.

To avoid iterating forever, the field tolerance holds a small value (1e-20 per default). As soon as the absolute value of every component of the step (direction multiplied with the scalar from the schedule) is less than tolerance, we stop.
Attributes
wrt (array_like) Parameters over which the optimization is done. f (Callable) Objective function. decay (float) Factor to multiply trials for the step length with. tolerance (float) Minimum absolute value of a component of the step without stopping the line search. Methods
-
__init__(wrt, f, decay=0.9, max_iter=inf, tolerance=1e-20)¶ Create BackTrack object.
Parameters: wrt : array_like
Parameters over which the optimization is done.
f : Callable
Objective function.
decay : float
Factor to multiply trials for the step length with.
max_iter : int, optional, default infinity
Number of step lengths to try.
tolerance : float
If every component of the step falls below this value in absolute terms, the line search stops.
-
search(direction, initialization=1, args=None, kwargs=None, loss0=None)¶ Return a step length t given a search direction.
Perform the line search along a direction. Search will start at initialization and assume that the loss is loss0 at t == 0.
Parameters: direction : array_like
Has to be of the same size as .wrt. Points along that direction will be tried out to reduce the loss.
initialization : float
First attempt for a step size. Will be reduced by a factor of .decay afterwards.
args : list, optional, default: None
List of optional arguments for .f.
kwargs : dictionary, optional, default: None
Dictionary of optional keyword arguments for .f.
loss0 : float, optional
Loss at the current parameters. Will be calculated if not given.
-
-
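To illustrate the protocol, here is a minimal sketch of BackTrack on a made-up quadratic loss; the call follows the search signature documented above:
import numpy as np
from climin.linesearch import BackTrack

wrt = np.array([3.0, -2.0])

def loss(pars):
    # Simple quadratic bowl with its minimum at the origin.
    return (pars ** 2).sum()

line_search = BackTrack(wrt, loss, decay=0.9)

direction = -2 * wrt                 # steepest descent direction for this loss
t = line_search.search(direction)    # back-tracked step length
wrt += t * direction                 # apply the step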
class climin.linesearch.StrongWolfeBackTrack(wrt, f, fprime, decay=None, c1=0.0001, c2=0.9, tolerance=1e-20)¶ Class implementing a back tracking line search that finds points satisfying the strong Wolfe conditions.
The idea is to jump to a starting step length \(t\) and then shrink that step length by multiplying it with \(\gamma\) until the strong Wolfe conditions are satisfied. That is the Armijo rule
\[f(\theta_t + \alpha_t d_t) \leq f(\theta_t) + c_1 \alpha_t d_t^T f'(\theta_t),\]
and the curvature condition
\[\big|d_t^T f'(\theta_t + \alpha_t d_t)\big| \leq c_2 \big|d_t^T f'(\theta_t)\big|.\]
At most max_iter attempts will be made. To avoid iterating forever, the field tolerance holds a small value (1e-20 by default); as soon as the absolute value of every component of the step (the direction multiplied with the current step length) is less than tolerance, the search stops as well. In both cases, a step length of 0 is returned.
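The two conditions translate directly into code; the following sketch checks them numerically for a candidate step length (strong_wolfe_satisfied is a hypothetical helper, not part of climin):
import numpy as np

def strong_wolfe_satisfied(f, fprime, theta, d, alpha, c1=1e-4, c2=0.9):
    # Armijo rule: the step achieves a sufficient decrease of the loss.
    armijo = (f(theta + alpha * d)
              <= f(theta) + c1 * alpha * np.dot(d, fprime(theta)))
    # Curvature condition: the directional derivative has shrunk enough.
    curvature = (abs(np.dot(d, fprime(theta + alpha * d)))
                 <= c2 * abs(np.dot(d, fprime(theta))))
    return armijo and curvature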
Attributes
wrt (array_like) Parameters over which the optimization is done.
f (Callable) Objective function.
decay (float) Factor to multiply trials for the step length with.
tolerance (float) If every component of the step falls below this value in absolute terms, the line search stops.
c1 (float) Constant in the strong Wolfe conditions.
c2 (float) Constant in the strong Wolfe conditions.
Methods
-
__init__(wrt, f, fprime, decay=None, c1=0.0001, c2=0.9, tolerance=1e-20)¶ Create StrongWolfeBackTrack object.
Parameters: wrt : array_like
Parameters over which the optimization is done.
f : Callable
Objective function.
fprime : Callable
First derivative of the objective function.
decay : float
Factor to multiply trials for the step length with.
c1 : float
Constant in the strong Wolfe conditions.
c2 : float
Constant in the strong Wolfe conditions.
tolerance : float
If every component of the step falls below this value in absolute terms, the line search stops.
-
-
class climin.linesearch.ScipyLineSearch(wrt, f, fprime)¶ Wrapper around the scipy line search.
Methods
-
class climin.linesearch.WolfeLineSearch(wrt, f, fprime, c1=0.0001, c2=0.9, maxiter=25, min_step_length=1e-09, typ=4)¶ Port of Mark Schmidt’s line search.
Methods
Utility functions¶
-
climin.util.empty_with_views(shapes, empty_func=<built-in function empty>)¶ Create an array and views shaped according to shapes.
The shapes parameter is a list of tuples of ints. Each tuple represents a desired shape for an array which will be allocated in a bigger memory region. This memory region will be represented by an array as well.
For example, the shape specification [2, (3, 2)] will create an array flat of size 8. The first view will have a shape of (2,) and point to the first two entries, i.e. flat[:2], while the second view will have a shape of (3, 2) and point to the elements flat[2:8].
Parameters: shapes : list of tuples of ints
Specification of the desired shapes.
empty_func : callable
Function that returns a memory region given an integer of the desired size. (Examples include numpy.empty, which is the default, gnumpy.empty and theano.tensor.empty.)
Returns: flat : array_like (depending on empty_func)
Memory region containing all the views.
views : list of array_like
Variable number of results. Each contains a view into the array flat.
>>> from climin.util import empty_with_views
>>> flat, (w, b) = empty_with_views([(3, 2), 2])
>>> w[...] = 1
>>> b[...] = 2
>>> flat
array([ 1.,  1.,  1.,  1.,  1.,  1.,  2.,  2.])
>>> flat[0] = 3
>>> w
array([[ 3.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
-
climin.util.shaped_from_flat(flat, shapes)¶ Given a one dimensional array flat, return a list of views of shapes shapes on that array.
Each view will point to a distinct memory region, consecutively allocated in flat.
Parameters: flat : array_like
Array of one dimension.
shapes : list of tuples of ints
Each entry of this list specifies the shape of the corresponding view into flat.
Returns: views : list of arrays
Each entry has the shape given in shapes and points as a view into flat.
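This is handy for unpacking an optimizer's flat parameter array inside a loss function; a minimal sketch with made-up shapes and a simple linear model:
import numpy as np
from climin.util import shaped_from_flat

tmpl = [(10, 5), 5]    # a weight matrix and a bias vector

def loss(flat_pars, x, z):
    # Views into flat_pars; no copying takes place.
    w, b = shaped_from_flat(flat_pars, tmpl)
    y = np.dot(x, w) + b
    return ((y - z) ** 2).mean()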
-
climin.util.minibatches(arr, batch_size, d=0)¶ Return a list of views of the given arr.
Each view represents a mini batch of the data.
Parameters: arr : array_like
Array to obtain batches from. Needs to be sliceable. If d > 0, needs to have a .shape attribute from which the number of samples can be obtained.
batch_size : int
Size of a batch. The last batch might be smaller if batch_size is not a divisor of the number of samples in arr.
d : int, optional, default: 0
Dimension along which the data samples are separated and thus slicing should be done.
Returns: mini_batches : list
Each item of the list is a view of arr. Views are ordered.
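A small sketch of how the returned views line up, using a made-up toy array:
import numpy as np
from climin.util import minibatches

X = np.arange(10).reshape((5, 2))   # 5 samples with 2 features each
for batch in minibatches(X, 2):     # slices along dimension 0 by default
    print batch.shape               # (2, 2), (2, 2), (1, 2)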
-
climin.util.iter_minibatches(lst, batch_size, dims, n_cycles=None, random_state=None, discard_illsized_batch=False)¶ Return an iterator that successively yields tuples containing aligned minibatches of size batch_size from the sliceable objects given in lst, in random order and without replacement. Because different containers might require slicing over different dimensions, the dimension of each container has to be given as a list dims.
Parameters: lst : list of array_like
Each item of the list will be sliced into mini batches in alignment with the others.
batch_size : int
Size of each batch. Last batch might be smaller.
dims : list
Aligned with lst, gives the dimension along which the data samples are separated.
n_cycles : int, optional [default: None]
Number of cycles after which to stop the iterator. If None, will yield forever.
random_state : a numpy.random.RandomState object, optional [default: None]
Random number generator that will act as a seed for the minibatch order.
discard_illsized_batch : bool, optional [default: False]
If True and the length of the sliced dimension is not divisible by batch_size, the leftover samples are discarded.
Returns: batches : iterator
Iterator over mini batches in random order (without replacement); yields forever unless n_cycles is given.
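This iterator plugs directly into the args protocol of climin's optimizers, which expect a stream of (args, kwargs) tuples. A minimal sketch with made-up data:
import numpy as np
from climin.util import iter_minibatches

X = np.random.randn(100, 10)   # made-up inputs
Z = np.random.randn(100, 1)    # made-up targets

# Endless stream of aligned (inputs, targets) batches of 10 samples each,
# wrapped as (args, kwargs) tuples suitable for an optimizer's args argument.
batches = iter_minibatches([X, Z], 10, [0, 0])
args = (((x, z), {}) for x, z in batches)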