This module provides an implementation of gradient descent.

class climin.gd.GradientDescent(wrt, fprime, step_rate=0.1, momentum=0.0, momentum_type='standard', args=None)

Gradient descent works by iteratively performing updates based solely on the first derivative of a problem. The gradient is calculated and multiplied by a scalar step rate (or component-wise by a vector) to take a step in the problem space. For speed-ups, a technique called “momentum” is often used, which averages search steps over iterations.

Even though gradient descent is pretty simple, it can be very effective if well tuned (in terms of its hyperparameters step rate and momentum). Sometimes the use of schedules for both parameters is necessary. See climin.schedule for basic schedules.

Gradient descent is also very robust to stochasticity in the objective function. This might result from noise injected into it (e.g. in the case of denoising autoencoders) or because it is based on data samples (e.g. in the case of stochastic mini-batches).

Given a step rate $$\alpha$$ and a function $$f'$$ to evaluate the search direction at the current parameters $$\theta_t$$, the following update is performed:

$\begin{split}v_{t+1} &= \alpha f'(\theta_t) \\ \theta_{t+1} &= \theta_t - v_{t+1}.\end{split}$
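This plain update can be sketched in a few lines of NumPy. The quadratic objective and the variable names here are illustrative, not part of the climin API:

```python
import numpy as np

# Plain gradient descent on f(theta) = 0.5 * ||theta||^2,
# whose gradient is f'(theta) = theta.
fprime = lambda theta: theta
alpha = 0.1                       # step rate
theta = np.array([1.0, -2.0])

for _ in range(100):
    v = alpha * fprime(theta)     # v_{t+1} = alpha * f'(theta_t)
    theta = theta - v             # theta_{t+1} = theta_t - v_{t+1}

# theta is now close to the minimum at the origin.
```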

If we also have a momentum $$\beta$$ and are using standard momentum, we update the parameters according to:

$\begin{split}v_{t+1} &= \alpha f'(\theta_t) + \beta v_{t} \\ \theta_{t+1} &= \theta_t - v_{t+1}\end{split}$
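With standard momentum, the previous step is carried over into the new one. A minimal sketch on the same illustrative quadratic (again, not climin's own code):

```python
import numpy as np

# Standard momentum on f(theta) = 0.5 * ||theta||^2.
fprime = lambda theta: theta
alpha, beta = 0.1, 0.9            # step rate and momentum
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)          # v_0 = 0

for _ in range(300):
    # v_{t+1} = alpha * f'(theta_t) + beta * v_t
    v = alpha * fprime(theta) + beta * v
    theta = theta - v             # theta_{t+1} = theta_t - v_{t+1}
```

With a high momentum such as 0.9, the iterates oscillate around the minimum before settling, which is why more iterations are used here than in the momentum-free sketch.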

In some cases (e.g. learning the parameters of deep networks), using Nesterov momentum can be beneficial. In this case, we first make a momentum step and then evaluate the gradient at the resulting intermediate location. This incurs the additional cost of one more addition of parameter-sized arrays.

$\begin{split}\theta_{t+{1 \over 2}} &= \theta_t - \beta v_t \\ v_{t+1} &= \alpha f'(\theta_{t + {1 \over 2}}) + \beta v_t \\ \theta_{t+1} &= \theta_t - v_{t+1}\end{split}$
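The difference from standard momentum is only where the gradient is evaluated: at the look-ahead point rather than the current parameters. A sketch on the same illustrative quadratic:

```python
import numpy as np

# Nesterov momentum on f(theta) = 0.5 * ||theta||^2.
fprime = lambda theta: theta
alpha, beta = 0.1, 0.9
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)

for _ in range(300):
    theta_half = theta - beta * v                # momentum step
    v = alpha * fprime(theta_half) + beta * v    # gradient at look-ahead point
    theta = theta - v
```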

The momentum type is selected with the momentum_type initialization argument.

Note

Works with gnumpy.

Attributes

wrt (array_like)
    Current solution to the problem. Can be given as a first argument to .fprime.
fprime (Callable)
    First derivative of the objective function. Returns an array of the same shape as .wrt.
step_rate (float or array_like)
    Step rate to multiply the gradients with.
momentum (float or array_like)
    Momentum to multiply previous steps with.
momentum_type (string, either “standard” or “nesterov”)
    When to add the momentum term to the parameter vector: for “standard” it is added after the gradient is calculated at the current parameters, for “nesterov” before.

Methods

__init__(wrt, fprime, step_rate=0.1, momentum=0.0, momentum_type='standard', args=None)
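Both momentum variants can be combined behind the documented interface. The following is a minimal, self-contained sketch mirroring the attribute names above; it is not the climin implementation, and the `minimize_step` method is a hypothetical name (the real climin optimizers additionally accept args for streaming data and are iterated over, yielding info dictionaries, rather than stepped manually):

```python
import numpy as np

class GradientDescentSketch:
    """Illustrative sketch of the documented interface, not climin's code."""

    def __init__(self, wrt, fprime, step_rate=0.1, momentum=0.0,
                 momentum_type='standard'):
        self.wrt = wrt                     # updated in place
        self.fprime = fprime
        self.step_rate = step_rate
        self.momentum = momentum
        self.momentum_type = momentum_type
        self.step = np.zeros_like(wrt)     # v_t, the previous step

    def minimize_step(self):               # hypothetical method name
        if self.momentum_type == 'nesterov':
            # Evaluate the gradient at the look-ahead point theta_t - beta v_t.
            gradient = self.fprime(self.wrt - self.momentum * self.step)
        else:
            # Evaluate the gradient at the current parameters theta_t.
            gradient = self.fprime(self.wrt)
        self.step = self.step_rate * gradient + self.momentum * self.step
        self.wrt -= self.step

# Usage on an illustrative quadratic with gradient f'(x) = x:
wrt = np.array([1.0, -2.0])
opt = GradientDescentSketch(wrt, lambda x: x, step_rate=0.1,
                            momentum=0.9, momentum_type='nesterov')
for _ in range(300):
    opt.minimize_step()
# wrt has been moved close to the minimum at the origin.
```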