Gradient Descent
This module provides an implementation of gradient descent.

class climin.gd.GradientDescent(wrt, fprime, step_rate=0.1, momentum=0.0, momentum_type='standard', args=None)

Classic gradient descent optimizer.
Gradient descent works by iteratively performing updates based solely on the first derivative of a problem. The gradient is calculated and multiplied with a scalar (or component-wise with a vector) to take a step in the problem space. For speed-ups, a technique called “momentum” is often used, which averages search steps over iterations.
Even though gradient descent is pretty simple, it can be very effective if well tuned (in terms of its hyperparameters step rate and momentum). Sometimes the use of schedules for both parameters is necessary. See climin.schedule for basic schedules.

Gradient descent is also very robust to stochasticity in the objective function. This might result from noise injected into it (e.g. in the case of denoising autoencoders) or because it is based on data samples (e.g. in the case of stochastic mini-batches).
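The robustness to stochasticity can be illustrated with a small, hypothetical sketch (not part of climin): minimizing \(f(x) = \frac{1}{2}\|x\|^2\) while each gradient evaluation is corrupted by zero-mean noise, mimicking a mini-batch objective.

```python
import numpy as np

# Hypothetical illustration, not climin code: plain gradient descent on
# f(x) = 0.5 * ||x||^2 with a noisy gradient. Because the noise has zero
# mean, the iterates still settle close to the minimum.
rng = np.random.default_rng(0)
x = np.ones(5)
step_rate = 0.1
for _ in range(500):
    grad = x + 0.1 * rng.normal(size=x.shape)  # true gradient plus zero-mean noise
    x -= step_rate * grad

print(np.linalg.norm(x))  # small residual norm despite the noise
```

The iterates do not converge exactly to zero; they fluctuate around the minimum with a spread proportional to the step rate times the noise level, which is one reason decaying step-rate schedules are useful.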
Given a step rate \(\alpha\) and a function \(f'\) to evaluate the search direction at the current parameters \(\theta_t\), the following update is performed:
\[\begin{split}v_{t+1} &= \alpha f'(\theta_t) \\ \theta_{t+1} &= \theta_t - v_{t+1}.\end{split}\]

If we also have a momentum \(\beta\) and are using standard momentum, we update the parameters according to:
\[\begin{split}v_{t+1} &= \alpha f'(\theta_t) + \beta v_{t} \\ \theta_{t+1} &= \theta_t - v_{t+1}\end{split}\]

In some cases (e.g. learning the parameters of deep networks), using Nesterov momentum can be beneficial. In this case, we first make a momentum step and then evaluate the gradient at the intermediate point. This incurs the additional cost of one extra addition on the parameter vector.
\[\begin{split}\theta_{t+{1 \over 2}} &= \theta_t - \beta v_t \\ v_{t+1} &= \alpha f'(\theta_{t + {1 \over 2}}) \\ \theta_{t+1} &= \theta_t - v_{t+1}\end{split}\]

The momentum variant can be specified by the initialization argument momentum_type.

Note
Works with gnumpy.
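The update rules above can be sketched in plain NumPy. This is a hypothetical, self-contained illustration of the equations, not climin's actual implementation:

```python
import numpy as np

def gd_step(theta, v, fprime, step_rate, momentum=0.0, momentum_type='standard'):
    """One gradient descent step, mirroring the update equations above.

    Hypothetical sketch, not climin code."""
    if momentum_type == 'standard':
        # v_{t+1} = alpha * f'(theta_t) + beta * v_t
        v = step_rate * fprime(theta) + momentum * v
    elif momentum_type == 'nesterov':
        # First take a momentum step, then evaluate the gradient there:
        # v_{t+1} = alpha * f'(theta_t - beta * v_t)
        v = step_rate * fprime(theta - momentum * v)
    else:
        raise ValueError('unknown momentum_type')
    # theta_{t+1} = theta_t - v_{t+1}
    return theta - v, v

# Minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = gd_step(theta, v, lambda t: t, step_rate=0.1, momentum=0.5)

print(np.linalg.norm(theta))  # close to zero after 100 steps
```

Note how the only difference between the two momentum types is where the gradient is evaluated: at the current parameters for standard momentum, at the intermediate point after the momentum step for Nesterov.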
Attributes

wrt : array_like
Current solution to the problem. Can be given as a first argument to .fprime.
fprime : callable
First derivative of the objective function. Returns an array of the same shape as .wrt.
step_rate : float or array_like
Step rate to multiply the gradients with.
momentum : float or array_like
Momentum to multiply previous steps with.
momentum_type : string (either “standard” or “nesterov”)
When to add the momentum term to the parameter vector; in the first case it will be done after the calculation of the gradient, in the latter before.

Methods

__init__(wrt, fprime, step_rate=0.1, momentum=0.0, momentum_type='standard', args=None)

Create a GradientDescent object.
Parameters: wrt : array_like
Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.
fprime : callable
First derivative of the objective function. Returns an array of the same shape as wrt.
step_rate : float or array_like, or iterable of that
Step rate to use during optimization. Can be given as a single scalar value or as an array for a different step rate of each parameter of the problem.
Can also be given as an iterator; in that case, every iteration of the optimization takes a new element as a step rate from that iterator.
momentum : float or array_like, or iterable of that
Momentum to use during optimization. Can be specified analogously to (but independently of) the step rate.
momentum_type : string (either “standard” or “nesterov”)
When to add the momentum term to the parameter vector; in the first case it will be done after the calculation of the gradient, in the latter before.
args : iterable
Iterator of arguments which fprime will be called with.
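The step_rate semantics described above (a per-parameter array, or an iterator yielding a new value each iteration) can be sketched as follows. This is a hypothetical plain-NumPy illustration of those semantics, not climin code:

```python
import itertools
import numpy as np

fprime = lambda t: t  # gradient of f(t) = 0.5 * ||t||^2

# Per-parameter step rates given as an array: each gradient component is
# scaled separately, so the second parameter moves ten times faster.
theta_a = np.array([1.0, 1.0])
step_rates = np.array([0.01, 0.1])
for _ in range(50):
    theta_a -= step_rates * fprime(theta_a)

print(theta_a)  # second component has decayed much further than the first

# Step rate given as an iterator: every iteration draws the next value,
# here a geometrically decaying schedule.
theta_b = np.array([1.0, 1.0])
schedule = (0.1 * 0.99 ** i for i in itertools.count())
for _ in range(50):
    theta_b -= next(schedule) * fprime(theta_b)
```

Passing an iterator is what makes the schedules mentioned above (see climin.schedule) compose with the optimizer: the optimizer simply consumes one step rate per iteration.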
