rmsprop

This module provides an implementation of rmsprop.

class climin.rmsprop.RmsProp(wrt, fprime, step_rate, decay=0.9, momentum=0, step_adapt=False, step_rate_min=0, step_rate_max=inf, args=None)

RmsProp optimizer.

RmsProp [tieleman2012rmsprop] is an optimizer that uses the magnitude of recent gradients to normalize the current gradient. It maintains a moving average of the squared gradients and divides the current gradient by the root of that average (hence Rms, root mean square). Let \(f'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\) and a decay term \(\gamma\), we perform the following updates:

\[\begin{split}r_t &= (1 - \gamma)~f'(\theta_t)^2 + \gamma r_{t-1}, \\ v_{t+1} &= {\alpha \over \sqrt{r_t}} f'(\theta_t), \\ \theta_{t+1} &= \theta_t - v_{t+1}.\end{split}\]
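For concreteness, a minimal NumPy sketch of this basic update follows. The function name and the small constant eps (which guards against division by zero) are illustrative assumptions, not part of climin's implementation:

    import numpy as np

    def rmsprop_step(theta, grad, r, step_rate=0.01, decay=0.9, eps=1e-8):
        # Moving average of the squared gradients.
        r = (1 - decay) * grad ** 2 + decay * r
        # Normalize the gradient by the root of the moving average;
        # eps is an assumed stabilizer, not shown in the formulas above.
        step = step_rate / np.sqrt(r + eps) * grad
        return theta - step, r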

In some cases, adding a momentum term \(\beta\) is beneficial. Here, Nesterov momentum is used:

\[\begin{split}\theta_{t+{1 \over 2}} &= \theta_t - \beta v_t, \\ r_t &= (1 - \gamma)~f'(\theta_{t + {1 \over 2}})^2 + \gamma r_{t-1}, \\ v_{t+1} &= \beta v_t + {\alpha \over \sqrt{r_t}} f'(\theta_{t + {1 \over 2}}), \\ \theta_{t+1} &= \theta_t - v_{t+1}.\end{split}\]
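The momentum variant can be sketched the same way; fprime is assumed to map parameters to the gradient, and eps is again an assumed stabilizer:

    import numpy as np

    def rmsprop_nesterov_step(theta, fprime, r, v, step_rate=0.01,
                              decay=0.9, momentum=0.9, eps=1e-8):
        # Look-ahead step in the direction of the current momentum.
        theta_half = theta - momentum * v
        grad = fprime(theta_half)
        # Moving average of squared gradients at the look-ahead point.
        r = (1 - decay) * grad ** 2 + decay * r
        # New momentum combines the old one with the normalized gradient.
        v = momentum * v + step_rate / np.sqrt(r + eps) * grad
        return theta - v, r, v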

Additionally, this implementation has adaptable step rates. Whenever the components of the step and of the momentum point in the same direction (i.e., have the same sign), the step rate for that parameter is multiplied by 1 + step_adapt; otherwise, it is multiplied by 1 - step_adapt. In either case, the resulting step rates are clipped to the interval [step_rate_min, step_rate_max]. A sketch of this rule is given below.
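As a sketch of the adaptation rule (argument names are illustrative, not climin's internals):

    import numpy as np

    def adapt_step_rates(step_rates, step, momentum_term, step_adapt,
                         step_rate_min=0.0, step_rate_max=np.inf):
        # Where step and momentum agree in sign, grow the step rate;
        # where they disagree, shrink it.
        agree = np.sign(step) == np.sign(momentum_term)
        step_rates = np.where(agree,
                              step_rates * (1 + step_adapt),
                              step_rates * (1 - step_adapt))
        # Respect the configured bounds.
        return np.clip(step_rates, step_rate_min, step_rate_max)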

RmsProp has several advantages: for one, it is a very robust optimizer that incorporates pseudo-curvature information. Additionally, it deals well with stochastic objectives, making it applicable to mini-batch learning.

Note

Works with gnumpy.

[tieleman2012rmsprop] Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - rmsprop. COURSERA: Neural Networks for Machine Learning.

Attributes

wrt (array_like) Current solution to the problem. Can be given as a first argument to .fprime.
fprime (Callable) First derivative of the objective function. Returns an array of the same shape as .wrt.
step_rate (float or array_like) Step rate of the optimizer. If an array, per-parameter step rates are used.
momentum (float or array_like) Momentum of the optimizer. If an array, per-parameter momentum values are used.
step_adapt (float or bool) Constant by which step rates are adapted. If False, step rate adaptation is not done.
step_rate_min (float, optional, default 0) When adapting step rates, do not move below this value.
step_rate_max (float, optional, default inf) When adapting step rates, do not move above this value.

Methods

__init__(wrt, fprime, step_rate, decay=0.9, momentum=0, step_adapt=False, step_rate_min=0, step_rate_max=inf, args=None)

Create an RmsProp object.

Parameters:

wrt : array_like

Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.

fprime : callable

Callable that, given a solution vector as its first argument along with *args and **kwargs drawn from the iterations of args, returns a search direction, such as a gradient.

step_rate : float or array_like

Step rate to use during optimization. Can be given as a single scalar value or as an array with a different step rate for each parameter of the problem.

decay : float

Decay parameter for the moving average. Must lie in [0, 1), where lower values mean a shorter “memory”.

momentum : float or array_like

Momentum to use during optimization. Can be specified analogously to (but independently of) the step rate.

step_adapt : float or bool

Constant by which step rates are adapted. If False, step rate adaptation is not done.

step_rate_min : float, optional, default 0

When adapting step rates, do not move below this value.

step_rate_max : float, optional, default inf

When adapting step rates, do not move above this value.

args : iterable

Iterator over arguments with which fprime will be called.
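A usage sketch follows. The quadratic objective and the stopping criterion are illustrative; climin optimizers are iterated over and yield info dictionaries, assumed here to carry an n_iter counter as in the library's other optimizers:

    import numpy as np
    from climin import RmsProp

    # Illustrative objective f(theta) = 0.5 * ||theta||^2; its gradient
    # is theta itself.
    def fprime(theta):
        return theta

    wrt = np.random.randn(10)  # parameters, updated in place
    opt = RmsProp(wrt, fprime, step_rate=0.01, decay=0.9, momentum=0.9)

    for info in opt:
        if info['n_iter'] >= 100:
            break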