Adadelta

This module provides an implementation of Adadelta.

class climin.adadelta.Adadelta(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)

Adadelta optimizer.

Adadelta [zeiler2013adadelta] is a method that uses the magnitude of recent gradients and steps to obtain an adaptive step rate. Exponential moving averages of the squared gradients and of the squared steps are kept; the scaling of the step rate is then obtained from their ratio.

Let \(f'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\), a decay term \(\gamma\) and an offset \(\epsilon\) we perform the following updates:

\[\begin{split}g_t &=& (1 - \gamma)~f'(\theta_t)^2 + \gamma g_{t-1}\end{split}\]

where \(g_0 = 0\). Let \(s_0 = 0\) for updating the parameters:

\[\begin{split}\Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_t), \\ \theta_{t+1} &=& \theta_t - \Delta \theta_t.\end{split}\]

Subsequently we adapt the moving average of the steps:

\[\begin{split}s_t &=& (1 - \gamma)~\Delta\theta_t^2 + \gamma s_{t-1}.\end{split}\]
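The three updates above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the function name `adadelta_step`, the list-of-floats representation and the toy quadratic loss are assumptions made for the example.

```python
import math


def adadelta_step(theta, grad, g, s, step_rate=1.0, decay=0.9, offset=1e-4):
    """One basic Adadelta update on lists of floats; returns (theta, g, s)."""
    new_theta, new_g, new_s = [], [], []
    for th, gr, g_prev, s_prev in zip(theta, grad, g, s):
        # g_t = (1 - gamma) * f'(theta_t)^2 + gamma * g_{t-1}
        g_t = (1 - decay) * gr ** 2 + decay * g_prev
        # Delta theta_t = alpha * sqrt(s_{t-1} + eps) / sqrt(g_t + eps) * f'(theta_t)
        delta = step_rate * math.sqrt(s_prev + offset) / math.sqrt(g_t + offset) * gr
        # s_t = (1 - gamma) * Delta theta_t^2 + gamma * s_{t-1}
        s_t = (1 - decay) * delta ** 2 + decay * s_prev
        new_theta.append(th - delta)  # theta_{t+1} = theta_t - Delta theta_t
        new_g.append(g_t)
        new_s.append(s_t)
    return new_theta, new_g, new_s


# Hypothetical toy loss f(theta) = sum(theta_i^2), so f'(theta) = 2 * theta.
theta = [1.0, -2.0]
g = [0.0, 0.0]  # g_0 = 0
s = [0.0, 0.0]  # s_0 = 0
for _ in range(3):
    grad = [2.0 * th for th in theta]
    theta, g, s = adadelta_step(theta, grad, g, s)
```

Because \(s_0 = 0\), the first step is scaled by \(\sqrt{\epsilon}\) alone, so the offset also sets the size of the initial steps.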

To extend this with Nesterov’s accelerated gradient, we need a momentum coefficient \(\beta\) and incorporate it by using slightly different formulas:

\[\begin{split}\theta_{t + {1 \over 2}} &=& \theta_t - \beta \Delta \theta_{t-1}, \\ g_t &=& (1 - \gamma)~f'(\theta_{t + {1 \over 2}})^2 + \gamma g_{t-1}, \\ \Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_{t + {1 \over 2}}).\end{split}\]
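A hedged sketch of the momentum variant in plain Python follows. The formulas above do not state the final parameter update or which step feeds \(s_t\), so the sketch assumes \(\theta_{t+1} = \theta_{t + {1 \over 2}} - \Delta\theta_t\) and keeps updating \(s_t\) from \(\Delta\theta_t\); the library may differ on both points. The name `adadelta_nesterov_step` and the toy gradient are hypothetical.

```python
import math


def adadelta_nesterov_step(theta, delta_prev, g, s, fprime,
                           step_rate=1.0, decay=0.9, momentum=0.5, offset=1e-4):
    """One Adadelta step with a look-ahead point, on lists of floats."""
    # theta_{t+1/2} = theta_t - beta * Delta theta_{t-1}
    theta_half = [th - momentum * d for th, d in zip(theta, delta_prev)]
    grad = fprime(theta_half)  # gradient evaluated at the look-ahead point
    # g_t = (1 - gamma) * f'(theta_{t+1/2})^2 + gamma * g_{t-1}
    g = [(1 - decay) * gr ** 2 + decay * gp for gr, gp in zip(grad, g)]
    # Delta theta_t = alpha * sqrt(s_{t-1} + eps) / sqrt(g_t + eps) * f'(theta_{t+1/2})
    delta = [step_rate * math.sqrt(sp + offset) / math.sqrt(gt + offset) * gr
             for sp, gt, gr in zip(s, g, grad)]
    # ASSUMPTION: the adaptive step is applied from the look-ahead point.
    theta = [th - d for th, d in zip(theta_half, delta)]
    # ASSUMPTION: s_t is updated from Delta theta_t, as in the basic form.
    s = [(1 - decay) * d ** 2 + decay * sp for d, sp in zip(delta, s)]
    return theta, delta, g, s


def toy_fprime(theta):
    """Gradient of the hypothetical loss f(theta) = sum(theta_i^2)."""
    return [2.0 * th for th in theta]


theta, delta, g, s = [1.0], [0.0], [0.0], [0.0]
for _ in range(3):
    theta, delta, g, s = adadelta_nesterov_step(theta, delta, g, s, toy_fprime)
```

With `momentum=0` the look-ahead point coincides with \(\theta_t\) and the step reduces to the basic form.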

In its original formulation, only the case \(\alpha = 1, \beta = 0\) was considered.

[zeiler2013adadelta] Zeiler, Matthew D. “ADADELTA: An adaptive learning rate method.” arXiv preprint arXiv:1212.5701 (2012).

Methods

__init__(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)

Create an Adadelta object.

Parameters:

wrt : array_like

Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.

fprime : callable

Callable that, given a solution vector as its first parameter and *args and **kwargs drawn from the iterator args, returns a search direction, such as a gradient.

step_rate : scalar or array_like, optional [default: 1]

Value to multiply steps with before they are applied to the parameter vector.

decay : float, optional [default: 0.9]

Decay parameter for the moving average. Must lie in [0, 1), where lower values mean a shorter “memory”.

momentum : float or array_like, optional [default: 0]

Momentum to use during optimization. Can be specified analogously to (but independently of) the step rate.

offset : float, optional [default: 1e-4]

Before taking the square root of the running averages, this offset is added.

args : iterable

Iterator over arguments which fprime will be called with.
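To illustrate how fprime and args fit together, the sketch below builds an argument iterator and performs the call the optimizer would make at one iteration. It does not import the library. The shape of each item, an `(args, kwargs)` pair, is an assumption inferred from the description of fprime above, and the loss, the data and the `scale` keyword are hypothetical.

```python
import itertools


def fprime(wrt, x, y, scale=1.0):
    """Gradient of the hypothetical loss 0.5 * scale * (wrt[0] * x - y) ** 2."""
    return [scale * (wrt[0] * x - y) * x]


# Hypothetical data set of (x, y) samples, presented one sample per iteration.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# ASSUMPTION: each item of the iterator is a pair (positional args, keyword args).
args = itertools.cycle([((x, y), {"scale": 1.0}) for x, y in data])

wrt = [0.0]  # parameter array; the optimizer would update it in place

# What a single iteration would do with the next item of the iterator.
call_args, call_kwargs = next(args)
grad = fprime(wrt, *call_args, **call_kwargs)
```

Cycling over the samples makes the iterator endless, so the number of iterations has to be bounded by the caller.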