class climin.adadelta.Adadelta(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)

Adadelta [zeiler2013adadelta] is a method that uses the magnitude of recent gradients and steps to obtain an adaptive step rate. Exponential moving averages over the squared gradients and squared steps are kept; a scale for the learning rate is then obtained from their ratio.

Let $$f'(\theta_t)$$ be the derivative of the loss with respect to the parameters at time step $$t$$. In its basic form, given a step rate $$\alpha$$, a decay term $$\gamma$$ and an offset $$\epsilon$$ we perform the following updates:

$\begin{split}g_t &=& (1 - \gamma)~f'(\theta_t)^2 + \gamma g_{t-1}\end{split}$

where $$g_0 = 0$$. With $$s_0 = 0$$, the parameters are then updated as follows:

$\begin{split}\Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_t), \\ \theta_{t+1} &=& \theta_t - \Delta \theta_t.\end{split}$

Subsequently we adapt the moving average of the steps:

$\begin{split}s_t &=& (1 - \gamma)~\Delta\theta_t^2 + \gamma s_{t-1}.\end{split}$
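The basic update above can be sketched in plain NumPy. This is a minimal illustration of the formulas, not climin's implementation; the function name and the return convention are chosen here for exposition:

```python
import numpy as np

def adadelta_step(theta, grad, g_avg, s_avg,
                  step_rate=1.0, decay=0.9, offset=1e-4):
    """One basic Adadelta update (momentum = 0).

    g_avg and s_avg are the moving averages of the squared gradients
    and the squared steps; both start at zero (g_0 = s_0 = 0).
    """
    # g_t = (1 - gamma) f'(theta_t)^2 + gamma g_{t-1}
    g_avg = (1 - decay) * grad ** 2 + decay * g_avg
    # Delta theta_t = alpha sqrt(s_{t-1} + eps) / sqrt(g_t + eps) f'(theta_t)
    step = step_rate * np.sqrt(s_avg + offset) / np.sqrt(g_avg + offset) * grad
    theta = theta - step
    # s_t = (1 - gamma) Delta theta_t^2 + gamma s_{t-1}
    s_avg = (1 - decay) * step ** 2 + decay * s_avg
    return theta, g_avg, s_avg
```

Note that because both averages start at zero, the very first step is damped by the offset alone and is therefore small; the effective step rate then adapts per parameter as the averages fill up.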

To extend this with Nesterov’s accelerated gradient, we need a momentum coefficient $$\beta$$ and incorporate it by using slightly different formulas:

$\begin{split}\theta_{t + {1 \over 2}} &=& \theta_t - \beta \Delta \theta_{t-1}, \\ g_t &=& (1 - \gamma)~f'(\theta_{t + {1 \over 2}})^2 + \gamma g_{t-1}, \\ \Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_{t + {1 \over 2}}).\end{split}$
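With Nesterov momentum the gradient is evaluated at a look-ahead point. A sketch of one such update, again in plain NumPy and not climin's code; the final parameter update $$\theta_{t+1} = \theta_{t + {1 \over 2}} - \Delta \theta_t$$ and the update of $$s_t$$ are not spelled out above and are assumed here to be analogous to the basic form:

```python
import numpy as np

def adadelta_nag_step(theta, fprime, g_avg, s_avg, prev_step,
                      step_rate=1.0, decay=0.9, momentum=0.9, offset=1e-4):
    """One Adadelta update with Nesterov momentum.

    fprime maps parameters to a gradient; prev_step is Delta theta_{t-1}.
    """
    # theta_{t+1/2} = theta_t - beta Delta theta_{t-1}: the look-ahead point
    theta_half = theta - momentum * prev_step
    grad = fprime(theta_half)
    # g_t uses the gradient taken at the look-ahead point
    g_avg = (1 - decay) * grad ** 2 + decay * g_avg
    # Delta theta_t = alpha sqrt(s_{t-1} + eps) / sqrt(g_t + eps) f'(theta_{t+1/2})
    step = step_rate * np.sqrt(s_avg + offset) / np.sqrt(g_avg + offset) * grad
    # Assumed: apply the step from the look-ahead point, then track its average
    theta = theta_half - step
    s_avg = (1 - decay) * step ** 2 + decay * s_avg
    return theta, g_avg, s_avg, step
```

For example, iterating this on the quadratic $$f(\theta) = \theta^2$$ with gradient $$2\theta$$ steadily drives the parameter toward zero.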

The original formulation considered only the case $$\alpha = 1, \beta = 0$$.

__init__(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)
Parameters:
    wrt : array_like
        Array that represents the solution. Will be operated upon in place. fprime should accept this array as its first argument.
    fprime : callable
        Callable that, given a solution vector as first parameter and *args and **kwargs drawn from the iterations of args, returns a search direction, such as a gradient.
    step_rate : scalar or array_like, optional [default: 1]
        Value to multiply steps with before they are applied to the parameter vector.
    decay : float, optional [default: 0.9]
        Decay parameter for the moving average. Must lie in [0, 1), where lower numbers mean a shorter “memory”.
    momentum : float or array_like, optional [default: 0]
        Momentum to use during optimization. Can be specified analogously to (but independently of) the step rate.
    offset : float, optional [default: 1e-4]
        Before taking the square root of the running averages, this offset is added.
    args : iterable
        Iterator over arguments which fprime will be called with.