Adadelta [zeiler2013adadelta] is a method that uses the magnitude of recent gradients and steps to obtain an adaptive step rate. Exponential moving averages of the squared gradients and of the squared steps are kept; the learning rate is then scaled by the ratio of their square roots.

Let $$f'(\theta_t)$$ be the derivative of the loss with respect to the parameters at time step $$t$$. In its basic form, given a step rate $$\alpha$$, a decay term $$\gamma$$ and an offset $$\epsilon$$, we first update the moving average of the squared gradients (squares are taken elementwise):

$\begin{split}g_t &=& (1 - \gamma)~f'(\theta_t)^2 + \gamma g_{t-1}\end{split}$

where $$g_0 = 0$$. With $$s_0 = 0$$, the parameters are then updated as follows:

$\begin{split}\Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_t), \\ \theta_{t+1} &=& \theta_t - \Delta \theta_t.\end{split}$

Subsequently, we update the moving average of the squared steps:

$\begin{split}s_t &=& (1 - \gamma)~\Delta\theta_t^2 + \gamma s_{t-1}.\end{split}$
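The three updates above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names `adadelta_step` and `minimize_quadratic`, the default values for `gamma` and `eps`, and the quadratic test loss are all assumptions made for the example. Parameters are plain lists of floats and every operation is elementwise.

```python
import math


def adadelta_step(theta, grad, g_prev, s_prev, alpha=1.0, gamma=0.9, eps=1e-6):
    """Apply one basic Adadelta update and return (theta, g, s).

    theta, grad, g_prev and s_prev are lists of floats of equal length.
    The defaults for gamma and eps are illustrative assumptions.
    """
    new_theta, g, s = [], [], []
    for th, gr, gp, sp in zip(theta, grad, g_prev, s_prev):
        # Moving average of the squared gradient.
        g_t = (1 - gamma) * gr ** 2 + gamma * gp
        # Step scaled by the ratio of the two root averages.
        delta = alpha * math.sqrt(sp + eps) / math.sqrt(g_t + eps) * gr
        new_theta.append(th - delta)
        g.append(g_t)
        # Moving average of the squared step.
        s.append((1 - gamma) * delta ** 2 + gamma * sp)
    return new_theta, g, s


def minimize_quadratic(n_steps=10):
    """Run Adadelta on the toy loss f(theta) = sum(theta_i ** 2)."""
    theta = [1.0, -2.0]
    g = [0.0, 0.0]  # g_0 = 0
    s = [0.0, 0.0]  # s_0 = 0
    for _ in range(n_steps):
        grad = [2.0 * th for th in theta]  # f'(theta) = 2 * theta
        theta, g, s = adadelta_step(theta, grad, g, s)
    return theta
```

Because $$s_0 = 0$$, the size of the first step is governed by $$\epsilon$$ alone, so the early steps in this sketch are small and grow as the average of squared steps accumulates.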

To extend this with Nesterov’s accelerated gradient, we need a momentum coefficient $$\beta$$ and incorporate it by using slightly different formulas:

$\begin{split}\theta_{t + {1 \over 2}} &=& \theta_t - \beta \Delta \theta_{t-1}, \\ g_t &=& (1 - \gamma)~f'(\theta_{t + {1 \over 2}})^2 + \gamma g_{t-1}, \\ \Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_{t + {1 \over 2}}).\end{split}$

In its original formulation, only the case $$\alpha = 1, \beta = 0$$ was considered.
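The Nesterov variant can be sketched along the same lines. The formulas above do not spell out the final parameter update or the update of $$s_t$$, so this example makes three assumptions, each flagged in the comments. First, the new parameters are taken to be $$\theta_{t + {1 \over 2}} - \Delta \theta_t$$. Second, $$s_t$$ is updated from $$\Delta \theta_t$$ exactly as in the basic form. Third, the value carried forward as $$\Delta \theta_{t-1}$$ is the previous Adadelta step. The names `adadelta_nesterov_step` and `grad_fn` are hypothetical, and the defaults for `gamma` and `eps` are illustrative.

```python
import math


def adadelta_nesterov_step(theta, grad_fn, g_prev, s_prev, step_prev,
                           alpha=1.0, beta=0.0, gamma=0.9, eps=1e-6):
    """One Adadelta update with a Nesterov-style look-ahead.

    grad_fn maps a list of parameters to a list of gradients.
    Returns (theta, g, s, step); step is reused as step_prev next time.
    """
    # Look-ahead point: theta_{t+1/2} = theta_t - beta * previous step.
    theta_half = [th - beta * d for th, d in zip(theta, step_prev)]
    # The gradient is evaluated at the look-ahead point, not at theta_t.
    grad = grad_fn(theta_half)

    new_theta, g, s, step = [], [], [], []
    for th, gr, gp, sp in zip(theta_half, grad, g_prev, s_prev):
        g_t = (1 - gamma) * gr ** 2 + gamma * gp
        delta = alpha * math.sqrt(sp + eps) / math.sqrt(g_t + eps) * gr
        # Assumption: the final update starts from the look-ahead point.
        new_theta.append(th - delta)
        g.append(g_t)
        # Assumption: s is updated from delta as in the basic form.
        s.append((1 - gamma) * delta ** 2 + gamma * sp)
        # Assumption: delta alone is carried forward as the previous step.
        step.append(delta)
    return new_theta, g, s, step
```

With the defaults `alpha=1.0` and `beta=0.0`, the look-ahead point coincides with the current parameters and the sketch reduces to the basic update.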