Adadelta¶
This module provides an implementation of adadelta.
class climin.adadelta.Adadelta(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)¶
Adadelta optimizer.
Adadelta [zeiler2013adadelta] is a method that uses the magnitude of recent gradients and steps to obtain an adaptive step rate. Exponential moving averages over the gradients and steps are kept; a scale for the learning rate is then obtained from their ratio.
Let \(f'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\), a decay term \(\gamma\) and an offset \(\epsilon\) we perform the following updates:
\[\begin{split}g_t &=& (1 - \gamma)~f'(\theta_t)^2 + \gamma g_{t-1}\end{split}\]
where \(g_0 = 0\). Let \(s_0 = 0\) for updating the parameters:
\[\begin{split}\Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_t), \\ \theta_{t+1} &=& \theta_t - \Delta \theta_t.\end{split}\]
Subsequently we adapt the moving average of the steps:
\[\begin{split}s_t &=& (1 - \gamma)~\Delta\theta_t^2 + \gamma s_{t-1}.\end{split}\]
To extend this with Nesterov’s accelerated gradient, we need a momentum coefficient \(\beta\) and incorporate it by using slightly different formulas:
\[\begin{split}\theta_{t + {1 \over 2}} &=& \theta_t - \beta \Delta \theta_{t-1}, \\ g_t &=& (1 - \gamma)~f'(\theta_{t + {1 \over 2}})^2 + \gamma g_{t-1}, \\ \Delta \theta_t &=& \alpha {\sqrt{s_{t-1} + \epsilon} \over \sqrt{g_t + \epsilon}}~f'(\theta_{t + {1 \over 2}}).\end{split}\]
In its original formulation, only the case \(\alpha = 1, \beta = 0\) was considered.
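The basic update rules above can be sketched in NumPy. This is an illustrative reimplementation of the formulas, not climin's actual code; the function name `adadelta_step` and its state-passing style are choices made here for clarity.

```python
import numpy as np

def adadelta_step(theta, grad, g_avg, s_avg,
                  step_rate=1.0, decay=0.9, offset=1e-4):
    """One Adadelta update following the formulas above (illustrative sketch).

    g_avg -- moving average of squared gradients (g_t)
    s_avg -- moving average of squared steps (s_t)
    """
    # g_t = (1 - gamma) f'(theta_t)^2 + gamma g_{t-1}
    g_avg = (1 - decay) * grad ** 2 + decay * g_avg
    # Delta theta_t = alpha * sqrt(s_{t-1} + eps) / sqrt(g_t + eps) * f'(theta_t)
    step = step_rate * np.sqrt(s_avg + offset) / np.sqrt(g_avg + offset) * grad
    # theta_{t+1} = theta_t - Delta theta_t
    theta = theta - step
    # s_t = (1 - gamma) Delta theta_t^2 + gamma s_{t-1}
    s_avg = (1 - decay) * step ** 2 + decay * s_avg
    return theta, g_avg, s_avg
```

Note that the very first step is scaled by \(\sqrt{\epsilon}\) alone, since \(s_0 = 0\); the offset is what gets the recursion off the ground.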
[zeiler2013adadelta] Zeiler, Matthew D. “ADADELTA: An adaptive learning rate method.” arXiv preprint arXiv:1212.5701 (2012).

Methods
__init__(wrt, fprime, step_rate=1, decay=0.9, momentum=0, offset=0.0001, args=None)¶
Create an Adadelta object.
Parameters: wrt : array_like
Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.
fprime : callable
Callable that returns the derivative of the loss with respect to wrt. Called with wrt as its first argument, followed by the arguments drawn from args.
step_rate : scalar or array_like, optional [default: 1]
Value to multiply steps with before they are applied to the parameter vector.
decay : float, optional [default: 0.9]
Decay parameter for the moving averages. Must lie in [0, 1), where lower values mean a shorter “memory”.
momentum : float or array_like, optional [default: 0]
Momentum to use during optimization. Can be specified analogously to (but independently of) step_rate.
offset : float, optional, [default: 1e-4]
Before taking the square root of the running averages, this offset is added.
args : iterable
Iterator over arguments which fprime will be called with.
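To see how these parameters interact, including the momentum look-ahead, here is a minimal self-contained optimization loop. It mirrors the parameter names above but is a sketch under the formulas in this document, not climin's implementation; the function `adadelta_minimize` is hypothetical.

```python
import numpy as np

def adadelta_minimize(wrt, fprime, n_iter=200, step_rate=1.0,
                      decay=0.9, momentum=0.0, offset=1e-4):
    """Minimal Adadelta loop mirroring the parameters above (illustrative,
    not climin's code). Operates on wrt in place, like climin does."""
    g_avg = np.zeros_like(wrt)  # moving average of squared gradients
    s_avg = np.zeros_like(wrt)  # moving average of squared steps
    step = np.zeros_like(wrt)   # previous step, for the momentum look-ahead
    for _ in range(n_iter):
        # Nesterov look-ahead: gradient at theta_t - beta * Delta theta_{t-1}
        grad = fprime(wrt - momentum * step)
        g_avg = (1 - decay) * grad ** 2 + decay * g_avg
        # total step = momentum part + adaptively scaled gradient part
        step = (momentum * step
                + step_rate * np.sqrt(s_avg + offset)
                / np.sqrt(g_avg + offset) * grad)
        wrt -= step
        s_avg = (1 - decay) * step ** 2 + decay * s_avg
    return wrt
```

With momentum=0 this reduces to the basic update; a nonzero step_rate or momentum given as an array applies per-parameter scaling via broadcasting.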