Adam
This module provides an implementation of Adam.
class climin.adam.Adam(wrt, fprime, step_rate=0.0002, decay=None, decay_mom1=0.1, decay_mom2=0.001, momentum=0, offset=1e-08, args=None)
Adaptive moment estimation optimizer (Adam).
Adam is a method for the optimization of stochastic objective functions.
The idea is to estimate the first two moments of the gradient with exponentially decaying running averages. Additionally, these estimates are bias corrected, which improves the initial learning steps, since both estimates are initialized with zeros.
The rest of the documentation follows the original paper [adam2014] and is only meant as a quick primer. We refer to the original source for more details, such as results on convergence and discussion of the various hyperparameters.
Let \(f_t'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\), decay terms \(\beta_1\) and \(\beta_2\) for the first and second moment estimates respectively and an offset \(\epsilon\) we initialise the following quantities
\[\begin{split}m_0 & \leftarrow 0 \\ v_0 & \leftarrow 0 \\ t & \leftarrow 0\end{split}\]
and perform the following updates:
\[\begin{split}t & \leftarrow t + 1 \\ g_t & \leftarrow f_t'(\theta_{t-1}) \\ m_t & \leftarrow \beta_1 \cdot g_t + (1 - \beta_1) \cdot m_{t-1} \\ v_t &\leftarrow \beta_2 \cdot g_t^2 + (1 - \beta_2) \cdot v_{t-1}\end{split}\]
\[\begin{split}\hat{m}_t &\leftarrow {m_t \over (1 - (1 - \beta_1)^t)} \\ \hat{v}_t &\leftarrow {v_t \over (1 - (1 - \beta_2)^t)} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha {\hat{m}_t \over (\sqrt{\hat{v}_t} + \epsilon)}\end{split}\]
As suggested in the original paper, the last three steps are optimized for efficiency by using:
\[\begin{split}\alpha_t &\leftarrow \alpha {\sqrt{(1 - (1 - \beta_2)^t)} \over (1 - (1 - \beta_1)^t)} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha_t {m_t \over (\sqrt{v_t} + \epsilon)}\end{split}\]
The quantities in the algorithm and their corresponding attributes in the optimizer object are as follows.
=============== ============ ==========================================================
Symbol          Attribute    Meaning
=============== ============ ==========================================================
\(t\)           n_iter       Number of iterations, starting at 0.
\(m_t\)         est_mom_1_b  Biased estimate of first moment.
\(v_t\)         est_mom_2_b  Biased estimate of second moment.
\(\hat{m}_t\)   est_mom_1    Unbiased estimate of first moment.
\(\hat{v}_t\)   est_mom_2    Unbiased estimate of second moment.
\(\alpha\)      step_rate    Step rate parameter.
\(\beta_1\)     decay_mom1   Exponential decay parameter for first moment estimate.
\(\beta_2\)     decay_mom2   Exponential decay parameter for second moment estimate.
\(\epsilon\)    offset       Safety offset for division by estimate of second moment.
=============== ============ ==========================================================

Additionally, using Nesterov momentum is possible by setting the momentum attribute of the optimizer to a value other than 0. We apply the momentum step before computing the gradient, resulting in a similar incorporation of Nesterov momentum in Adam as presented in [nadam2015].
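The basic update rules (in their efficient form, without the optional momentum term) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name adam_step is hypothetical, parameters are held in Python lists rather than arrays, the decay argument is ignored, and the quadratic objective and the step rate of 0.01 are chosen only for the demonstration.

```python
import math


def adam_step(theta, grad, m, v, t, step_rate=0.0002,
              decay_mom1=0.1, decay_mom2=0.001, offset=1e-8):
    """Apply one Adam update in place and return the new step counter.

    Hypothetical helper for illustration. theta, grad, m and v are
    equal-length lists of floats; theta, m and v are modified in place.
    The decay parameters follow this page's convention (the weight put on
    the newest gradient), not the convention of the original paper.
    """
    t += 1
    # Fold both bias corrections into the step rate (alpha_t).
    alpha_t = (step_rate * math.sqrt(1 - (1 - decay_mom2) ** t)
               / (1 - (1 - decay_mom1) ** t))
    for i, g in enumerate(grad):
        # Biased running averages of the gradient and the squared gradient.
        m[i] = decay_mom1 * g + (1 - decay_mom1) * m[i]
        v[i] = decay_mom2 * g * g + (1 - decay_mom2) * v[i]
        theta[i] -= alpha_t * m[i] / (math.sqrt(v[i]) + offset)
    return t


# Minimise f(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
t = 0
for _ in range(2000):
    grad = [2.0 * x for x in theta]
    t = adam_step(theta, grad, m, v, t, step_rate=0.01)
```

In this sketch the very first update moves each parameter by almost exactly the step rate, because the bias corrections cancel the small initial values of the running averages. After the loop, theta should end up close to the minimum at zero.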
Note
The use of decay parameters \(\beta_1\) and \(\beta_2\) differs from the definition in the original paper [adam2014]: with \(\beta^{\ast}_i\) referring to the parameters as defined in the paper, we use \(\beta_i\) with \(\beta_i = 1 - \beta^{\ast}_i\).
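As a short worked check of this convention (derived from the update rules above, not a statement about further library behaviour), substituting \(\beta_1 = 1 - \beta^{\ast}_1\) into the first moment update gives the form used in the paper:
\[\begin{split}m_t & \leftarrow \beta_1 \cdot g_t + (1 - \beta_1) \cdot m_{t-1} \\ & = (1 - \beta^{\ast}_1) \cdot g_t + \beta^{\ast}_1 \cdot m_{t-1}\end{split}\]
The values \(\beta^{\ast}_1 = 0.9\) and \(\beta^{\ast}_2 = 0.999\) suggested in the paper therefore correspond to the defaults decay_mom1 = 0.1 and decay_mom2 = 0.001 used here.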
[adam2014] Kingma, Diederik, and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980 (2014).
[nadam2015] Dozat, Timothy. “Incorporating Nesterov Momentum into Adam.” Stanford University, Tech. Rep. (2015).
Methods
__init__(wrt, fprime, step_rate=0.0002, decay=None, decay_mom1=0.1, decay_mom2=0.001, momentum=0, offset=1e-08, args=None)
Create an Adam object.
Parameters:
wrt : array_like
Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.
fprime : callable
step_rate : scalar or array_like, optional [default: 0.0002]
Value to multiply steps with before they are applied to the parameter vector.
decay_mom1 : float, optional, [default: 0.1]
Decay parameter for the exponential moving average estimate of the first moment.
decay_mom2 : float, optional, [default: 0.001]
Decay parameter for the exponential moving average estimate of the second moment.
momentum : float or array_like, optional [default: 0]
Momentum to use during optimization. Can be specified analogously to (but independently of) the step rate.
offset : float, optional, [default: 1e-8]
Before taking the square root of the running averages, this offset is added.
args : iterable
Iterator over arguments which fprime will be called with.
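How wrt, fprime and args fit together can be sketched with a minibatch iterator. This is an illustration only and does not use the library: it assumes that each item drawn from args is an (args, kwargs) pair that is unpacked after wrt when fprime is called, which this page does not state explicitly. The names minibatch_args and the toy linear model are hypothetical.

```python
import itertools


def minibatch_args(inputs, targets, batch_size):
    """Yield (args, kwargs) pairs, cycling endlessly over minibatches.

    Hypothetical helper; the (args, kwargs) item shape is an assumption.
    """
    starts = range(0, len(inputs), batch_size)
    for start in itertools.cycle(starts):
        batch = (inputs[start:start + batch_size],
                 targets[start:start + batch_size])
        yield batch, {}


def fprime(wrt, x_batch, y_batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    w = wrt[0]
    n = len(x_batch)
    return [sum(2.0 * (w * x - y) * x for x, y in zip(x_batch, y_batch)) / n]


args = minibatch_args([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], batch_size=2)
wrt = [0.0]
# Draw three items by hand, the way an optimizer loop would consume them.
for batch_args, batch_kwargs in itertools.islice(args, 3):
    grad = fprime(wrt, *batch_args, **batch_kwargs)
```

Because the iterator never ends, the number of optimization steps has to be bounded by the caller, here with itertools.islice.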