This module provides an implementation of Adam.

Adam is a method for the optimization of stochastic objective functions.

The idea is to estimate the first two moments of the gradient with exponentially decaying running averages. Additionally, these estimates are bias-corrected, which improves behaviour during the initial learning steps, since both estimates are initialised with zeros.

The rest of the documentation follows the original paper [adam2014] and is only meant as a quick primer. We refer the reader to the original source for more details, such as results on convergence and discussion of the various hyperparameters.

Let $$f_t'(\theta_t)$$ be the derivative of the loss with respect to the parameters at time step $$t$$. In its basic form, given a step rate $$\alpha$$, decay terms $$\beta_1$$ and $$\beta_2$$ for the first and second moment estimates respectively, and an offset $$\epsilon$$, we initialise the following quantities

$\begin{split}m_0 & \leftarrow 0 \\ v_0 & \leftarrow 0 \\ t & \leftarrow 0 \\\end{split}$

In each iteration, we then update

$\begin{split}t & \leftarrow t + 1 \\ g_t & \leftarrow f_t'(\theta_{t-1}) \\ m_t & \leftarrow \beta_1 \cdot g_t + (1 - \beta_1) \cdot m_{t-1} \\ v_t &\leftarrow \beta_2 \cdot g_t^2 + (1 - \beta_2) \cdot v_{t-1}\end{split}$

and apply the bias corrections followed by the parameter update:

$\begin{split}\hat{m}_t &\leftarrow {m_t \over 1 - (1 - \beta_1)^t} \\ \hat{v}_t &\leftarrow {v_t \over 1 - (1 - \beta_2)^t} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha {\hat{m}_t \over \sqrt{\hat{v}_t} + \epsilon}\end{split}$
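The per-iteration updates can be sketched in NumPy as follows. This is an illustrative sketch, not the optimizer's actual code; the keyword defaults mirror those of the module's constructor, and the decay parameters follow the module's convention (complements of the paper's $$\beta^{\ast}_i$$):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, step_rate=0.0002,
              decay_mom1=0.1, decay_mom2=0.001, offset=1e-8):
    """One Adam iteration, following the updates above.

    Note the module's convention: decay_mom1/decay_mom2 are the
    *complements* of the paper's beta_1/beta_2 (beta_i = 1 - beta*_i).
    """
    t += 1
    m = decay_mom1 * grad + (1 - decay_mom1) * m         # biased 1st moment
    v = decay_mom2 * grad ** 2 + (1 - decay_mom2) * v    # biased 2nd moment
    m_hat = m / (1 - (1 - decay_mom1) ** t)              # bias corrections
    v_hat = v / (1 - (1 - decay_mom2) ** t)
    theta = theta - step_rate * m_hat / (np.sqrt(v_hat) + offset)
    return theta, m, v, t
```

Note that the effective step magnitude is roughly bounded by the step rate, since $$\hat{m}_t / \sqrt{\hat{v}_t}$$ is approximately of unit scale.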

As suggested in the original paper, the last three steps are optimized for efficiency by folding both bias corrections into the step rate:

$\begin{split}\alpha_t &\leftarrow \alpha {\sqrt{(1 - (1 - \beta_2)^t)} \over (1 - (1 - \beta_1)^t)} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha_t {m_t \over (\sqrt{v_t} + \epsilon)}\end{split}$
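The fused form can be sketched in the same style. Again an illustrative sketch, not the library's code; for $$\epsilon \to 0$$ it coincides exactly with the unfused update, since the corrections only rescale the step:

```python
import numpy as np

def adam_step_fused(theta, grad, m, v, t, step_rate=0.0002,
                    decay_mom1=0.1, decay_mom2=0.001, offset=1e-8):
    # Same moment updates as before, but both bias corrections are
    # folded into a time-dependent step rate alpha_t.
    t += 1
    m = decay_mom1 * grad + (1 - decay_mom1) * m
    v = decay_mom2 * grad ** 2 + (1 - decay_mom2) * v
    alpha_t = (step_rate * np.sqrt(1 - (1 - decay_mom2) ** t)
               / (1 - (1 - decay_mom1) ** t))
    theta = theta - alpha_t * m / (np.sqrt(v) + offset)
    return theta, m, v, t
```

The two variants differ only in where $$\epsilon$$ enters (next to $$\sqrt{v_t}$$ rather than $$\sqrt{\hat{v}_t}$$), which is negligible for the small offsets used in practice.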

The quantities in the algorithm and their corresponding attributes in the optimizer object are as follows.

Symbol          Attribute      Meaning
$$t$$           n_iter         Number of iterations, starting at 0.
$$m_t$$         est_mom_1_b    Biased estimate of first moment.
$$v_t$$         est_mom_2_b    Biased estimate of second moment.
$$\hat{m}_t$$   est_mom_1      Unbiased estimate of first moment.
$$\hat{v}_t$$   est_mom_2      Unbiased estimate of second moment.
$$\alpha$$      step_rate      Step rate parameter.
$$\beta_1$$     decay_mom1     Exponential decay parameter for first moment estimate.
$$\beta_2$$     decay_mom2     Exponential decay parameter for second moment estimate.
$$\epsilon$$    offset         Safety offset for division by estimate of second moment.

Additionally, Nesterov momentum can be used by setting the momentum attribute of the optimizer to a value other than 0. We apply the momentum step before computing the gradient, resulting in a similar incorporation of Nesterov momentum into Adam as presented in [nadam2015].
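One possible reading of "momentum step before computing the gradient" can be sketched as a look-ahead step, in the spirit of Nesterov momentum. This is a hypothetical illustration, not the optimizer's actual code; `step` here is an assumed extra state variable holding the previous total displacement:

```python
import numpy as np

def adam_nesterov_step(theta, grad_fn, m, v, step, t,
                       step_rate=0.0002, decay_mom1=0.1,
                       decay_mom2=0.001, momentum=0.9, offset=1e-8):
    # Illustrative sketch: take the momentum part of the step first,
    # evaluate the gradient at the resulting look-ahead point, then
    # apply the usual (fused) Adam update there.
    theta = theta + momentum * step        # momentum step first
    grad = grad_fn(theta)                  # gradient at look-ahead point
    t += 1
    m = decay_mom1 * grad + (1 - decay_mom1) * m
    v = decay_mom2 * grad ** 2 + (1 - decay_mom2) * v
    alpha_t = (step_rate * np.sqrt(1 - (1 - decay_mom2) ** t)
               / (1 - (1 - decay_mom1) ** t))
    adam_part = -alpha_t * m / (np.sqrt(v) + offset)
    theta = theta + adam_part
    step = momentum * step + adam_part     # total displacement this iteration
    return theta, m, v, step, t
```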

Note

The use of the decay parameters $$\beta_1$$ and $$\beta_2$$ differs from the definition in the original paper [adam2014]: with $$\beta^{\ast}_i$$ referring to the parameters as defined in the paper, we use $$\beta_i$$ with $$\beta_i = 1 - \beta^{\ast}_i$$.
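Concretely, the defaults recommended in the paper translate to this module's defaults via that complement:

```python
# Paper defaults (Kingma & Ba): beta1* = 0.9, beta2* = 0.999.
beta1_star, beta2_star = 0.9, 0.999
decay_mom1 = 1 - beta1_star   # ~0.1,   the module's default for decay_mom1
decay_mom2 = 1 - beta2_star   # ~0.001, the module's default for decay_mom2
```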

 [adam2014] Kingma, Diederik, and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980 (2014).
 [nadam2015] Dozat, Timothy. “Incorporating Nesterov Momentum into Adam.” Stanford University, Tech. Rep. (2015).

Methods

__init__(wrt, fprime, step_rate=0.0002, decay=None, decay_mom1=0.1, decay_mom2=0.001, momentum=0, offset=1e-08, args=None)
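A minimal sketch of how an object with this interface might be used and implemented, using the attribute names from the table above. This is an assumption-laden illustration, not the library's implementation: it covers only the momentum=0 path and omits the `decay` and `args` machinery; `wrt` is taken to be a parameter array updated in place, and `fprime(wrt)` to return the gradient:

```python
import numpy as np

class AdamSketch:
    """Sketch of the interface above (momentum and decay paths omitted)."""

    def __init__(self, wrt, fprime, step_rate=0.0002,
                 decay_mom1=0.1, decay_mom2=0.001, offset=1e-8):
        self.wrt, self.fprime = wrt, fprime
        self.step_rate, self.offset = step_rate, offset
        self.decay_mom1, self.decay_mom2 = decay_mom1, decay_mom2
        self.est_mom_1_b = np.zeros_like(wrt)   # m_t, biased 1st moment
        self.est_mom_2_b = np.zeros_like(wrt)   # v_t, biased 2nd moment
        self.n_iter = 0                         # t

    def step(self):
        grad = self.fprime(self.wrt)
        self.n_iter += 1
        b1, b2, t = self.decay_mom1, self.decay_mom2, self.n_iter
        self.est_mom_1_b = b1 * grad + (1 - b1) * self.est_mom_1_b
        self.est_mom_2_b = b2 * grad ** 2 + (1 - b2) * self.est_mom_2_b
        alpha_t = (self.step_rate * np.sqrt(1 - (1 - b2) ** t)
                   / (1 - (1 - b1) ** t))
        # Fused update, modifying the parameter array in place.
        self.wrt -= alpha_t * self.est_mom_1_b / (
            np.sqrt(self.est_mom_2_b) + self.offset)
```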