Adam

This module provides an implementation of Adam.

class climin.adam.Adam(wrt, fprime, step_rate=0.0002, decay=None, decay_mom1=0.1, decay_mom2=0.001, momentum=0, offset=1e-08, args=None)

Adaptive Moment Estimation (Adam) optimizer.

Adam is a method for the optimization of stochastic objective functions.

The idea is to estimate the first two moments of the gradient with exponentially decaying running averages. Additionally, these estimates are bias-corrected, which improves behaviour during the initial learning steps, since both estimates are initialized with zeros.

The rest of the documentation follows the original paper [adam2014] and is only meant as a quick primer. We refer to the original source for more details, such as results on convergence and a discussion of the various hyperparameters.

Let \(f_t'(\theta_t)\) be the derivative of the loss with respect to the parameters at time step \(t\). In its basic form, given a step rate \(\alpha\), decay terms \(\beta_1\) and \(\beta_2\) for the first and second moment estimates respectively and an offset \(\epsilon\) we initialise the following quantities

\[\begin{split}m_0 & \leftarrow 0 \\ v_0 & \leftarrow 0 \\ t & \leftarrow 0 \\\end{split}\]

and perform the following updates:

\[\begin{split}t & \leftarrow t + 1 \\ g_t & \leftarrow f_t'(\theta_{t-1}) \\ m_t & \leftarrow \beta_1 \cdot g_t + (1 - \beta_1) \cdot m_{t-1} \\ v_t & \leftarrow \beta_2 \cdot g_t^2 + (1 - \beta_2) \cdot v_{t-1} \\ \hat{m}_t & \leftarrow {m_t \over (1 - (1 - \beta_1)^t)} \\ \hat{v}_t & \leftarrow {v_t \over (1 - (1 - \beta_2)^t)} \\ \theta_t & \leftarrow \theta_{t-1} - \alpha {\hat{m}_t \over (\sqrt{\hat{v}_t} + \epsilon)}\end{split}\]

As suggested in the original paper, the last three steps can be computed more efficiently by using:

\[\begin{split}\alpha_t &\leftarrow \alpha {\sqrt{(1 - (1 - \beta_2)^t)} \over (1 - (1 - \beta_1)^t)} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha_t {m_t \over (\sqrt{v_t} + \epsilon)}\end{split}\]
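
To make the update concrete, here is a minimal NumPy sketch of one iteration in this efficient form. It illustrates the formulas only and is not the climin implementation; the function name adam_step is made up for this example, while the argument names mirror the attributes listed below.

    import numpy as np

    def adam_step(wrt, gradient, m, v, t, step_rate=0.0002,
                  decay_mom1=0.1, decay_mom2=0.001, offset=1e-8):
        # One Adam update in the efficient form above. Follows this
        # document's convention: decay_mom1/decay_mom2 weight the *new*
        # gradient, i.e. beta_i = 1 - beta_i^* of the original paper.
        t += 1
        m = decay_mom1 * gradient + (1 - decay_mom1) * m
        v = decay_mom2 * gradient ** 2 + (1 - decay_mom2) * v
        step_rate_t = step_rate * (np.sqrt(1 - (1 - decay_mom2) ** t)
                                   / (1 - (1 - decay_mom1) ** t))
        wrt -= step_rate_t * m / (np.sqrt(v) + offset)
        return m, v, t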

The quantities in the algorithm and their corresponding attributes in the optimizer object are as follows.

Symbol           Attribute     Meaning
\(t\)            n_iter        Number of iterations, starting at 0.
\(m_t\)          est_mom_1_b   Biased estimate of the first moment.
\(v_t\)          est_mom_2_b   Biased estimate of the second moment.
\(\hat{m}_t\)    est_mom_1     Bias-corrected estimate of the first moment.
\(\hat{v}_t\)    est_mom_2     Bias-corrected estimate of the second moment.
\(\alpha\)       step_rate     Step rate parameter.
\(\beta_1\)      decay_mom1    Exponential decay parameter for the first moment estimate.
\(\beta_2\)      decay_mom2    Exponential decay parameter for the second moment estimate.
\(\epsilon\)     offset        Safety offset for the division by the second moment estimate.

Additionally, Nesterov momentum can be used by setting the momentum attribute of the optimizer to a value other than 0. We apply the momentum step before computing the gradient, which incorporates Nesterov momentum into Adam in a way similar to [nadam2015]; a sketch follows below.
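
How the look-ahead interacts with the update can be read as follows. This is an illustrative sketch, not the exact climin code: the name adam_nesterov_step and the bookkeeping of step_m1 (the previous full step) are assumptions made for this example.

    import numpy as np

    def adam_nesterov_step(wrt, fprime, m, v, t, step_m1, momentum,
                           step_rate=0.0002, decay_mom1=0.1,
                           decay_mom2=0.001, offset=1e-8):
        # Look-ahead: apply the momentum step before evaluating the gradient.
        step1 = momentum * step_m1
        wrt -= step1
        gradient = fprime(wrt)
        # Ordinary Adam updates at the looked-ahead parameters.
        t += 1
        m = decay_mom1 * gradient + (1 - decay_mom1) * m
        v = decay_mom2 * gradient ** 2 + (1 - decay_mom2) * v
        step_rate_t = step_rate * (np.sqrt(1 - (1 - decay_mom2) ** t)
                                   / (1 - (1 - decay_mom1) ** t))
        step2 = step_rate_t * m / (np.sqrt(v) + offset)
        wrt -= step2
        return m, v, t, step1 + step2  # full step, fed back as step_m1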

Note

The use of the decay parameters \(\beta_1\) and \(\beta_2\) differs from the definition in the original paper [adam2014]: with \(\beta^{\ast}_i\) referring to the parameters as defined in the paper, we use \(\beta_i = 1 - \beta^{\ast}_i\).
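
For example, the defaults suggested in the original paper, \(\beta^{\ast}_1 = 0.9\) and \(\beta^{\ast}_2 = 0.999\), translate to this class's arguments as follows:

    decay_mom1 = 1 - 0.9    # = 0.1, the default of decay_mom1
    decay_mom2 = 1 - 0.999  # = 0.001, the default of decay_mom2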

[adam2014] Kingma, Diederik, and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980 (2014).
[nadam2015] Dozat, Timothy. “Incorporating Nesterov Momentum into Adam.” Stanford University, Tech. Rep. (2015).

Methods

__init__(wrt, fprime, step_rate=0.0002, decay=None, decay_mom1=0.1, decay_mom2=0.001, momentum=0, offset=1e-08, args=None)

Create an Adam object.

Parameters:

wrt : array_like

Array that represents the solution. Will be operated upon in place. fprime should accept this array as a first argument.

fprime : callable

Callable that, given a solution vector as its first argument and *args and **kwargs drawn from the iterations of args, returns a search direction, such as a gradient.

step_rate : scalar or array_like, optional [default: 0.0002]

Value to multiply steps with before they are applied to the parameter vector.

decay_mom1 : float, optional, [default: 0.1]

Decay parameter for the exponential moving average estimate of the first moment.

decay_mom2 : float, optional, [default: 0.001]

Decay parameter for the exponential moving average estimate of the second moment.

momentum : float or array_like, optional [default: 0]

Momentum to use during optimization. Can be specified analogously to (but independently of) the step rate.

offset : float, optional, [default: 1e-8]

Offset added to the square root of the second moment estimate in the denominator of the update, as in the equations above, to guard against division by zero.

args : iterable

Iterator over arguments which fprime will be called with.
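
As a sketch of how the pieces fit together, the following hypothetical example minimizes a simple quadratic. It assumes the usual climin protocol: each item of args is a pair of a tuple of positional arguments and a dictionary of keyword arguments, and iterating over the optimizer yields info dictionaries that carry, among other things, n_iter.

    import itertools

    import numpy as np
    from climin import Adam

    def fprime(wrt):
        # Gradient of the toy loss f(x) = 0.5 * ||x||^2.
        return wrt

    wrt = np.random.standard_normal(10)
    # fprime needs no extra data here, so each iteration gets empty arguments.
    args = itertools.repeat(((), {}))

    opt = Adam(wrt, fprime, step_rate=0.0002, args=args)
    for info in opt:
        if info['n_iter'] >= 1000:
            break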