adam-betas

Adam - m and v.

In Adam, the parameters are updated using adaptive estimates of the first and second moments of the gradients. These estimates are computed using the following recursive formulas:

$$m_{t} = \beta_{1}\,m_{t-1} + \left(1-\beta_{1}\right)g_{t}$$
$$v_{t} = \beta_{2}\,v_{t-1} + \left(1-\beta_{2}\right)g_{t}^{2}$$

Here’s what each term means (a short code sketch of both updates follows this list):

  1. \(g_t\):
    This is the gradient of the loss with respect to the parameters at the current time step $t$.

  2. \(m_t\) (First Moment Estimate):
    • Think of $m_t$ as a running (exponential) average of the gradients.
    • It is analogous to the “momentum” term in gradient descent with momentum. In standard momentum, you update a velocity vector that smooths the gradient updates over time.
    • In Adam, $\beta_1$ (the first value in the betas tuple) controls how much weight is given to the past gradients versus the current gradient. A value of $\beta_1=0.5$ means that the current gradient $g_t$ has more influence (the running average is more “responsive”) compared to a higher value like 0.9.
  3. \(v_t\) (Second Moment Estimate):
    • \(v_t\) is a running average of the squared gradients. It essentially tracks the uncentered variance (or the “second moment”) of the gradients.
    • This is similar in spirit to what RMSProp does by keeping track of the magnitude of recent gradients.
    • The hyperparameter $\beta_2$ (here, 0.999) controls the decay rate for this average. A high value means that $v_t$ changes slowly, providing a stable estimate of the gradient’s variance.
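
To make the two recursions concrete, here is a minimal Python sketch of the moment updates (the function name `update_moments` and the toy gradient values are made up for illustration; this is not the code of any particular library):

```python
import numpy as np

def update_moments(m, v, grad, beta1=0.5, beta2=0.999):
    """One step of the two exponential moving averages (no bias correction yet)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of the squared gradients
    return m, v

# Both averages start at zero and get pulled toward the incoming gradient.
m, v = np.zeros(3), np.zeros(3)
grad = np.array([0.1, -0.2, 0.3])             # toy gradient, purely illustrative
m, v = update_moments(m, v, grad)
print(m)  # ~[0.05, -0.1, 0.15]: half the gradient, because beta1 = 0.5
print(v)  # ~[1e-05, 4e-05, 9e-05]: a tiny fraction of grad**2, because beta2 = 0.999
```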

How These Connect to Gradient Descent with Momentum and Adam

  • Momentum Analogy:
    In gradient descent with momentum, you update a velocity vector (written here as $u_t$ to avoid a clash with Adam's second moment $v_t$): \(u_t = \mu\, u_{t-1} + (1-\mu)\, g_t\), where \(\mu\) (often around 0.9) plays a role very similar to $\beta_1$ in Adam. (Classical momentum usually omits the $(1-\mu)$ factor, but the smoothing effect is the same.) In Adam, this running average is called $m_t$ to denote that it is an estimate of the first moment (the mean) of the gradients.

  • Adaptive Learning Rates:
    The second moment estimate $v_t$ is used to adjust the learning rate for each parameter individually. By keeping track of the squared gradients, Adam can divide by the square root of $v_t$ (plus a small $\epsilon$ for numerical stability) so that parameters with consistently large or noisy gradients take smaller steps. This adaptive scaling is a key feature of Adam.

  • Bias Correction:
    Since both $m_t$ and $v_t$ are initialized at zero, they are biased towards zero during the initial steps. Adam includes bias-correction terms: \(\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\) These corrected values are then used in the parameter update: \(\theta_{t+1} = \theta_{t} - \text{lr} \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\)
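
Putting the moment updates, the bias correction, and the adaptive scaling together, a single Adam step can be sketched as follows (names and defaults are illustrative; real implementations such as torch.optim.Adam add details like weight decay and AMSGrad):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters `theta` at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad             # update first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # update second moment estimate
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                   # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive parameter update
    return theta, m, v
```

Calling `adam_step` repeatedly with t = 1, 2, 3, … reproduces the update rule above; t must start at 1 so that the bias-correction denominators are non-zero.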


Summary

  • \(m_t\):
    • It’s the exponential moving average of the gradients (the first moment).
    • It functions like momentum, smoothing the gradient updates.
    • The parameter $\beta_1$ determines how much past gradients influence the current average; a lower value (e.g., 0.5) means the update is more sensitive to the current gradient.
  • \(v_t\):
    • It’s the exponential moving average of the squared gradients (the second moment).
    • It estimates the variance (or magnitude) of the gradients.
    • The parameter \(\beta_2\) is typically set close to 1 (e.g., 0.999), which makes $v_t$ a very smooth, long-term estimate.

Together, these moving averages allow Adam to adjust the learning rate for each parameter individually, making it more robust to noisy gradients and often improving convergence in practice. This combination of momentum-like smoothing (via \(m_t\)) and adaptive scaling (via \(v_t\)) is what makes Adam particularly effective for training deep networks.
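
In PyTorch, these two decay rates are what the `betas` argument of `torch.optim.Adam` controls. Below is a minimal usage sketch with the $\beta_1 = 0.5$, $\beta_2 = 0.999$ values discussed above (the linear model and random data are arbitrary toy choices):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # toy model, just for illustration
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,
                             betas=(0.5, 0.999),  # (beta1, beta2): a more "responsive" first moment
                             eps=1e-8)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                              # m_t and v_t are maintained internally per parameter
```

The betas tuple maps directly onto $\beta_1$ and $\beta_2$ in the formulas above; lowering the first value makes the momentum term track recent gradients more aggressively.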


Intuition behind the equations

$$m_{t} = \beta_{1}\,m_{t-1} + \left(1-\beta_{1}\right)g_{t}$$
$$v_{t} = \beta_{2}\,v_{t-1} + \left(1-\beta_{2}\right)g_{t}^{2}$$

In Adam:

  • \(m_t\) is an exponential moving average of the gradients. You can think of it as the “momentum” term. The update

    $$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t$$

    means that \(m_t\) represents a running estimate (first moment) of the gradients \(g_t\). It smooths out the noise in the gradient by combining the current gradient with the past average.

  • \(v_t\) is an exponential moving average of the squared gradients. Its update

    $$v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2$$

    accumulates the squares of the gradients, which approximates the uncentered variance (the second moment) of the gradients. This term isn’t literally “the gradient of the variance” but rather a way to measure the magnitude or variability of the gradients over time. It serves to adapt the learning rate for each parameter based on how large or noisy the gradients are.

So, in summary:

  • \(m_t\) is like the momentum of the gradients (a smoothed average of \(g_t\)),
  • \(v_t\) is like the momentum of the squared gradients (a smoothed average of \(g_t^2\)) that approximates the variance.

This dual mechanism lets Adam adjust each parameter’s update dynamically: the first moment \(m_t\) gives a direction, while the second moment \(v_t\) scales the step size so that parameters with large or noisy gradients get smaller updates.
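
As a quick numerical illustration of that trade-off, the toy scalar-gradient sequence below (made-up numbers, not from any real training run) shows how a lower $\beta_1$ makes \(m_t\) react faster when the gradient flips sign:

```python
def ema_trace(grads, beta):
    """Trace of m_t = beta * m_{t-1} + (1 - beta) * g_t over a list of scalar gradients."""
    m, trace = 0.0, []
    for g in grads:
        m = beta * m + (1 - beta) * g
        trace.append(round(m, 3))
    return trace

grads = [1.0, 1.0, 1.0, -1.0, -1.0]      # the gradient flips sign at step 4
print(ema_trace(grads, beta=0.5))         # flips sign immediately (~ -0.06 at step 4)
print(ema_trace(grads, beta=0.9))         # still positive at step 5 (~ 0.03): smoother, slower to react
```

With $\beta_1 = 0.5$ the running average changes sign on the same step the gradient does; with $\beta_1 = 0.9$ it is still positive one step later, which is the smoothing-versus-responsiveness trade-off described above.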

This post is licensed under CC BY 4.0 by the author.