D1 Optimisation Concepts

Gradient Descent Variants

From vanilla to Adam: comparing optimizers on the same terrain.

The Journey Down the Mountain

Imagine you're blindfolded on a mountainside, trying to reach the lowest valley. You can only feel the slope directly beneath your feet. Gradient descent is exactly this — taking steps downhill based on local information, hoping to find the bottom.

But how you take those steps matters enormously. Step too cautiously and you'll take forever. Step too aggressively and you'll overshoot and oscillate. Step without memory and you'll get stuck in every small dip along the way.

This is why we have different optimizers. Each represents a different strategy for navigating that mountain — and watching them race across the same terrain reveals their personalities beautifully.

1. Vanilla Gradient Descent

The most straightforward approach: measure the slope, step downhill. Repeat.

θ = θ - α∇L

In plain English: new position = old position - (learning rate × gradient)
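
As a minimal sketch in Python (the loss f(θ) = θ², its gradient 2θ, and the step count are illustrative choices, not anything prescribed above):

# Vanilla gradient descent on f(theta) = theta**2, whose gradient is 2*theta.
def grad(theta):
    return 2.0 * theta

theta = 5.0  # arbitrary starting position
alpha = 0.1  # learning rate

for _ in range(50):
    theta = theta - alpha * grad(theta)  # theta = theta - alpha * gradient

print(theta)  # ends close to the minimum at 0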

Where It Struggles

Watch vanilla SGD on the "Ravine" surface. You'll see it oscillating back and forth across the narrow valley while making slow progress along the valley floor.

Why? The gradient points toward the nearest downhill direction, which is often across the ravine rather than along it.

Three core problems:

  • Ravines cause oscillation: Steep in one direction, shallow in another → bouncing between walls (see the sketch after this list)
  • Local minima are traps: No memory means stopping at any flat spot
  • Learning rate is critical: Too high = diverge, too low = forever
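
To make the oscillation concrete, here is a tiny sketch on an illustrative ravine, f(x, y) = x² + 10y² (steep along y, shallow along x); the surface and step size are assumptions for demonstration, not the visualization's actual terrain:

# Vanilla GD on f(x, y) = x**2 + 10*y**2: y bounces between the walls
# while x creeps along the valley floor.
x, y = -4.0, 1.0
alpha = 0.095  # near the stability limit for the steep y direction

for step in range(8):
    gx, gy = 2.0 * x, 20.0 * y  # gradient of f
    x, y = x - alpha * gx, y - alpha * gy
    print(f"step {step}: x = {x:+.3f}, y = {y:+.3f}")

The printed y values flip sign every step (the bouncing between walls), while x shrinks by only about 19% per step.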

2. Momentum: The Rolling Ball

Instead of a cautious walker, imagine a ball rolling downhill. It has velocity — it accumulates speed when rolling consistently and resists sudden changes.

v = βv + ∇L
θ = θ - αv

The β (typically 0.9) means we keep 90% of the previous velocity at each step.
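
A sketch of momentum on the same illustrative ravine (the α and β values here are assumptions):

# Momentum on f(x, y) = x**2 + 10*y**2.
x, y = -4.0, 1.0
vx, vy = 0.0, 0.0
alpha, beta = 0.01, 0.9

for _ in range(150):
    gx, gy = 2.0 * x, 20.0 * y
    vx = beta * vx + gx  # v = beta*v + gradient
    vy = beta * vy + gy
    x, y = x - alpha * vx, y - alpha * vy  # theta = theta - alpha*v

print(x, y)  # the alternating-sign vy contributions partly cancel;
             # the consistent-sign vx contributions accumulate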

Why This Helps

  • Dampens oscillations: Side-to-side movements cancel out; consistent downhill direction accumulates
  • Escapes shallow minima: Momentum carries through small bumps
  • Accelerates in consistent directions: Velocity builds up, effectively increasing the learning rate

Think of β as controlling ball "weight": 0.9 = bowling ball (smooth), 0.99 = boulder (massive momentum, can overshoot).

3. RMSprop: The Adaptive Walker

Different parameters need different learning rates. Frequent-feature weights get large gradients; rare-feature weights get small ones. Same learning rate for both = problems.

S = ρS + (1-ρ)g²
θ = θ - α × g / √(S + ε)

S tracks a running average of squared gradients. We divide the gradient by √(S + ε) to normalize; the tiny ε prevents division by zero.
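
In code, on the same illustrative ravine (ρ = 0.9 and ε = 1e-8 are common choices, assumed here, as is α):

# RMSprop on f(x, y) = x**2 + 10*y**2: each parameter gets its own
# effective step size.
x, y = -4.0, 1.0
sx, sy = 0.0, 0.0
alpha, rho, eps = 0.05, 0.9, 1e-8

for _ in range(200):
    gx, gy = 2.0 * x, 20.0 * y
    sx = rho * sx + (1 - rho) * gx**2      # S = rho*S + (1-rho)*g^2
    sy = rho * sy + (1 - rho) * gy**2
    x = x - alpha * gx / (sx + eps) ** 0.5  # divide by sqrt(S + eps)
    y = y - alpha * gy / (sy + eps) ** 0.5

print(x, y)  # both coordinates make similar per-step progress
             # despite very different gradient scales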

The Effect

  • Large gradients → large S → smaller effective learning rate
  • Small gradients → small S → larger effective learning rate

All parameters make roughly similar proportional progress, regardless of gradient magnitudes.
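
A quick worked example (assuming S has settled near g², so √S ≈ |g|): with α = 0.01, a parameter seeing g = 100 steps by roughly 0.01 × 100 / 100 = 0.01, while a parameter seeing g = 0.001 steps by roughly 0.01 × 0.001 / 0.001 = 0.01. Same pace, despite a 100,000× gap in gradient magnitude.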

4. Adam: Best of Both Worlds

Why choose between momentum and adaptive learning rates? Adam maintains two running averages:

  • First moment (m): Mean of gradients — this is momentum
  • Second moment (v): Mean of squared gradients — this is RMSprop

m = β₁m + (1-β₁)g
v = β₂v + (1-β₂)g²
θ = θ - α × m̂ / (√v̂ + ε)

Adam also includes bias correction, m̂ = m / (1 - β₁ᵗ) and v̂ = v / (1 - β₂ᵗ), which compensates for both averages starting at zero so early steps aren't too small.
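
Putting it together as a sketch on the same illustrative ravine (hyperparameters follow the defaults listed below, except α, which is raised so the toy problem converges in fewer steps; t is the step counter used for bias correction):

# Adam on f(x, y) = x**2 + 10*y**2, with bias correction.
x, y = -4.0, 1.0
mx, my = 0.0, 0.0  # first moment (momentum)
vx, vy = 0.0, 0.0  # second moment (RMSprop-style)
alpha, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8  # alpha raised from the 0.001 default

for t in range(1, 1001):
    gx, gy = 2.0 * x, 20.0 * y
    mx, my = b1 * mx + (1 - b1) * gx, b1 * my + (1 - b1) * gy
    vx, vy = b2 * vx + (1 - b2) * gx**2, b2 * vy + (1 - b2) * gy**2
    mx_hat, my_hat = mx / (1 - b1**t), my / (1 - b1**t)  # bias-corrected m
    vx_hat, vy_hat = vx / (1 - b2**t), vy / (1 - b2**t)  # bias-corrected v
    x = x - alpha * mx_hat / (vx_hat**0.5 + eps)
    y = y - alpha * my_hat / (vy_hat**0.5 + eps)

print(x, y)  # both coordinates end near the minimum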

Default Hyperparameters

  • α = 0.001 (learning rate)
  • β₁ = 0.9 (momentum decay)
  • β₂ = 0.999 (RMSprop decay)

These defaults work surprisingly well across many problems. Adam is the "just works" optimizer.

Comparison at a Glance

Aspect               Vanilla    Momentum    RMSprop       Adam
Memory               None       1 buffer    1 buffer      2 buffers
Handles ravines      Poorly     Well        Moderately    Well
Escapes minima       Poorly     Well        Moderately    Well
Adaptive per-param   No         No          Yes           Yes

Why This Matters for ML/AI

Start with Adam. It handles most situations competently without much tuning.

Try SGD with momentum if you need absolute best final performance or are fine-tuning a pretrained model.

Use RMSprop for RNNs or problems where the loss landscape shifts during training.
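
In practice you rarely hand-roll these updates. Here is a sketch of what the choice looks like in PyTorch (the model and learning rates are placeholders, not recommendations):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model

# The "just works" default:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Chasing best final performance, or fine-tuning a pretrained model:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# RNNs or shifting loss landscapes:
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)

# In the training loop, whichever you picked:
# optimizer.zero_grad(); loss.backward(); optimizer.step()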

Diagnosing Training Problems

  • Loss oscillating wildly? Learning rate too high, or try momentum
  • Loss decreasing painfully slowly? Rate too low, or stuck in plateau
  • Some weights exploding, others dead? Switch to adaptive optimizer

Key Takeaways

  1. Momentum = "remember which way I've been going"
  2. RMSprop = "scale steps by how much each parameter usually changes"
  3. Adam = "do both"
  4. Watch the visualization — see how they behave differently on different terrains
  5. The choice of optimizer affects which solution your model converges to, not just how fast it gets there