The Journey Down the Mountain
Imagine you're blindfolded on a mountainside, trying to reach the lowest valley. You can only feel the slope directly beneath your feet. Gradient descent is exactly this — taking steps downhill based on local information, hoping to find the bottom.
But how you take those steps matters enormously. Step too cautiously and you'll take forever. Step too aggressively and you'll overshoot and oscillate. Step without memory and you'll get stuck in every small dip along the way.
This is why we have different optimizers. Each represents a different strategy for navigating that mountain — and watching them race across the same terrain reveals their personalities beautifully.
1. Vanilla Gradient Descent
The most straightforward approach: measure the slope, step downhill. Repeat.
In plain English: new position = old position - (learning rate × gradient). In symbols, matching the later sections: θ = θ - α × g, where g is the gradient of the loss at θ.
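A minimal sketch of that update in Python (the toy loss f(θ) = θ², its gradient, and all the numbers are illustrative assumptions, not from any particular library):

```python
# Vanilla gradient descent on a toy loss f(theta) = theta**2,
# whose gradient is 2*theta. All values are illustrative.
def grad_f(theta):
    return 2 * theta

theta = 5.0    # starting position on the "mountain"
alpha = 0.1    # learning rate
for _ in range(50):
    theta = theta - alpha * grad_f(theta)  # new = old - lr * gradient
print(theta)   # approaches 0.0, the minimum
```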
Where It Struggles
Watch vanilla SGD on the "Ravine" surface. You'll see it oscillating back and forth across the narrow valley while making slow progress along the valley floor.
Why? The negative gradient points in the direction of steepest local descent, which in a ravine is usually across the walls rather than along the floor.
Three core problems:
- Ravines cause oscillation: Steep in one direction, shallow in another → bouncing between walls
- Local minima are traps: No memory means stopping at any flat spot
- Learning rate is critical: Too high = diverge, too low = forever
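You can reproduce the oscillation numerically with a toy ravine, say f(x, y) = 0.05x² + 10y² (very shallow along x, steep along y); the surface and step size are assumptions chosen for illustration:

```python
import numpy as np

# Toy ravine: f(x, y) = 0.05*x**2 + 10*y**2, so grad = (0.1*x, 20*y).
def grad(p):
    x, y = p
    return np.array([0.1 * x, 20.0 * y])

p = np.array([-5.0, 1.0])  # start partway up the ravine wall
alpha = 0.09               # just below the divergence threshold for y
for step in range(10):
    p = p - alpha * grad(p)
    print(step, p)  # y flips sign every step; x barely moves
```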
2. Momentum: The Rolling Ball
Instead of a cautious walker, imagine a ball rolling downhill. It has velocity — it accumulates speed when rolling consistently and resists sudden changes.
v = βv + g
θ = θ - αv
The velocity v is a running blend of past gradients: β (typically 0.9) means we keep 90% of the previous velocity each step and add the current gradient.
Why This Helps
- Dampens oscillations: Side-to-side movements cancel out; consistent downhill direction accumulates
- Escapes shallow minima: Momentum carries through small bumps
- Accelerates in consistent directions: Velocity builds up, effectively increasing learning rate
Think of β as controlling ball "weight": 0.9 = bowling ball (smooth), 0.99 = boulder (massive momentum, can overshoot).
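A sketch of the acceleration effect on a consistent slope (the constant gradient of 1.0 is an illustrative assumption):

```python
# Momentum sketch: on a consistently sloped surface, velocity builds up
# toward 1/(1-beta) times the raw gradient. All values are illustrative.
theta, v = 0.0, 0.0
alpha, beta = 0.01, 0.9
for _ in range(100):
    g = 1.0                    # constant downhill gradient
    v = beta * v + g           # keep 90% of the old velocity, add the gradient
    theta = theta - alpha * v  # step along the accumulated velocity
print(v)  # approaches 1 / (1 - 0.9) = 10: a 10x effective step size
```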
3. RMSprop: The Adaptive Walker
Different parameters need different learning rates. Frequent-feature weights get large gradients; rare-feature weights get small ones. Same learning rate for both = problems.
S = βS + (1-β)g²
θ = θ - α × g / √(S + ε)
S tracks a running average of squared gradients (β here is typically 0.9; ε is a tiny constant that prevents division by zero). Dividing the gradient by √S normalizes each parameter's step.
The Effect
- Large gradients → large S → smaller effective learning rate
- Small gradients → small S → larger effective learning rate
All parameters make roughly similar proportional progress, regardless of gradient magnitudes.
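A small sketch of that normalization (the 100× gradient imbalance between the two parameters is an assumption for illustration):

```python
import numpy as np

# RMSprop sketch: two parameters whose gradients differ by 100x
# still take similar-sized steps. All values are illustrative.
theta = np.array([1.0, 1.0])
S = np.zeros(2)
alpha, beta, eps = 0.01, 0.9, 1e-8
for _ in range(100):
    g = np.array([100.0, 1.0]) * theta    # param 0 sees 100x larger gradients
    S = beta * S + (1 - beta) * g**2      # running average of squared gradients
    theta = theta - alpha * g / np.sqrt(S + eps)
print(theta)  # both parameters have shrunk by a comparable amount
```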
4. Adam: Best of Both Worlds
Why choose between momentum and adaptive learning rates? Adam maintains two running averages:
- First moment (m): Mean of gradients — this is momentum
- Second moment (v): Mean of squared gradients — this is RMSprop
m = β₁m + (1-β₁)g
v = β₂v + (1-β₂)g²
m̂ = m / (1-β₁ᵗ), v̂ = v / (1-β₂ᵗ)
θ = θ - α × m̂ / (√v̂ + ε)
Because m and v start at zero, the bias-corrected m̂ and v̂ compensate for that initialization; without them, the earliest steps would be far too small.
Default Hyperparameters
- α = 0.001 (learning rate)
- β₁ = 0.9 (momentum decay)
- β₂ = 0.999 (RMSprop decay)
These defaults work surprisingly well across many problems. Adam is the "just works" optimizer.
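Putting the pieces together, a minimal sketch using those defaults (the toy loss f(θ) = θ² and the iteration count are illustrative assumptions):

```python
import numpy as np

# Adam sketch with the default hyperparameters above.
def adam_step(theta, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g      # first moment: momentum
    v = beta2 * v + (1 - beta2) * g**2   # second moment: RMSprop-style scaling
    m_hat = m / (1 - beta1**t)           # bias correction; t starts at 1
    v_hat = v / (1 - beta2**t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    g = 2 * theta                        # gradient of the toy loss theta**2
    theta, m, v = adam_step(theta, m, v, g, t)
print(theta)  # near 0: steps stay roughly alpha-sized regardless of gradient scale
```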
Comparison at a Glance
| Aspect | Vanilla | Momentum | RMSprop | Adam |
|---|---|---|---|---|
| Memory | None | 1 buffer | 1 buffer | 2 buffers |
| Handles ravines | Poorly | Well | Moderately | Well |
| Escapes shallow minima | Poorly | Well | Moderately | Well |
| Adaptive per-param | No | No | Yes | Yes |
Why This Matters for ML/AI
Start with Adam. It handles most situations competently without much tuning.
Try SGD with momentum if you need the absolute best final performance or are fine-tuning a pretrained model.
Use RMSprop for RNNs or problems where the loss landscape shifts during training.
Diagnosing Training Problems
- Loss oscillating wildly? Learning rate too high, or try momentum
- Loss decreasing painfully slowly? Learning rate too low, or you're stuck on a plateau
- Some weights exploding while others stay dead? Switch to an adaptive optimizer
Key Takeaways
- Momentum = "remember which way I've been going"
- RMSprop = "scale each parameter's step by how large its gradients usually are"
- Adam = "do both"
- Watch the visualization — see how they behave differently on different terrains
- The choice of optimizer affects which solution your model converges to, not just how quickly it gets there