The Journey Down the Mountain
Imagine you're blindfolded on a mountainside, trying to reach the lowest valley. You can only feel the slope directly beneath your feet. Gradient descent is exactly this — taking steps downhill based on local information, hoping to find the bottom.
But how you take those steps matters enormously. Step too cautiously and you'll take forever. Step too aggressively and you'll overshoot and oscillate. Step without memory and you'll get stuck in every small dip along the way.
This is why we have different optimizers. Each represents a different strategy for navigating that mountain — and watching them race across the same terrain reveals their personalities beautifully.
1. Vanilla Gradient Descent
The most straightforward approach: measure the slope, step downhill. Repeat.
In plain English: new position = old position - (learning rate × gradient). In symbols, matching the later sections: θ = θ - α × g, where g is the gradient of the loss at θ.
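A minimal sketch of that update in Python (the toy loss f(θ) = θ², its gradient, and all the numbers are illustrative assumptions, not from any particular library):

```python
# Vanilla gradient descent on a toy loss f(theta) = theta**2,
# whose gradient is 2*theta. All values are illustrative.
def grad_f(theta):
    return 2 * theta

theta = 5.0    # starting position on the "mountain"
alpha = 0.1    # learning rate
for _ in range(50):
    theta = theta - alpha * grad_f(theta)  # new = old - lr * gradient
print(theta)   # approaches 0.0, the minimum
```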
Where It Struggles
Watch vanilla SGD on the "Ravine" surface. You'll see it oscillating back and forth across the narrow valley while making slow progress along the valley floor.
Why? The negative gradient points in the direction of steepest local descent, which in a ravine is usually across the walls rather than along the floor.
Three core problems:
- Ravines cause oscillation: Steep in one direction, shallow in another → bouncing between walls
- Local minima are traps: No memory means stopping at any flat spot
- Learning rate is critical: Too high = diverge, too low = forever
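You can reproduce the oscillation numerically with a toy ravine, say f(x, y) = 0.05x² + 10y² (very shallow along x, steep along y); the surface and step size are assumptions chosen for illustration:

```python
import numpy as np

# Toy ravine: f(x, y) = 0.05*x**2 + 10*y**2, so grad = (0.1*x, 20*y).
def grad(p):
    x, y = p
    return np.array([0.1 * x, 20.0 * y])

p = np.array([-5.0, 1.0])  # start partway up the ravine wall
alpha = 0.09               # just below the divergence threshold for y
for step in range(10):
    p = p - alpha * grad(p)
    print(step, p)  # y flips sign every step; x barely moves
```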
2. Momentum: The Rolling Ball
Instead of a cautious walker, imagine a ball rolling downhill. It has velocity — it accumulates speed when rolling consistently and resists sudden changes.
v = βv + g
θ = θ - αv
The velocity v is a running blend of past gradients: β (typically 0.9) means we keep 90% of the previous velocity each step and add the current gradient.
Why This Helps
- Dampens oscillations: Side-to-side movements cancel out; consistent downhill direction accumulates
- Escapes shallow minima: Momentum carries through small bumps
- Accelerates in consistent directions: Velocity builds up, effectively increasing learning rate
Think of β as controlling ball "weight": 0.9 = bowling ball (smooth), 0.99 = boulder (massive momentum, can overshoot).
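A sketch of the acceleration effect on a consistent slope (the constant gradient of 1.0 is an illustrative assumption):

```python
# Momentum sketch: on a consistently sloped surface, velocity builds up
# toward 1/(1-beta) times the raw gradient. All values are illustrative.
theta, v = 0.0, 0.0
alpha, beta = 0.01, 0.9
for _ in range(100):
    g = 1.0                    # constant downhill gradient
    v = beta * v + g           # keep 90% of the old velocity, add the gradient
    theta = theta - alpha * v  # step along the accumulated velocity
print(v)  # approaches 1 / (1 - 0.9) = 10: a 10x effective step size
```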
3. RMSprop: The Adaptive Walker
Different parameters need different learning rates. Frequent-feature weights get large gradients; rare-feature weights get small ones. Same learning rate for both = problems.
S = βS + (1-β)g²
θ = θ - α × g / √(S + ε)
S tracks a running average of squared gradients (β here is typically 0.9; ε is a tiny constant that prevents division by zero). Dividing the gradient by √S normalizes each parameter's step.
The Effect
- Large gradients → large S → smaller effective learning rate
- Small gradients → small S → larger effective learning rate
All parameters make roughly similar proportional progress, regardless of gradient magnitudes.
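A small sketch of that normalization (the 100× gradient imbalance between the two parameters is an assumption for illustration):

```python
import numpy as np

# RMSprop sketch: two parameters whose gradients differ by 100x
# still take similar-sized steps. All values are illustrative.
theta = np.array([1.0, 1.0])
S = np.zeros(2)
alpha, beta, eps = 0.01, 0.9, 1e-8
for _ in range(100):
    g = np.array([100.0, 1.0]) * theta    # param 0 sees 100x larger gradients
    S = beta * S + (1 - beta) * g**2      # running average of squared gradients
    theta = theta - alpha * g / np.sqrt(S + eps)
print(theta)  # both parameters have shrunk by a comparable amount
```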
4. Adam: Best of Both Worlds
Why choose between momentum and adaptive learning rates? Adam maintains two running averages:
- First moment (m): Mean of gradients — this is momentum
- Second moment (v): Mean of squared gradients — this is RMSprop
m = β₁m + (1-β₁)g
v = β₂v + (1-β₂)g²
m̂ = m / (1-β₁ᵗ), v̂ = v / (1-β₂ᵗ)
θ = θ - α × m̂ / (√v̂ + ε)
Because m and v start at zero, the bias-corrected m̂ and v̂ compensate for that initialization; without them, the earliest steps would be far too small.
Default Hyperparameters
- α = 0.001 (learning rate)
- β₁ = 0.9 (momentum decay)
- β₂ = 0.999 (RMSprop decay)
These defaults work surprisingly well across many problems. Adam is the "just works" optimizer.
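Putting the pieces together, a minimal sketch using those defaults (the toy loss f(θ) = θ² and the iteration count are illustrative assumptions):

```python
import numpy as np

# Adam sketch with the default hyperparameters above.
def adam_step(theta, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g      # first moment: momentum
    v = beta2 * v + (1 - beta2) * g**2   # second moment: RMSprop-style scaling
    m_hat = m / (1 - beta1**t)           # bias correction; t starts at 1
    v_hat = v / (1 - beta2**t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    g = 2 * theta                        # gradient of the toy loss theta**2
    theta, m, v = adam_step(theta, m, v, g, t)
print(theta)  # near 0: steps stay roughly alpha-sized regardless of gradient scale
```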
Comparison at a Glance
| Aspect | Vanilla | Momentum | RMSprop | Adam |
|---|---|---|---|---|
| Memory | None | 1 buffer | 1 buffer | 2 buffers |
| Handles ravines | Poorly | Well | Moderately | Well |
| Escapes shallow minima | Poorly | Well | Moderately | Well |
| Adaptive per-param | No | No | Yes | Yes |
Why This Matters for ML/AI
Start with Adam. It handles most situations competently without much tuning.
Try SGD with momentum if you need the absolute best final performance or are fine-tuning a pretrained model.
Use RMSprop for RNNs or problems where the loss landscape shifts during training.
Diagnosing Training Problems
- Loss oscillating wildly? Learning rate too high, or try momentum
- Loss decreasing painfully slowly? Learning rate too low, or you're stuck on a plateau
- Some weights exploding while others stay dead? Switch to an adaptive optimizer
Key Takeaways
- Momentum = "remember which way I've been going"
- RMSprop = "scale each parameter's step by how large its gradients usually are"
- Adam = "do both"
- Watch the visualization — see how they behave differently on different terrains
- The choice of optimizer affects which solution your model converges to, not just how quickly it gets there