D2 Optimisation Concepts

Learning Rate Effects

The most important hyperparameter: too small, just right, or explosive.

The Goldilocks Problem

If neural network training were a fairy tale, learning rate would be the porridge. Too hot and you burn everything. Too cold and you wait forever. Getting it just right is often the difference between a model that works and one that doesn't.

The learning rate is typically the first thing practitioners tune, and the wrong value can make training fail completely.

θ ← θ - α∇L(θ)

The learning rate (α) controls how big each step is. The gradient tells you which direction; the learning rate tells you how far.
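
As a concrete sketch, here is the update rule as a few lines of plain Python, minimising the toy loss L(θ) = θ². The loss, starting point, and step count are assumptions for illustration, not a real model:

# Gradient descent on the toy loss L(theta) = theta**2 (illustrative only).
theta = 5.0                        # starting parameter value (arbitrary)
alpha = 0.1                        # learning rate
for step in range(50):
    grad = 2 * theta               # dL/dtheta for L = theta**2
    theta = theta - alpha * grad   # the update rule above
print(theta)                       # ends very close to 0, the minimum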

Too Small: The Eternal Descent

When the learning rate is too small, the optimizer creeps toward the minimum at a glacial pace. Each step is so tiny that progress is barely visible.

The Symptoms

  • Loss decreases in tiny increments (0.5001 → 0.5000 → 0.4999)
  • Training takes far longer than expected
  • The marker barely moves between steps in the visualization

The Danger

You might terminate training early, thinking the model has converged when it's actually just moving too slowly. You end up with a model worse than it could have been.

Just Right: The Smooth Descent

A well-chosen learning rate produces steady, consistent progress. The loss decreases in meaningful increments without oscillating or crawling.

The Symptoms

  • Loss decreases steadily in a smooth curve
  • Training completes in reasonable epochs
  • The marker moves purposefully toward the minimum

What "Just Right" Means

There's no single correct learning rate — it depends on:

  • Your model architecture
  • Your optimizer (Adam needs lower rates than SGD)
  • Your batch size (larger batches tolerate higher rates)
  • The specific loss landscape of your problem

Too Large: The Explosion

When the learning rate is too large, the optimizer overshoots the minimum and lands on the other side. Then it overcorrects, overshooting again. At worst, the loss explodes toward infinity.

The Symptoms

  • Loss bounces around: 0.5 → 0.3 → 0.8 → 0.2 → 1.5
  • Loss suddenly spikes to inf or NaN
  • The marker jumps dramatically, zig-zags wildly, or flies off

Why It Happens

When your step is larger than the distance to the minimum, you overshoot. If the gradient on the other side is steep, you take another big step — overshooting again. This positive feedback loop amplifies until values overflow.
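
The toy loss L(θ) = θ² from earlier makes the feedback loop explicit: the update becomes θ ← (1 - 2α)θ, so once |1 - 2α| > 1 (here, any α > 1) each step multiplies the distance to the minimum instead of shrinking it. A quick sketch, with the specific α an assumption for illustration:

# On L(theta) = theta**2, each update multiplies theta by (1 - 2 * alpha).
theta = 1.0
alpha = 1.1                        # too large: |1 - 2 * alpha| = 1.2 > 1
for step in range(8):
    theta = theta - alpha * 2 * theta
    print(step, theta)             # sign flips, magnitude grows ~1.2x per step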

Typical Ranges by Optimizer

Optimizer         Typical Range      Default
Vanilla SGD       0.01 - 0.1         0.01
SGD + Momentum    0.001 - 0.1        0.01
RMSprop           0.0001 - 0.01      0.001
Adam              0.00001 - 0.001    0.001

Batch size connection: If you double the batch size, try increasing the learning rate by √2 (about 1.4x).
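
In PyTorch, for instance, the base rate is fixed when the optimizer is constructed. A rough sketch using the starting points from the table; the stand-in model and the √2 scaling helper are illustrative assumptions:

import math
import torch

model = torch.nn.Linear(10, 1)     # stand-in model for illustration

# Typical starting points from the table (you would pick one in practice)
sgd  = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical helper applying the square-root batch-size heuristic
def scaled_lr(base_lr, base_batch, new_batch):
    return base_lr * math.sqrt(new_batch / base_batch)

print(scaled_lr(0.01, base_batch=64, new_batch=128))   # ~0.0141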

Learning Rate Schedules

A fixed learning rate is often not optimal for the whole of training. Early on, you want big steps to cover ground quickly; later, smaller steps let the optimizer settle precisely into the minimum.

Step Decay

Reduce by fixed factor at specific epochs:

Epoch 1-30:   α = 0.01
Epoch 31-60:  α = 0.001
Epoch 61+:    α = 0.0001
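
In PyTorch this exact schedule can be written with the built-in MultiStepLR scheduler; the model and epoch count below are illustrative assumptions:

import torch

model = torch.nn.Linear(10, 1)     # illustrative model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Multiply the learning rate by 0.1 on entering epochs 31 and 61
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one epoch of training would go here ...
    scheduler.step()               # advance the schedule once per epoch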

Cosine Annealing

The learning rate follows a cosine curve: it falls slowly at first, drops most aggressively in the middle of training, and flattens out as it approaches the minimum value. Some variants restart the schedule at intervals ("warm restarts").
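
The curve itself is simple to write down; PyTorch ships it as CosineAnnealingLR, but a minimal sketch (the maximum, minimum, and step count are illustrative assumptions) looks like this:

import math

def cosine_lr(step, total_steps, lr_max=0.01, lr_min=0.0001):
    # Cosine curve from lr_max down to lr_min over total_steps
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 100))           # 0.01    (start of training)
print(cosine_lr(50, 100))          # ~0.005  (midpoint, steepest part of the curve)
print(cosine_lr(100, 100))         # 0.0001  (end of training)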

Warmup

Start with a very small learning rate and gradually increase it before the main schedule begins. Warmup is especially important for large-batch training and transformers, where early gradients can be unstable while the weights are still close to their random initialisation.
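
A minimal linear-warmup sketch (the warmup length and base rate are illustrative assumptions; after warmup, the main schedule would take over):

def warmup_lr(step, warmup_steps=1000, base_lr=1e-3):
    # Ramp linearly from near 0 up to base_lr over warmup_steps
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr                 # hand off to the main schedule here

print(warmup_lr(0))                # 1e-06: a tiny first step
print(warmup_lr(999))              # 0.001: warmup complete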

Why This Matters for ML/AI

When training goes wrong, learning rate is often the culprit. Before investigating complex issues:

  • Loss not decreasing? Try a higher learning rate
  • Loss unstable or exploding? Try a lower learning rate
  • Training suddenly failed? Check for learning rate issues

Adam and similar adaptive optimizers are more forgiving, but you still need to choose a base rate. Some research suggests well-tuned SGD with a good schedule can outperform Adam.

Building Your Intuition

Play with the learning rate slider in the visualization:

  • At 0.001: Count how many steps to converge
  • At 0.05: See smooth progress
  • At 0.3: Watch the oscillation begin
  • At 0.5+: See divergence in action

Try on different surfaces — the "just right" rate depends on terrain. The ravine punishes high learning rates more than the simple bowl.
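
If you would rather experiment in code, the same four regimes appear on a toy quadratic. Here the loss L(θ) = 2.5θ² (gradient 5θ) is an assumption chosen so the thresholds roughly match the slider values above:

# Toy loss L(theta) = 2.5 * theta**2, so the gradient is 5 * theta.
def run(alpha, steps=20, theta=1.0):
    history = []
    for _ in range(steps):
        theta = theta - alpha * 5 * theta
        history.append(round(theta, 4))
    return history

print(run(0.001)[:3])              # [0.995, 0.99, 0.9851]: sluggish creep
print(run(0.05)[:3])               # [0.75, 0.5625, 0.4219]: smooth descent
print(run(0.3)[:3])                # [-0.5, 0.25, -0.125]: oscillation
print(run(0.5)[:3])                # [-1.5, 2.25, -3.375]: divergence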

The goal isn't to memorise numbers. It's to recognise patterns: the sluggish creep of too-small, the erratic bouncing of too-large, and the purposeful descent of just-right.