The Noise-Accuracy Trade-off
So far, we've treated gradient descent as if there's one clean gradient pointing downhill. But in real machine learning, the gradient depends on your data — and how much data you use to estimate that gradient changes everything.
Use all your data? You get a perfect gradient but wait forever. Use one sample? Lightning fast but wildly noisy. The sweet spot in the middle is where all of modern deep learning lives.
The loss gradient over the dataset is just the average of the per-sample gradients, one for each of your N data points. The question is: do you really need all N of them to take a single step?
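A minimal NumPy sketch of that statement, using least-squares linear regression as a stand-in model (the data, weights, and helper names are made up for illustration): the full-data gradient is exactly the mean of the per-sample gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # N = 1000 samples, 5 features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)                                     # current weights

def per_sample_grad(w, x_i, y_i):
    # gradient of 0.5 * (x_i . w - y_i)^2 with respect to w
    return (x_i @ w - y_i) * x_i

# Average of the 1000 individual gradients...
full_grad = np.mean([per_sample_grad(w, x, t) for x, t in zip(X, y)], axis=0)
# ...is the same as the gradient computed over the whole dataset at once.
full_grad_vec = X.T @ (X @ w - y) / len(y)
assert np.allclose(full_grad, full_grad_vec)
```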
Batch (Full) Gradient Descent
Compute the gradient using every single data point before taking one step. With 50,000 training images, you process all 50,000 just to update the weights once.
The Pros
- Accurate gradient: True direction of steepest descent
- Smooth convergence: Clean, direct paths — no zigzagging
- Deterministic: Same starting point, same path, every time
The Cons
- Painfully slow: Each update requires the entire dataset
- Memory hungry: Often need whole dataset in memory
- Gets stuck easily: Follows gradient precisely into local minima
Rarely used in modern deep learning — primarily a teaching tool.
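For concreteness, here is a bare-bones sketch of batch gradient descent on a toy least-squares problem; the learning rate and iteration count are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # the "whole dataset": N = 1000
y = X @ rng.normal(size=5)
w = np.zeros(5)
lr = 0.1

for epoch in range(100):
    grad = X.T @ (X @ w - y) / len(y)   # gradient computed over ALL samples
    w -= lr * grad                      # exactly one weight update per epoch
```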
Stochastic Gradient Descent (SGD)
The opposite extreme: compute gradient from one randomly chosen sample, then immediately update.
With 50,000 images, you make 50,000 weight updates per epoch instead of one.
The Pros
- Blazingly fast updates: Process just one sample per step
- Memory efficient: Only one sample in memory at a time
- Escapes local minima: The noise lets the optimizer jump out of shallow traps
The Cons
- Extremely noisy: Single sample = terrible gradient estimate
- Erratic path: Two steps forward, one sideways, half back
- Wastes hardware: GPUs are built for parallel batches; feeding one sample at a time leaves most of the chip idle
The Noise Advantage
SGD's noise is genuinely helpful. Empirically, stochastic methods often find minima that generalise better: the noise acts like implicit regularisation, discouraging overfitting. It's like shaking a box of sand until it settles into a stable arrangement.
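A single-sample SGD loop on the same kind of toy least-squares problem (names and constants are illustrative): one noisy update per sample, N updates per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)
lr = 0.01

for epoch in range(10):
    for i in rng.permutation(len(y)):        # visit samples in random order
        grad = (X[i] @ w - y[i]) * X[i]      # noisy estimate from ONE sample
        w -= lr * grad                       # 1,000 updates per epoch
```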
Mini-batch: The Practical Middle Ground
Compute the gradient using a subset of the data, typically 32 to 512 samples, then update.
With 50,000 images and batch size 64, you make roughly 780 updates per epoch (50,000 ÷ 64 ≈ 781); a sketch of the loop appears after the list below.
Why Mini-batch Wins
- Good gradient estimates: 64 samples much better than 1, still noisy enough to escape minima
- Hardware utilisation: GPUs process batches nearly as fast as single images
- Memory balance: 64 samples fit comfortably in GPU memory
- Regularisation effect: Controlled noise still provides benefits
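Here is the mini-batch loop under the same toy setup, with an assumed batch size of 64: each update averages the gradient over one shuffled slice of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)
lr, batch_size = 0.05, 64

for epoch in range(20):
    idx = rng.permutation(len(y))                   # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]           # one mini-batch of up to 64 samples
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient averaged over the batch
        w -= lr * grad                              # ~16 updates per epoch here
```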
The Universal Standard
When practitioners say "SGD" today, they almost always mean mini-batch gradient descent. True single-sample SGD is rarely used.
Batch Size Effects
Batch size isn't just about speed — it affects what your model learns.
Larger Batches
- More accurate gradients, more stable training
- Faster wall-clock time (better GPU utilisation)
- Higher memory usage
- Worse generalisation: Can settle into "sharp" minima that generalise poorly to unseen data
Smaller Batches
- Noisier gradients = more regularisation
- More updates per epoch
- Lower memory usage
- Better generalisation: Often finds "flat" minima
Common Choices
| Batch Size | Use Case |
|---|---|
| 16-32 | Limited GPU memory, want more noise |
| 64-128 | General-purpose default |
| 256-512 | Large-scale training with big GPUs |
| 1000+ | Distributed training across many GPUs |
Learning rate connection: when you increase the batch size, increase the learning rate too (the "linear scaling rule" heuristic). Double the batch → try doubling the LR.
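A tiny sketch of that heuristic; it's a starting point, not a guarantee, and the base values here are made up:

```python
base_lr, base_batch = 0.1, 64                 # assumed reference configuration

def scaled_lr(batch_size):
    # scale the learning rate linearly with the batch size
    return base_lr * batch_size / base_batch

print(scaled_lr(64))    # 0.1 -> unchanged at the reference batch size
print(scaled_lr(128))   # 0.2 -> batch doubled, LR doubled
print(scaled_lr(512))   # 0.8
```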
What You'll See in the Visualization
- Batch GD: Smooth curve directly toward minimum — elegant but slow
- Stochastic GD: Zigzags wildly but explores broadly — noise is visible as jitter
- Mini-batch GD: Controlled wobble — the practical compromise
Adjust the batch size slider and watch the path smooth out toward batch GD behaviour or pick up SGD-like jitter.
Why This Matters for ML/AI
Mini-batch is the universal default: virtually every modern neural network is trained with it.
Batch size is a hyperparameter that affects:
- Training speed (larger = faster per epoch)
- Memory requirements
- Generalisation performance
- Optimal learning rate
Noise as regularisation: The noise from stochastic methods isn't just tolerated — it's beneficial. Even with infinite compute, you wouldn't necessarily want true batch GD.
The Terminology Trap
- When papers say "SGD", they usually mean mini-batch with momentum
- When tutorials say "batch", they sometimes mean mini-batch
- When frameworks offer "batch_size", they mean mini-batch size
- True single-sample updates are called "online learning"
In practice, everything is mini-batch, regardless of what it's called.
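To see the terminology in action, here's a PyTorch sketch (the model, tensors, and constants are invented for illustration): the optimiser class is named SGD, but the DataLoader's batch_size means every update is a mini-batch step, here with momentum on top.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 5)
y = X @ torch.randn(5)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)  # "SGD" in name only
loss_fn = nn.MSELoss()

for xb, yb in loader:                 # each xb holds up to 64 samples, not 1
    opt.zero_grad()
    loss = loss_fn(model(xb).squeeze(-1), yb)
    loss.backward()
    opt.step()                        # one mini-batch update
```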