The Noise-Accuracy Trade-off
So far, we've treated gradient descent as if there's one clean gradient pointing downhill. But in real machine learning, the gradient depends on your data — and how much data you use to estimate that gradient changes everything.
Use all your data? You get a perfect gradient but wait forever. Use one sample? Lightning fast but wildly noisy. The sweet spot in the middle is where all of modern deep learning lives.
The loss gradient over the dataset is just the average of the per-sample gradients, one for each of your N data points. The question is: do you really need all N of them to take a single step?
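A minimal NumPy sketch of that statement, using least-squares linear regression as a stand-in model (the data, weights, and helper names are made up for illustration): the full-data gradient is exactly the mean of the per-sample gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # N = 1000 samples, 5 features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)                                     # current weights

def per_sample_grad(w, x_i, y_i):
    # gradient of 0.5 * (x_i . w - y_i)^2 with respect to w
    return (x_i @ w - y_i) * x_i

# Average of the 1000 individual gradients...
full_grad = np.mean([per_sample_grad(w, x, t) for x, t in zip(X, y)], axis=0)
# ...is the same as the gradient computed over the whole dataset at once.
full_grad_vec = X.T @ (X @ w - y) / len(y)
assert np.allclose(full_grad, full_grad_vec)
```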
Batch (Full) Gradient Descent
Compute the gradient using every single data point before taking one step. With 50,000 training images, you process all 50,000 just to update the weights once.
The Pros
- Accurate gradient: True direction of steepest descent
- Smooth convergence: Clean, direct paths — no zigzagging
- Deterministic: Same starting point, same path, every time
The Cons
- Painfully slow: Each update requires the entire dataset
- Memory hungry: Often need whole dataset in memory
- Gets stuck easily: Follows gradient precisely into local minima
Rarely used in modern deep learning — primarily a teaching tool.
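For concreteness, here is a bare-bones sketch of batch gradient descent on a toy least-squares problem; the learning rate and iteration count are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # the "whole dataset": N = 1000
y = X @ rng.normal(size=5)
w = np.zeros(5)
lr = 0.1

for epoch in range(100):
    grad = X.T @ (X @ w - y) / len(y)   # gradient computed over ALL samples
    w -= lr * grad                      # exactly one weight update per epoch
```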
Stochastic Gradient Descent (SGD)
The opposite extreme: compute gradient from one randomly chosen sample, then immediately update.
With 50,000 images, you make 50,000 weight updates per epoch instead of one.
The Pros
- Blazingly fast updates: Process just one sample per step
- Memory efficient: Only one sample in memory at a time
- Escapes local minima: The noise lets the optimizer jump out of shallow traps
The Cons
- Extremely noisy: Single sample = terrible gradient estimate
- Erratic path: Two steps forward, one sideways, half back
- Wastes hardware: GPUs are built for parallel batches; feeding one sample at a time leaves most of the chip idle
The Noise Advantage
SGD's noise is genuinely helpful. Empirically, stochastic methods often find minima that generalise better: the noise acts like implicit regularisation, discouraging overfitting. It's like shaking a box of sand until it settles into a stable arrangement.
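A single-sample SGD loop on the same kind of toy least-squares problem (names and constants are illustrative): one noisy update per sample, N updates per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)
lr = 0.01

for epoch in range(10):
    for i in rng.permutation(len(y)):        # visit samples in random order
        grad = (X[i] @ w - y[i]) * X[i]      # noisy estimate from ONE sample
        w -= lr * grad                       # 1,000 updates per epoch
```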
Mini-batch: The Practical Middle Ground
Compute the gradient using a subset of the data, typically 32 to 512 samples, then update.
With 50,000 images and batch size 64, you make roughly 780 updates per epoch (50,000 ÷ 64 ≈ 781); a sketch of the loop appears after the list below.
Why Mini-batch Wins
- Good gradient estimates: 64 samples much better than 1, still noisy enough to escape minima
- Hardware utilisation: GPUs process batches nearly as fast as single images
- Memory balance: 64 samples fit comfortably in GPU memory
- Regularisation effect: Controlled noise still provides benefits
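Here is the mini-batch loop under the same toy setup, with an assumed batch size of 64: each update averages the gradient over one shuffled slice of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)
lr, batch_size = 0.05, 64

for epoch in range(20):
    idx = rng.permutation(len(y))                   # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]           # one mini-batch of up to 64 samples
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient averaged over the batch
        w -= lr * grad                              # ~16 updates per epoch here
```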
The Universal Standard
When practitioners say "SGD" today, they almost always mean mini-batch gradient descent. True single-sample SGD is rarely used.
Batch Size Effects
Batch size isn't just about speed — it affects what your model learns.
Larger Batches
- More accurate gradients, more stable training
- Faster wall-clock time (better GPU utilisation)
- Higher memory usage
- Worse generalisation: Can settle into "sharp" minima that generalise poorly to unseen data
Smaller Batches
- Noisier gradients = more regularisation
- More updates per epoch
- Lower memory usage
- Better generalisation: Often finds "flat" minima
Common Choices
| Batch Size | Use Case |
|---|---|
| 16-32 | Limited GPU memory, want more noise |
| 64-128 | General-purpose default |
| 256-512 | Large-scale training with big GPUs |
| 1000+ | Distributed training across many GPUs |
Learning rate connection: when you increase the batch size, increase the learning rate too (the "linear scaling rule" heuristic). Double the batch → try doubling the LR.
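A tiny sketch of that heuristic; it's a starting point, not a guarantee, and the base values here are made up:

```python
base_lr, base_batch = 0.1, 64                 # assumed reference configuration

def scaled_lr(batch_size):
    # scale the learning rate linearly with the batch size
    return base_lr * batch_size / base_batch

print(scaled_lr(64))    # 0.1 -> unchanged at the reference batch size
print(scaled_lr(128))   # 0.2 -> batch doubled, LR doubled
print(scaled_lr(512))   # 0.8
```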
What You'll See in the Visualization
- Batch GD: Smooth curve directly toward minimum — elegant but slow
- Stochastic GD: Zigzags wildly but explores broadly — noise is visible as jitter
- Mini-batch GD: Controlled wobble — the practical compromise
Adjust the batch size slider and watch the path smooth out toward batch GD behaviour or pick up SGD-like jitter.
Why This Matters for ML/AI
Mini-batch is the universal default: virtually every modern neural network is trained with it.
Batch size is a hyperparameter that affects:
- Training speed (larger = faster per epoch)
- Memory requirements
- Generalisation performance
- Optimal learning rate
Noise as regularisation: The noise from stochastic methods isn't just tolerated — it's beneficial. Even with infinite compute, you wouldn't necessarily want true batch GD.
The Terminology Trap
- When papers say "SGD", they usually mean mini-batch with momentum
- When tutorials say "batch", they sometimes mean mini-batch
- When frameworks offer "batch_size", they mean mini-batch size
- True single-sample updates are called "online learning"
In practice, everything is mini-batch, regardless of what it's called.
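To see the terminology in action, here's a PyTorch sketch (the model, tensors, and constants are invented for illustration): the optimiser class is named SGD, but the DataLoader's batch_size means every update is a mini-batch step, here with momentum on top.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 5)
y = X @ torch.randn(5)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)  # "SGD" in name only
loss_fn = nn.MSELoss()

for xb, yb in loader:                 # each xb holds up to 64 samples, not 1
    opt.zero_grad()
    loss = loss_fn(model(xb).squeeze(-1), yb)
    loss.backward()
    opt.step()                        # one mini-batch update
```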