C6 Neural Network Operations

Softmax and Cross-Entropy

From raw scores to probabilities to loss: the complete classification pipeline.

The Classification Pipeline

When a neural network classifies an image as "cat," "dog," or "bird," what actually happens inside? The network doesn't output the word "cat." It outputs numbers—raw scores for each possible class. These scores need to become probabilities, and those probabilities need to become a single number measuring how wrong the prediction was.

Raw Scores (Logits) → Softmax → Probabilities → Cross-Entropy → Loss
[2.0, 1.0, 0.1] → [0.659, 0.242, 0.099] → 0.42  (assuming the true class is Cat)

Each step has a purpose. Softmax turns arbitrary numbers into probabilities. Cross-entropy measures how far those probabilities are from the truth. Together, they form the standard ending for classification networks.

What Are Logits?

The last layer of a classification network produces logits—raw, unnormalized scores. One score per class.

Network output for 3-class problem:
  Cat:  2.0
  Dog:  1.0
  Bird: 0.1

These numbers have no restrictions: they can be positive, negative, or zero, of any magnitude, and they don't sum to anything special.

Logits encode the network's relative preferences. A higher score means the network favors that class. But logits aren't probabilities—we need to convert them.
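To make this concrete, here is a minimal NumPy sketch of a final classification layer producing logits. The feature values, weights, and biases are made-up numbers for illustration, not from any real model.

import numpy as np

# Hypothetical penultimate-layer features for one input (4 features)
features = np.array([0.5, -1.2, 0.3, 2.0])

# Hypothetical final-layer weights (3 classes x 4 features) and biases
W = np.array([[ 0.8, -0.1,  0.4,  0.6],
              [ 0.2,  0.3, -0.5,  0.4],
              [-0.3,  0.7,  0.1, -0.2]])
b = np.array([0.1, 0.0, -0.1])

logits = W @ features + b   # one raw score per class
print(logits)               # any sign, any magnitude, no special sum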

The Softmax Function

Softmax transforms logits into probabilities:

softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ

In words: take e raised to each score, then normalize by dividing by the sum of all exponentials.

Step-by-step example:

Starting with logits [2.0, 1.0, 0.1]:

Step 1: Compute exponentials

e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105

Step 2: Sum the exponentials

7.389 + 2.718 + 1.105 = 11.212

Step 3: Divide each by the sum

Cat:  7.389 / 11.212 = 0.659  (65.9%)
Dog:  2.718 / 11.212 = 0.242  (24.2%)
Bird: 1.105 / 11.212 = 0.099  (9.9%)

Verify: 0.659 + 0.242 + 0.099 = 1.000 ✓
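The same three steps written out in NumPy (a sketch of the naive computation; the numerically stable version appears later in this section):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])

exps = np.exp(logits)      # Step 1: exponentials -> [7.389, 2.718, 1.105]
total = exps.sum()         # Step 2: sum          -> 11.212
probs = exps / total       # Step 3: normalize    -> [0.659, 0.242, 0.099]

print(probs, probs.sum())  # probabilities sum to 1.0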

Why Exponentials?

The exponential function has special properties that make it perfect for this job:

  • Always positive: e^x > 0 for any x. This guarantees all probabilities are positive.
  • Preserves ordering: If z₁ > z₂, then e^z₁ > e^z₂. The class with the highest logit gets the highest probability.
  • Amplifies differences: The exponential grows rapidly. Small differences in logits become larger differences in probabilities.

When the network is confident (large differences in logits), softmax produces sharp probability distributions. When uncertain (similar logits), it produces softer distributions.
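A quick numeric sketch of this amplification, using arbitrarily chosen logits (the softmax helper here is just the formula above):

import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([4.0, 1.0, 0.0])))   # large gaps     -> sharp: ~[0.94, 0.05, 0.02]
print(softmax(np.array([1.1, 1.0, 0.9])))   # similar logits -> soft:  ~[0.37, 0.33, 0.30]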

Temperature: Controlling Confidence

Softmax has an optional parameter called temperature (T):

softmax(zᵢ; T) = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)
  • Low temperature (T = 0.5): Distribution becomes sharper—more confident
  • High temperature (T = 2.0): Distribution becomes softer—less confident
  • T → 0: Approaches argmax. One class gets probability 1
  • T → ∞: Approaches uniform distribution

Temperature is used during inference to control randomness in text generation and other sampling tasks.
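A small sketch of temperature scaling, reusing the example logits from above (the helper name softmax_with_temperature is just for illustration):

import numpy as np

def softmax_with_temperature(z, T=1.0):
    e = np.exp(z / T)      # divide logits by T before exponentiating
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, T=0.5))   # sharper:  ~[0.86, 0.12, 0.02]
print(softmax_with_temperature(logits, T=1.0))   # original: ~[0.66, 0.24, 0.10]
print(softmax_with_temperature(logits, T=2.0))   # softer:   ~[0.50, 0.30, 0.19]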

Cross-Entropy Loss

Now we have probabilities. How do we measure how wrong they are?

Cross-entropy loss compares the predicted probability distribution to the true distribution:

L = -Σᵢ yᵢ · log(pᵢ)

Where yᵢ is the true label (1 for correct class, 0 for others) and pᵢ is the predicted probability.

For classification with one-hot labels, this simplifies: only the correct class matters!

L = -log(p_correct)

Cross-entropy loss is simply the negative log of the probability assigned to the true class.
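A minimal sketch of this computation for the running example, assuming the true class is Cat (index 0):

import numpy as np

probs = np.array([0.659, 0.242, 0.099])   # softmax output from above
y = np.array([1.0, 0.0, 0.0])             # one-hot label: true class is Cat

loss_full = -(y * np.log(probs)).sum()    # full formula
loss_short = -np.log(probs[0])            # shortcut: -log(p_correct)
print(loss_full, loss_short)              # both ~0.417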

The Log Penalty

The negative log function creates a harsh penalty structure:

p_correct    Loss
0.99         0.01  (tiny)
0.90         0.11
0.50         0.69
0.10         2.30
0.01         4.61  (huge!)

The pattern is clear:

  • Confident and correct: Probability near 1 → loss near 0
  • Uncertain: Probability around 0.5 → moderate loss
  • Confident and wrong: Probability near 0 → loss explodes

This asymmetry is the genius of cross-entropy. It severely punishes confident wrong predictions.
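The penalty table above can be reproduced directly:

import numpy as np

for p in [0.99, 0.90, 0.50, 0.10, 0.01]:
    print(p, round(-np.log(p), 2))   # loss = -log(p_correct)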

Numerical Stability

There's a practical problem with naive softmax computation:

  • Overflow: If logits are large, e^z overflows (in floating point, e^1000 evaluates to infinity)
  • Underflow: If logits are very negative, e^z rounds to zero

Solution: Subtract the maximum

A beautiful property of softmax: subtracting the same constant c from every logit leaves the output unchanged, because softmax(z − c) = softmax(z). Choosing c = max(z) makes the largest exponent e^0 = 1, preventing overflow.

This is the log-sum-exp trick, and every deep learning framework uses it automatically.
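A sketch of the stable computation (frameworks fuse this into their softmax and cross-entropy ops, so you rarely write it yourself; stable_softmax is just an illustrative name):

import numpy as np

def stable_softmax(z):
    shifted = z - z.max()     # largest exponent becomes e^0 = 1
    e = np.exp(shifted)
    return e / e.sum()

big_logits = np.array([1000.0, 999.0, 998.1])
# np.exp(1000.0) would overflow to inf; the shifted version does not:
print(stable_softmax(big_logits))   # ~[0.659, 0.242, 0.099]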

The Elegant Gradient

When training, we need the gradient of the loss with respect to logits. The combined softmax + cross-entropy has an elegant gradient:

∂L/∂zᵢ = pᵢ - yᵢ

That's it! The gradient is simply the predicted probability minus the true label.

  • For the correct class: If p_correct = 0.8, gradient is -0.2. Negative → increase this logit.
  • For incorrect classes: If p_wrong = 0.15, gradient is 0.15. Positive → decrease this logit.

The network learns to increase logits for correct classes and decrease logits for incorrect ones.
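A sketch that checks the formula numerically for the running example, assuming the true class is Cat: the analytic gradient p − y matches a central finite-difference estimate of ∂L/∂z.

import numpy as np

def stable_softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_fn(z, y):
    return -(y * np.log(stable_softmax(z))).sum()

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])       # true class: Cat

analytic = stable_softmax(z) - y    # p - y

eps = 1e-5                          # central finite differences
numeric = np.array([
    (loss_fn(z + eps * np.eye(3)[i], y) - loss_fn(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(analytic)   # ~[-0.341, 0.242, 0.099]
print(numeric)    # matches the analytic gradient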

Why This Matters for ML/AI

  • Every classification network uses this: Image, text, token classification—softmax + cross-entropy is the universal ending.
  • Debugging training issues: NaN or infinite loss usually traces back to log(0) or overflowing exponentials in an unstable implementation.
  • Model calibration: Cross-entropy encourages calibrated predictions that reflect true frequencies.
  • Temperature tuning: For generation tasks, temperature controls diversity.

Key Takeaways

  1. Logits are raw scores with no restrictions—what the network actually outputs
  2. Softmax converts logits to probabilities using exponentials and normalization
  3. Exponentials amplify differences, making confident predictions sharper
  4. Cross-entropy loss = -log(p_correct)—only the true class matters
  5. The log penalty is harsh: confident wrong predictions are severely punished
  6. Numerical stability matters: use the stable implementation from your framework
  7. The gradient is beautiful: ∂L/∂z = p - y (predicted minus true)