C4 Neural Network Operations

Activation Functions Gallery

The nonlinear heart of neural networks: sigmoid, tanh, ReLU, Leaky ReLU, and GELU, together with their derivatives.

Why Activation Functions Matter

Imagine stacking linear transformations. Matrix multiply, then another, then another. What do you get? Just one big linear transformation. No matter how many layers, the whole network collapses to a single matrix multiplication.

Layer 1: y = W₁x
Layer 2: z = W₂y = W₂W₁x
Layer 3: out = W₃z = W₃W₂W₁x = Wx  (just one matrix!)

This is useless for learning complex patterns. A line can't fit a curve.
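
A quick NumPy sketch (layer sizes and values chosen arbitrarily for illustration) makes the collapse concrete:

import numpy as np

rng = np.random.default_rng(0)

# Three purely linear "layers"
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((8, 8))
W3 = rng.standard_normal((2, 8))
x = rng.standard_normal(4)

# Applying the layers one after another...
out_stacked = W3 @ (W2 @ (W1 @ x))

# ...matches applying the single pre-multiplied matrix W = W3 W2 W1
W = W3 @ W2 @ W1
print(np.allclose(out_stacked, W @ x))  # True: the stack collapses to one matrix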

Activation functions break this limitation. By inserting a nonlinear function after each layer, we prevent the collapse. Now deep networks can approximate any continuous function—curves, decision boundaries, anything.

The choice of activation function profoundly affects training dynamics, gradient flow, and what the network can learn.

The Classic: Sigmoid

σ(x) = 1 / (1 + e⁻ˣ)

The sigmoid squashes any input into the range (0, 1). Large positive inputs → near 1. Large negative inputs → near 0. It's smooth, differentiable everywhere, and historically was the default choice.

The derivative: σ'(x) = σ(x) · (1 - σ(x))

This elegant form means we can compute the derivative from the output alone.

The problem: when σ(x) saturates near 0 or 1, the derivative becomes tiny, and even at its peak (x = 0) it is only 0.25.

  • Output range: (0, 1)
  • When to use: Binary classification output layers, gates in LSTMs/GRUs
  • When to avoid: Hidden layers in deep networks (vanishing gradient)
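
A minimal NumPy sketch of the sigmoid and its derivative (function names and sample inputs are illustrative):

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Computed from the output alone: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # squashed into (0, 1)
print(sigmoid_grad(x))  # peaks at 0.25 at x = 0, tiny in the saturated tails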

Zero-Centered: Tanh

tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

Tanh is essentially a rescaled sigmoid: tanh(x) = 2σ(2x) - 1. It maps inputs to (-1, 1) instead of (0, 1).

The derivative: tanh'(x) = 1 - tanh²(x)

Why zero-centered matters: Sigmoid outputs are always positive, so the gradients on a layer's incoming weights all share the same sign, forcing inefficient zig-zag updates. Tanh's outputs are centered around zero, so those gradients can take mixed signs.

The problem: Same vanishing gradient issue as sigmoid. For |x| > 2, the derivative approaches zero.

  • Output range: (-1, 1)
  • When to use: Hidden layers when you need bounded, zero-centered outputs; classic RNN and LSTM cells use tanh for their state updates.
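
The same kind of sketch for tanh, reusing NumPy's built-in np.tanh (sample inputs are illustrative):

import numpy as np

def tanh_grad(x):
    # tanh'(x) = 1 - tanh(x)^2, again computable from the output alone
    t = np.tanh(x)
    return 1.0 - t * t

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.tanh(x))    # zero-centered outputs in (-1, 1)
print(tanh_grad(x))  # 1.0 at x = 0, close to zero once |x| > 2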

The Modern Default: ReLU

ReLU(x) = max(0, x)

Rectified Linear Unit. Elegant simplicity: if positive, pass through; if negative, output zero.

The derivative: ReLU'(x) = 1 if x > 0, else 0

Why ReLU revolutionized deep learning:

  1. No vanishing gradient for positive inputs. Derivative is exactly 1.
  2. Sparse activation. Many neurons output zero, making computation efficient.
  3. Fast to compute. Just a comparison, no exponentials.

The problem: Dead neurons. If a neuron's pre-activation is always negative, it outputs zero for every input. On that side the derivative is also zero, so no gradient flows back and the weights stop updating. The neuron "dies" and never recovers.

  • Output range: [0, ∞)
  • When to use: Default for hidden layers in most architectures.
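
A minimal NumPy sketch of ReLU and its gradient (assigning 0 at the kink x = 0 is one common convention):

import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (the kink at x = 0 gets 0 here)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]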

Fixing Dead Neurons: Leaky ReLU

LeakyReLU(x) = x if x > 0, else αx

Where α is a small positive constant, typically 0.01 or 0.1.

The derivative: LeakyReLU'(x) = 1 if x > 0, else α

Instead of outputting zero for negative inputs, Leaky ReLU passes through a scaled-down copy of the input, αx. Because the gradient is never exactly zero, gradients always flow and neurons cannot get stuck dead.

Variants:

  • PReLU (Parametric ReLU): α is learned during training
  • RReLU (Randomized ReLU): α is sampled randomly during training

  • Output range: (-∞, ∞)
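
A NumPy sketch of Leaky ReLU with the common default α = 0.01 (names and inputs are illustrative):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, alpha * x for negative inputs
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # The gradient never hits exactly zero, so neurons can always recover
    return np.where(x > 0, 1.0, alpha)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(leaky_relu(x))       # [-0.04 -0.01  0.    1.    4.  ]
print(leaky_relu_grad(x))  # [0.01 0.01 0.01 1.   1.  ]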

State of the Art: GELU

GELU(x) = x · Φ(x)

Where Φ(x) is the cumulative distribution function of the standard normal distribution.

Practical approximation:

GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))

GELU stands for Gaussian Error Linear Unit. It's the default activation in transformers (BERT, GPT, etc.).

Why GELU works well:

  1. Smooth: no kink at zero, so gradients flow more evenly
  2. Non-monotonic: it dips slightly for negative inputs, bottoming out near -0.17 around x ≈ -0.75
  3. Stochastic interpretation: x · Φ(x) is the expected value of x under a random keep-or-drop gate with keep probability Φ(x), which links it to regularizers like dropout

  • Output range: approximately (-0.17, ∞)
  • When to use: Transformers, modern architectures
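
A sketch comparing the exact definition (via the error function from Python's math module) with the tanh approximation quoted above; the inputs are illustrative:

import numpy as np
from math import erf, sqrt, pi

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
    return np.array([v * 0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])

def gelu_tanh(x):
    # The tanh approximation
    return 0.5 * x * (1.0 + np.tanh(sqrt(2.0 / pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 81)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # the two curves agree to within a few thousandths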

The Vanishing Gradient Problem

Here's why activation function choice matters so much for deep networks.

During backpropagation, gradients multiply through layers:

∂Loss/∂w₁ = ∂Loss/∂out · ∂out/∂h₃ · ∂h₃/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂w₁

Each term includes an activation function's derivative. If each of those derivatives is less than 1, the product shrinks exponentially with depth:

Sigmoid example:

  • Maximum derivative: 0.25
  • After 10 layers: 0.25¹⁰ ≈ 0.000001
  • After 20 layers: effectively zero

ReLU comparison:

  • Derivative for positive inputs: 1
  • After 10 layers: 1¹⁰ = 1
  • After 50 layers: still 1

This is why ReLU and its variants dominate modern architectures.
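
A toy calculation of the arithmetic above, treating each layer as contributing one constant derivative factor (a simplification; real gradients also pass through weight matrices):

def gradient_scale(per_layer_derivative, depth):
    # The factor by which the upstream gradient is scaled after `depth` layers
    return per_layer_derivative ** depth

for depth in (10, 20, 50):
    print(depth,
          gradient_scale(0.25, depth),  # sigmoid's best case: shrinks exponentially
          gradient_scale(1.0, depth))   # ReLU on the active path: stays at 1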

Choosing an Activation Function

  • For hidden layers in most networks: Start with ReLU. It's fast, effective, and well-understood.
  • If you see dead neurons: Try Leaky ReLU or ELU.
  • For transformers and attention models: Use GELU. It's the standard.
  • For output layers (a small sketch follows this list):
    • Binary classification → Sigmoid (gives probability)
    • Multi-class classification → Softmax (gives distribution)
    • Regression → Linear (no activation) or ReLU for positive outputs
  • For RNNs/LSTMs: Tanh for cell states, sigmoid for gates
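
For the output-layer choices, a small sketch of how sigmoid and softmax turn raw scores into probabilities (the logit values are arbitrary):

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))             # a distribution over 3 classes, sums to 1
print(1.0 / (1.0 + np.exp(-1.3)))  # sigmoid of one logit: probability of the positive class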

Key Takeaways

  1. Activation functions add nonlinearity, enabling networks to learn complex patterns.
  2. Sigmoid and tanh suffer from vanishing gradients in deep networks.
  3. ReLU revolutionized deep learning by maintaining gradient flow, but can cause dead neurons.
  4. Leaky ReLU and ELU fix the dead neuron problem by allowing small negative outputs.
  5. GELU is the modern default for transformers, combining smoothness with ReLU-like behavior.
  6. The derivative matters more than the function itself for training dynamics.
  7. Deep networks require activation functions with derivatives near 1 for stable training.