B4 Calculus Essentials

Chain Rule Visualiser

Nested functions as a pipeline. Animate the derivatives multiplying through each stage; this is the heart of backprop.

1. Function Composition: The Pipeline

Before we tackle the chain rule, let's make sure we're solid on function composition. When you write f(g(x)), you're describing a two-step process:

  1. First, g takes x and produces some output
  2. Then, f takes that output and produces the final result

Think of it as an assembly line. The raw material x enters the first machine (g), which transforms it into an intermediate product g(x). That intermediate product then enters the second machine (f), which transforms it into the final output f(g(x)).

For example, if g(x) = 3x + 1 and f(u) = u², then: f(g(x)) = f(3x + 1) = (3x + 1)²
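
To make the two steps concrete, here is a minimal Python sketch of this particular f and g (the function names simply mirror the math above):

  def g(x):          # inner function: 3x + 1
      return 3 * x + 1

  def f(u):          # outer function: u squared
      return u ** 2

  def h(x):          # the composition f(g(x))
      return f(g(x))

  print(g(2))        # 7
  print(h(2))        # f(7) = 49, i.e. (3*2 + 1)**2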

This might seem like a simple concept, but it's the foundation of everything that follows. Neural networks are nothing more than deeply nested function compositions—dozens or hundreds of functions applied in sequence.

2. The Pipeline Metaphor

Let's extend our assembly line into a full pipeline visualization. Imagine data flowing through a series of tubes:

x → [g] → g(x) → [f] → f(g(x))

Each box is a function that processes whatever flows into it. This is exactly how a neural network processes data:

input → [Layer 1] → [Layer 2] → [Layer 3] → ... → output

The key insight is that we can make this pipeline arbitrarily long. Three functions? No problem:

x → [h] → h(x) → [g] → g(h(x)) → [f] → f(g(h(x)))

Each stage does its own transformation, blissfully unaware of what came before or what comes after.
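
As a rough sketch, the whole pipeline can be packaged as a single function; the compose helper below is purely illustrative, not part of any particular library:

  from functools import reduce

  def compose(*stages):
      # Return one function that applies the stages left to right.
      return lambda x: reduce(lambda value, stage: stage(value), stages, x)

  h = lambda x: x + 1        # first machine
  g = lambda u: 3 * u        # second machine
  f = lambda v: v ** 2       # third machine

  pipeline = compose(h, g, f)   # x -> [h] -> [g] -> [f]
  print(pipeline(2))            # f(g(h(2))) = (3 * 3) ** 2 = 81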

3. The Problem: Sensitivity Through the Chain

Here's the question that leads us to the chain rule:

If I wiggle the input x by a tiny amount, how much does the final output wiggle?

This is asking for the derivative of the composite function. But here's the challenge: the effect of changing x has to propagate through every stage of the pipeline. A small change in x causes a change in g(x), which in turn causes a change in f(g(x)).

Let's think about this concretely. Suppose:

  • When x changes by 1, g(x) changes by 2 (so g'(x) = 2)
  • When g's output changes by 1, f's output changes by 3 (so f'(g(x)) = 3)

If x increases by a tiny amount Δx:

  • g(x) increases by approximately 2·Δx
  • f(g(x)) increases by approximately 3·(2·Δx) = 6·Δx

The overall amplification is 2 × 3 = 6. The derivatives multiply.
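
You can check this numerically with a finite difference. The sketch below assumes the simplest functions with these slopes, g(x) = 2x and f(u) = 3u:

  def g(x):
      return 2 * x           # slope 2 everywhere

  def f(u):
      return 3 * u           # slope 3 everywhere

  x, dx = 1.0, 1e-6
  before = f(g(x))
  after = f(g(x + dx))
  print((after - before) / dx)   # ~6.0: the x2 and x3 amplifications multiply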

4. Intuitive Chain Rule: Derivatives Multiply Through

This is the core insight of the chain rule, and it's worth burning into your brain:

When functions are composed, their derivatives multiply.

Think about why this makes sense. Each function in the chain either amplifies or dampens changes passing through it. If the first function doubles everything (derivative = 2) and the second function triples everything (derivative = 3), then the overall effect is to multiply by 6.

It's like compound interest, or gear ratios. If one gear doubles the rotation speed and the next triples it, the total speedup is 6×.

5. The Chain Rule Formula

Now we can state the chain rule precisely:

d/dx[f(g(x))] = f'(g(x)) · g'(x)

In words: "The derivative of a composition equals the derivative of the outer function (evaluated at the inner function) times the derivative of the inner function."

Some people remember this as "outer derivative times inner derivative." Others use the mnemonic "derivative of the outside, leave the inside alone, times derivative of the inside."

There's an equivalent notation using Leibniz's dy/dx style. If we let u = g(x), then:

dy/dx = (dy/du) · (du/dx)

This looks like fraction multiplication where the du's "cancel." They don't actually cancel, but the notation is suggestive and helps many people remember the rule.
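
If you have SymPy installed, you can confirm the formula symbolically for a concrete pair of functions (the choice of f and g here is arbitrary):

  import sympy as sp

  x = sp.Symbol('x')
  u = sp.Symbol('u')
  g = 3 * x + 1                 # inner function
  f = u ** 2                    # outer function

  composite = f.subs(u, g)                          # f(g(x)) = (3x + 1)**2
  lhs = sp.diff(composite, x)                       # d/dx f(g(x))
  rhs = sp.diff(f, u).subs(u, g) * sp.diff(g, x)    # f'(g(x)) * g'(x)
  print(sp.simplify(lhs - rhs))                     # 0, so the two sides agree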

6. Step-by-Step Example

Let's work through a complete example. Consider: h(x) = (3x + 1)²

Step 1: Identify the composition

  • Outer: f(u) = u²
  • Inner: g(x) = 3x + 1

Step 2: Find the individual derivatives

  • f(u) = u² → f'(u) = 2u
  • g(x) = 3x + 1 → g'(x) = 3

Step 3: Apply the chain rule

h'(x) = f'(g(x)) · g'(x) = 2(3x + 1) · 3 = 6(3x + 1)

Step 4: Verify (optional)

Expand h(x) = (3x + 1)² = 9x² + 6x + 1, then h'(x) = 18x + 6 = 6(3x + 1) ✓
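
As one more sanity check, a central-difference estimate at a sample point should land on the same value; the sketch below evaluates at x = 2, where 6(3x + 1) = 42:

  def h(x):
      return (3 * x + 1) ** 2

  x, dx = 2.0, 1e-6
  numeric = (h(x + dx) - h(x - dx)) / (2 * dx)   # central difference
  exact = 6 * (3 * x + 1)                        # chain-rule result
  print(numeric, exact)                          # both ~42.0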

7. Multiple Links in the Chain

The chain rule extends naturally to longer compositions. For three functions:

d/dx[f(g(h(x)))] = f'(g(h(x))) · g'(h(x)) · h'(x)

Just keep multiplying. Each derivative is evaluated at its input—the value that actually flows into that stage of the pipeline.

For a chain of n functions:

d/dx[f₁(f₂(f₃(...fₙ(x)...)))] = f₁' · f₂' · f₃' · ... · fₙ', with each derivative evaluated at the value that flows into its function.

This is beautiful in its simplicity. No matter how long the chain, you just multiply all the derivatives together.
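
In code, that amounts to one pass along the pipeline, recording each stage's local derivative at the value that reaches it and multiplying as you go. The sketch below hand-writes each (function, derivative) pair rather than using a real autodiff library:

  import math

  # Each stage is (function, derivative), listed in the order data flows.
  stages = [
      (lambda x: 3 * x + 1, lambda x: 3.0),              # innermost
      (lambda u: u ** 2, lambda u: 2 * u),
      (lambda v: math.sin(v), lambda v: math.cos(v)),    # outermost
  ]

  def value_and_derivative(x, stages):
      grad = 1.0
      for fn, dfn in stages:
          grad *= dfn(x)     # local derivative, evaluated at this stage's input
          x = fn(x)          # then pass the value on to the next stage
      return x, grad

  y, dy_dx = value_and_derivative(0.5, stages)
  print(y, dy_dx)            # sin((3*0.5 + 1)**2) and its derivative at x = 0.5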

8. Why This Matters for Machine Learning

The chain rule isn't just another calculus technique—it's the mathematical foundation of how neural networks learn.

Neural Networks Are Function Compositions

A neural network is literally a composition of functions:

input → [Layer 1] → [Layer 2] → ... → [Layer L] → prediction → [Loss]

The entire network, from input to loss, is one giant composite function.

The Learning Problem

Training a neural network means adjusting the weights to minimize the loss. To do this with gradient descent, we need to know: how does each weight affect the loss?

But any given weight in Layer 1 affects the loss only indirectly. Changing that weight changes Layer 1's output, which changes Layer 2's output... eventually affecting the loss.

Backpropagation Is the Chain Rule

Backpropagation—the algorithm that makes deep learning possible—is just the chain rule applied systematically.

During the forward pass, data flows from input to output. During the backward pass, gradients flow from output to input:

∂Loss/∂input ← [Layer 1] ← [Layer 2] ← ... ← [Layer L] ← ∂Loss/∂Loss = 1

At each layer, we multiply by that layer's local derivative. The gradient of the loss with respect to a weight in an early layer is the product of all the local derivatives along the path from that weight to the loss, which is exactly what the chain rule says.
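
Here is a minimal sketch of that backward sweep for a toy two-layer network with scalar weights. The architecture and names are made up for illustration, and only the gradient for w1 is shown; the other weight works the same way:

  import math

  def forward(x, w1, w2):
      a = w1 * x                 # Layer 1
      h = math.tanh(a)           # Layer 1 activation
      y = w2 * h                 # Layer 2 (the prediction)
      loss = (y - 1.0) ** 2      # squared error against a target of 1.0
      return a, h, y, loss

  def backward(x, w1, w2):
      a, h, y, loss = forward(x, w1, w2)
      dloss_dy = 2 * (y - 1.0)           # local derivative of the loss
      dy_dh = w2                         # local derivative of Layer 2
      dh_da = 1 - math.tanh(a) ** 2      # local derivative of tanh
      da_dw1 = x                         # how w1 enters Layer 1
      # Chain rule: multiply the local derivatives along the path to w1.
      dloss_dw1 = dloss_dy * dy_dh * dh_da * da_dw1
      return loss, dloss_dw1

  print(backward(x=0.5, w1=0.8, w2=1.5))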

9. The Computational Graph View

There's a powerful visual way to think about the chain rule: the computational graph.

A computational graph represents a computation as a network of nodes and edges:

  • Nodes are operations (add, multiply, square, apply activation function, etc.)
  • Edges carry values from one operation to the next

During the forward pass, values flow left to right. During the backward pass, gradients flow right to left, getting multiplied at each node.

This is exactly how automatic differentiation frameworks like PyTorch and TensorFlow work. When you call .backward() in PyTorch (or ask a TensorFlow GradientTape for gradients), the framework traverses the graph in reverse, applying the chain rule at each node.
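
A tiny PyTorch example shows the mechanism, assuming torch is installed (the function being differentiated is just the worked example from earlier):

  import torch

  x = torch.tensor(2.0, requires_grad=True)
  y = (3 * x + 1) ** 2       # forward pass builds the graph
  y.backward()               # backward pass applies the chain rule node by node
  print(x.grad)              # tensor(42.), matching 6 * (3*2 + 1)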

Wrapping Up

The chain rule is deceptively simple: when functions are composed, their derivatives multiply. But this simple rule is the engine that powers all of deep learning.

When you train a neural network, you're asking: "How does each weight affect the final loss?" The answer requires tracing through every layer of the network, accounting for how changes propagate through each transformation. The chain rule tells you exactly how to do this: multiply the local derivatives together.

Backpropagation is just the chain rule applied efficiently. The "backward" pass flows gradient information from the loss back to each weight, multiplying by local derivatives at every step.

In your visualization, watch how a small change at the input creates a cascade of changes through the pipeline. Each stage amplifies or dampens the effect. The final sensitivity—the derivative of the composition—is the product of all those individual sensitivities.

The chain rule transforms a daunting question into a series of simple local questions, combined by multiplication. That's the magic that makes deep learning work.