1. Adding Functions: Stacking Heights
Before diving into rules, let's be clear about what it means to add functions.
When we write h(x) = f(x) + g(x), we're saying: "At every x, add the height of f to the height of g."
For example, if f(x) = x² and g(x) = sin(x), then:
At x = 2: h(2) = 4 + sin(2) ≈ 4.91
The graph of h is literally the graphs of f and g stacked on top of each other. Every point on h is the sum of the corresponding points on f and g.
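If you'd like to see this numerically, here is a minimal Python sketch (NumPy is just a convenience here; the names f, g, h mirror the text):

```python
import numpy as np

# Pointwise addition: the height of h at any x is f's height plus g's height.
f = lambda x: x**2
g = lambda x: np.sin(x)
h = lambda x: f(x) + g(x)

print(h(2.0))               # 4 + sin(2) ≈ 4.91

xs = np.linspace(0.0, 4.0, 5)
print(f(xs) + g(xs))        # the two graphs "stacked" at each sample point
```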
2. The Sum Rule
Here's the good news: differentiating sums is exactly as simple as you'd hope.
The Rule
For h(x) = f(x) + g(x):
h'(x) = f'(x) + g'(x)
Or more compactly: (f + g)' = f' + g'
That's it. Differentiate each piece, then add the results.
The Intuition: Slopes Add Up
Think about what a derivative measures — the rate of change, or slope.
If you're walking along h(x) = f(x) + g(x), your height is changing for two reasons:
- f is changing at rate f'(x)
- g is changing at rate g'(x)
Your total rate of change is simply the sum of these individual rates.
Concrete example: Imagine f is increasing at 2 units per second, and g is increasing at 3 units per second. How fast is h = f + g increasing? Obviously 5 units per second. The rates just add.
Quick Example
Differentiate h(x) = x³ + eˣ
- d/dx[x³] = 3x²
- d/dx[eˣ] = eˣ
Therefore: h'(x) = 3x² + eˣ
No tricks, no complications. Just differentiate and add.
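As a sanity check, you can compare the symbolic answer against a finite-difference estimate. This is just an illustrative sketch; the step size eps and sample point are arbitrary choices:

```python
import math

def h(x):
    return x**3 + math.exp(x)

def h_prime(x):
    # Sum rule: differentiate each term, then add.
    return 3 * x**2 + math.exp(x)

def numeric_derivative(func, x, eps=1e-6):
    # Central difference approximation of the derivative.
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 1.5
print(h_prime(x))                # ≈ 11.2317
print(numeric_derivative(h, x))  # agrees to several decimal places
```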
3. The Constant Multiple Rule
What if you scale a function by a constant?
The Rule
For h(x) = c · f(x), where c is a constant:
h'(x) = c · f'(x)
Or: (cf)' = c · f'
Constants "pass through" the derivative unchanged.
The Intuition: Scaling Steepness
If you double a function, you double its steepness everywhere.
Think of it graphically: h(x) = 2f(x) stretches f vertically by a factor of 2. Every slope gets twice as steep. A slope of 3 becomes 6. A slope of -1 becomes -2.
Example: If f(x) = sin(x), then f'(x) = cos(x).
For h(x) = 5·sin(x): h'(x) = 5·cos(x)
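Here's a quick symbolic check with SymPy (assuming it's available; nothing above depends on it):

```python
import sympy as sp

x = sp.symbols('x')

# The constant 5 passes straight through the derivative.
print(sp.diff(5 * sp.sin(x), x))   # 5*cos(x)

# Doubling a function doubles its slope everywhere.
print(sp.diff(2 * x**3, x))        # 6*x**2, i.e. 2 * (3*x**2)
```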
Special Case: Negatives
What about h(x) = -f(x)? This is just c = -1:
h'(x) = -f'(x)
Negating a function negates its derivative. Uphill becomes downhill, downhill becomes uphill.
4. The Difference Rule
With the sum rule and the constant multiple rule, we get the difference rule for free:
(f - g)' = f' - g'
Why? Because f - g is really f + (-1)·g:
(f - g)' = (f + (-1)·g)' = f' + (-1)·g' = f' - g'
Example: Differentiate h(x) = x⁴ - cos(x)
- d/dx[x⁴] = 4x³
- d/dx[cos(x)] = -sin(x)
Therefore: h'(x) = 4x³ - (-sin(x)) = 4x³ + sin(x)
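Again, a finite-difference check confirms the sign handling; the sample point x = 0.7 is arbitrary:

```python
import math

def h(x):
    return x**4 - math.cos(x)

def h_prime(x):
    # Difference rule: 4x^3 - (-sin x) = 4x^3 + sin x
    return 4 * x**3 + math.sin(x)

x = 0.7
numeric = (h(x + 1e-6) - h(x - 1e-6)) / (2e-6)
print(h_prime(x), numeric)   # both ≈ 2.016
```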
5. Linearity: The Big Picture
The sum rule and constant multiple rule combine into one powerful property called linearity.
The Linear Combination Rule
For any constants a and b:
d/dx[a·f(x) + b·g(x)] = a·f'(x) + b·g'(x)
This extends to any number of terms:
d/dx[a₁f₁(x) + a₂f₂(x) + ... + aₙfₙ(x)] = a₁f₁'(x) + a₂f₂'(x) + ... + aₙfₙ'(x)
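You can verify linearity symbolically; the functions sin(x) and eˣ below are arbitrary stand-ins (SymPy assumed):

```python
import sympy as sp

x, a, b = sp.symbols('x a b')
f = sp.sin(x)
g = sp.exp(x)

lhs = sp.diff(a * f + b * g, x)              # differentiate the combination
rhs = a * sp.diff(f, x) + b * sp.diff(g, x)  # combine the derivatives
print(sp.simplify(lhs - rhs))                # 0: both orders agree
```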
What "Linear Operator" Means
Mathematicians say the derivative is a linear operator. This means it satisfies two properties:
- Additivity: (f + g)' = f' + g'
- Homogeneity: (cf)' = c · f'
Any operation satisfying both is linear. The derivative behaves "nicely" — it respects the structure of addition and scaling.
6. Why This Is Wonderful
Linearity is what makes differentiation tractable. Here's the practical payoff:
Break It Down, Build It Up
You can decompose any polynomial (or sum of functions) into pieces:
h(x) = 3x⁴ - 2x² + 7x - 5
Differentiate each term separately:
- d/dx[3x⁴] = 12x³
- d/dx[-2x²] = -4x
- d/dx[7x] = 7
- d/dx[-5] = 0
Combine: h'(x) = 12x³ - 4x + 7
No matter how long the sum, you just handle each piece independently.
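Term-by-term differentiation is easy to mechanize. Here's a small sketch that represents a polynomial as a coefficient list (highest power first; the representation is just a convenient choice for illustration):

```python
def poly_derivative(coeffs):
    """Differentiate a polynomial given as [c_n, ..., c_1, c_0] (highest power first)."""
    n = len(coeffs) - 1  # degree of the polynomial
    # Power rule plus linearity: d/dx[c * x^k] = c*k*x^(k-1); the constant term drops out.
    return [c * (n - i) for i, c in enumerate(coeffs[:-1])]

# 3x^4 - 2x^2 + 7x - 5  ->  12x^3 - 4x + 7
print(poly_derivative([3, 0, -2, 7, -5]))   # [12, 0, -4, 7]
```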
Constants Vanish
Notice that d/dx[-5] = 0. The derivative of any constant is zero — constants don't change, so their rate of change is zero.
7. Why This Matters for ML/AI
Linearity isn't just convenient for homework — it's fundamental to how neural networks learn.
Neural Networks Are Sums
A basic neuron computes:
y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
This is a weighted sum — exactly the kind of expression where linearity shines. When computing gradients, each term contributes independently.
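Because the output is a sum, each parameter's gradient can be read off its own term. A tiny NumPy sketch (the numbers are made up):

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])   # weights
x = np.array([1.0, 3.0, -0.5])   # inputs
b = 0.1                          # bias

y = np.dot(w, x) + b             # weighted sum: w1*x1 + w2*x2 + w3*x3 + b

# Linearity: each term contributes its own gradient, independent of the others.
grad_w = x        # dy/dw_i = x_i
grad_b = 1.0      # dy/db = 1

print(y, grad_w, grad_b)
```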
Gradients Distribute Over Addition
During backpropagation, when you hit an addition node z = x + y, the gradient flows through unchanged to both inputs:
∂L/∂x = ∂L/∂z and ∂L/∂y = ∂L/∂z
This is the sum rule in action. The upstream gradient copies directly to both branches. No transformation, no scaling — addition just passes gradients through.
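You can watch this happen with PyTorch's autograd (assuming torch is installed; the factor 4 just stands in for an upstream gradient):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

z = x + y        # addition node in the computation graph
loss = 4 * z     # pretend the upstream gradient ∂L/∂z is 4

loss.backward()
print(x.grad, y.grad)   # tensor(4.) tensor(4.) -- the gradient copies to both branches
```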
Sum Nodes Are "Free" in Backprop
Computationally, sum nodes are trivial during backpropagation. Compare:
- Multiplication: Requires the product rule — gradient depends on both inputs
- Addition: Gradient just copies through — no computation needed
Residual Connections: The Secret Weapon
ResNets (Residual Networks) revolutionized deep learning with a simple idea:
Instead of learning a full transformation directly, learn a residual F(x) and add it to the input, so the block outputs F(x) + x.
Why does this help? Linearity. The gradient with respect to x flows through two paths:
- Through F(x) — may be complex, may vanish
- Through the direct connection (+x) — always passes through unchanged
That "+x" creates a gradient superhighway. Even if F's gradients vanish, the identity path preserves gradient flow. This is why ResNets can train hundreds of layers deep while plain networks fail at 20 layers.
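A toy illustration of the gradient superhighway (PyTorch assumed; F below is a deliberately weak stand-in for a residual branch, not a real ResNet block):

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

def F(x):
    # Stand-in residual branch whose gradient is deliberately tiny (~1e-4).
    return 1e-4 * torch.tanh(x)

y = F(x) + x     # residual connection: residual plus identity
y.backward()

# The F branch contributes ~1e-4 to dy/dx, but the "+x" path always contributes
# exactly 1, so the total gradient stays close to 1 instead of vanishing.
print(x.grad)    # ≈ 1.0000
```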
Summary
Sum Rule: (f + g)' = f' + g' — differentiate and add.
Constant Multiple Rule: (cf)' = c · f' — constants pass through.
Difference Rule: (f - g)' = f' - g' — follows from above.
Linearity: d/dx[af + bg] = af' + bg' — the derivative respects linear combinations.
Why it matters:
- Break complex functions into simple pieces
- Differentiate each piece independently
- Combine the results
In ML: Gradients distribute over sums effortlessly. Addition nodes in neural networks pass gradients through unchanged. Residual connections exploit this to enable training of extremely deep networks.