# On Computing Derivatives

Wed, 18 Aug 2021

In this blog post, I review techniques for computing $\frac{\partial}{\partial x_i} f$ for $f: \mathbb{R}^n \to \mathbb{R}^m$. Broadly, there are three techniques: numerical, symbolic, and automatic differentiation. Among them, the most remarkable and useful is Automatic Differentiation (AD) in its reverse (or adjoint) mode. I will show you how, using first principles, you could have derived it yourself. (Expect a few short code sketches, and a lot of equations.)

In practice, every time you face an optimization problem, you will have to compute a derivative. Gradient-descent techniques require the first-order derivative, and Newton's method requires a second-order one. If you still need motivation to carry on, consider that training a neural network is an optimization problem. The de facto neural network training algorithm is Backpropagation: it computes the loss's derivative with respect to every weight, updates the weights layer by layer in reverse, rinses, and repeats... And where does AD come into play? Well, Backpropagation is a special case of AD; more precisely, it is an optimized implementation of the reverse-mode AD I mentioned earlier. So stick around if you want to know its origins and why you propagate backward.

Some of my sources and recommended readings are:

## Numerical Differentiation

First, a foreword: in this section, I will only consider $f: \mathbb{R} \to \mathbb{R}$ that is differentiable as many times as each scheme requires. But the techniques generalize readily to any $f: \mathbb{R}^n \to \mathbb{R}^m$.

To compute derivatives, the first idea is to use the mathematical definition of the derivative $$f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$$ We can choose a small $\varepsilon > 0$ and compute $\frac{f(x+\varepsilon)-f(x)}{\varepsilon}$.
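To make this concrete, here is a minimal Python sketch of the forward-difference scheme (the name `forward_diff` and the sample function are my own illustrative choices, not from any library):

```python
def forward_diff(f, x, eps=1e-6):
    # One-sided (forward) difference quotient: (f(x + eps) - f(x)) / eps
    return (f(x + eps) - f(x)) / eps

# Example: for f(x) = x^2, the exact derivative at x = 3 is 6.
approx = forward_diff(lambda x: x * x, 3.0)
print(approx)  # close to 6, off by an O(eps) truncation error
```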

This is an approximation and not an exact value, which raises the question: how good is this approximation? Using Taylor's theorem, we can answer it. We know that $$f(x) = f(x_0) + f'(x_0) (x-x_0) + O\left((x-x_0)^2\right)$$ This $O(\cdot)$ is the big-O notation that we computer scientists know and love. Setting $x \to x+\varepsilon$ and $x_0 \to x$, we get $$f(x+\varepsilon) = f(x) + f'(x)\varepsilon + O(\varepsilon^2)$$ and after rearranging: $$f'(x) = \frac{f(x+\varepsilon)-f(x)}{\varepsilon} + O(\varepsilon)$$ which means that our error is proportional to $\varepsilon$; cutting $\varepsilon$ in half will cut the error in half too.

Can we approximate better? Luckily, a fancy trick with a fancy name comes to the rescue: the central differencing scheme. It tells us to use the following approximation: $$f'(x) \approx \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}$$ To analyze its error, we use Taylor's theorem twice. Once with $x \to x+\varepsilon$ and $x_0 \to x$, and once again with $x \to x-\varepsilon$ and $x_0 \to x$. This leads to these two identities: $$f(x+\varepsilon) = f(x) + f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 + O(\varepsilon^3)$$ $$f(x-\varepsilon) = f(x) - f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 + O(\varepsilon^3)$$ Subtracting the second identity from the first and rearranging we get: $$f'(x) = \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon} + O(\varepsilon^2)$$ which means that our error decreases quadratically with $\varepsilon$; cutting $\varepsilon$ in half will cut the error by four!
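A small Python experiment (my own sketch) makes the quadratic behavior visible: halving $\varepsilon$ cuts the error of the central scheme by roughly four.

```python
import math

def central_diff(f, x, eps):
    # Symmetric (central) difference quotient: O(eps^2) error
    return (f(x + eps) - f(x - eps)) / (2 * eps)

exact = math.cos(1.0)  # d/dx sin(x) at x = 1
err      = abs(central_diff(math.sin, 1.0, 1e-3) - exact)
err_half = abs(central_diff(math.sin, 1.0, 5e-4) - exact)
print(err / err_half)  # roughly 4
```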

Can we still do better? I will guide you through a scheme that has an error of $O(\varepsilon^4)$ which I hope will show you how to generalize this technique without having me spell it out. Somehow symmetry plays a role in improving the error, so inspired by the central differencing scheme, let's apply Taylor's theorem at these four points: $x-2\varepsilon$, $x-\varepsilon$, $x+\varepsilon$, and $x + 2\varepsilon$. We get: $$f(x+2\varepsilon) = f(x) + 2f'(x)\varepsilon + 2f''(x)\varepsilon^2 + \frac{4f'''(x)}{3}\varepsilon^3 + \frac{2f''''(x)}{3}\varepsilon^4 + O(\varepsilon^5)$$ $$f(x+\varepsilon) = f(x) + f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 + \frac{f'''(x)}{6}\varepsilon^3 + \frac{f''''(x)}{24}\varepsilon^4 + O(\varepsilon^5)$$ $$f(x-\varepsilon) = f(x) - f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 - \frac{f'''(x)}{6}\varepsilon^3 + \frac{f''''(x)}{24}\varepsilon^4 + O(\varepsilon^5)$$ $$f(x-2\varepsilon) = f(x) - 2f'(x)\varepsilon + 2f''(x)\varepsilon^2 - \frac{4f'''(x)}{3}\varepsilon^3 + \frac{2f''''(x)}{3}\varepsilon^4 + O(\varepsilon^5)$$ Let's multiply the first by $-\frac{1}{12}$, the second by $\frac{8}{12}$, the third by $-\frac{8}{12}$, and the fourth by $\frac{1}{12}$. Then if we add them all together and rearrange, we get: $$f'(x) = \frac{f(x-2\varepsilon)-8f(x-\varepsilon)+8f(x+\varepsilon)-f(x+2\varepsilon)}{12\varepsilon} + O(\varepsilon^4)$$ meaning that cutting $\varepsilon$ in half will cut the error by sixteen!
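The same experiment in Python (again a sketch of mine) confirms the fourth-order behavior of this five-point scheme:

```python
import math

def five_point_diff(f, x, eps):
    # Five-point stencil with O(eps^4) error
    return (f(x - 2*eps) - 8*f(x - eps) + 8*f(x + eps) - f(x + 2*eps)) / (12 * eps)

exact = math.cos(1.0)  # d/dx sin(x) at x = 1
err      = abs(five_point_diff(math.sin, 1.0, 1e-2) - exact)
err_half = abs(five_point_diff(math.sin, 1.0, 5e-3) - exact)
print(err / err_half)  # roughly 16
```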

If you can figure out where these $\pm\frac{1}{12}$ and $\pm\frac{8}{12}$ magic numbers come from then you can generalize this technique to your heart's desire and reduce the error as much as you want; as low as $O(\varepsilon^{64})$ if you choose to. The astute reader can generalize this technique even further to reach Richardson extrapolation.

The first criticism is that this scheme will always yield an approximation and never an exact value. That criticism becomes critical when we consider actual implementations using floating-point numbers. If we choose $\varepsilon$ to be small while we're interested in computing the derivative at a relatively large $x$ (orders of magnitude larger than $\varepsilon$), then we'll have to add a small number to a large number: the first capital sin. Moreover, if our function is smooth enough, then $f(x-\varepsilon)$ and $f(x+\varepsilon)$ will be close to each other, and we end up subtracting numbers of similar magnitude: the second capital sin. Both are well-known problems that should be avoided when dealing with floating-point arithmetic. These two problems and some more are highlighted in Losing My Precision. And if that is not enough, choosing too small an $\varepsilon$ leads us to commit Deadly Sin #4: Ruination by Rounding (explained in McCartin's Seven Deadly Sins of Numerical Computation (1998), doi:10.1080/00029890.1998.12004989, which you should definitely not use sci-hub to read for free). In short, a tiny $\varepsilon$ reduces the truncation error (the approximation error) but spikes the round-off error (the rounding error due to floating-point representation). And to make matters even worse, quoting from that same paper, "The perplexing facet of this phenomenon is that the precise determination of $h_{opt}$ [the optimal $\varepsilon$ that reaches an equilibrium between truncation and round-off errors] is impossible in practice."
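You can watch this truncation-versus-round-off tug-of-war in a few lines of Python (a sketch; the particular $\varepsilon$ values are illustrative):

```python
import math

def central_diff(f, x, eps):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

exact = math.cos(1.0)  # d/dx sin(x) at x = 1
# A moderate eps balances truncation and round-off reasonably well;
# a tiny eps is ruined by rounding despite its negligible truncation error.
err_moderate = abs(central_diff(math.sin, 1.0, 1e-5) - exact)
err_tiny     = abs(central_diff(math.sin, 1.0, 1e-13) - exact)
print(err_moderate, err_tiny)  # err_tiny is orders of magnitude larger
```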

The second criticism is that computing $f'(x)$ (for an error of $O(\varepsilon^4)$) calls the function $f$ four times. If $f$ is expensive to call, computing the derivative becomes quadruply expensive. What if $f$ were a deep neural network with a million inputs? In other words, if $f: \mathbb{R}^{1,000,000} \to \mathbb{R}$, then learning via gradient descent requires computing $\frac{\partial}{\partial x_i}f$ for all million parameters, resulting in 4,000,000 evaluations of the neural network for a single optimization iteration.
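To see where that factor of $n$ comes from, here is a sketch of a numerical gradient for $f: \mathbb{R}^n \to \mathbb{R}$ (using the central scheme, so $2n$ evaluations; the $O(\varepsilon^4)$ scheme would need $4n$; the helper name is my own):

```python
def numerical_gradient(f, x, eps=1e-6):
    # Central-difference estimate of each partial derivative:
    # every component of the gradient costs two extra calls to f.
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad

# f(x, y) = x^2 + 3y has gradient (2x, 3); at (2, 5) that is (4, 3).
g = numerical_gradient(lambda v: v[0] ** 2 + 3 * v[1], [2.0, 5.0])
print(g)  # approximately [4.0, 3.0]
```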

## Symbolic Differentiation

The previous section highlighted the two downsides of numerical methods: they only approximate, and they require multiple function evaluations per input variable. Symbolic differentiation attempts to eliminate both concerns, with a slight caveat mentioned at the end. Whereas numerical methods operate at runtime, symbolic differentiation is (generally) done at compile time.

Numerical methods were inspired by the definition of the derivative. Symbolic methods are inspired by the closed-form rules of differentiation. What do I mean? We all learned in high school that $\left(x^n\right)' = nx^{n-1}$, that $(uv)' = u'v + uv'$, etc... Notice that these rules do not involve any $\varepsilon$ and are exact: we use $=$, not $\approx$. So let's make use of them, alongside the insight that a program is just a complicated composition of basic operations (multiplication, addition, function composition, etc.) whose derivatives we know.

Provided a program encoding a mathematical expression, a compiler manipulates its symbols (much like a high-school student blindly applying the rules of differentiation) to generate another program that computes the derivative in that exact way.

In a basic programming language with a syntax defined like:

```
t ::= c          (c is a real-number constant)
    | x          (variables)
    | t1 + t2
    | t1 - t2
    | t1 * t2
    | t ^ c      (power with a constant exponent)
    | log t
    | exp t
    | sin t
    | cos t
```

we can define a recursive function `d` which, given the variable `x` to differentiate with respect to, operates on a syntax tree to produce another syntax tree representing the derivative. In an ML-like syntax, we can define this function like so:

```
d x c         = 0
d x x         = 1
d x y         = 0    (y is any variable other than x)
d x (t1 + t2) = (d x t1) + (d x t2)
d x (t1 - t2) = (d x t1) - (d x t2)
d x (t1 * t2) = t1 * (d x t2) + (d x t1) * t2
d x (t ^ c)   = c * t ^ (c-1) * (d x t)
d x (log t)   = (d x t) * t ^ -1
d x (exp t)   = (d x t) * (exp t)
d x (sin t)   = (d x t) * (cos t)
d x (cos t)   = (d x t) * -1 * (sin t)
```
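These rules translate almost line-for-line into a runnable sketch. Below is my own Python rendition (the tuple-based term representation and the helper `evaluate` are illustrative choices, not part of the formal definition above):

```python
import math

# Terms as nested tuples: ('const', c), ('var', name), ('+', t1, t2),
# ('-', t1, t2), ('*', t1, t2), ('^', t, c), ('log', t), ('exp', t),
# ('sin', t), ('cos', t).

def d(x, t):
    # Return the syntax tree of the derivative of term t with respect to x.
    op = t[0]
    if op == 'const':
        return ('const', 0)
    if op == 'var':
        return ('const', 1) if t[1] == x else ('const', 0)
    if op in ('+', '-'):
        return (op, d(x, t[1]), d(x, t[2]))
    if op == '*':
        return ('+', ('*', t[1], d(x, t[2])), ('*', d(x, t[1]), t[2]))
    if op == '^':
        return ('*', ('*', ('const', t[2]), ('^', t[1], t[2] - 1)), d(x, t[1]))
    if op == 'log':
        return ('*', d(x, t[1]), ('^', t[1], -1))
    if op == 'exp':
        return ('*', d(x, t[1]), t)
    if op == 'sin':
        return ('*', d(x, t[1]), ('cos', t[1]))
    if op == 'cos':
        return ('*', ('*', d(x, t[1]), ('const', -1)), ('sin', t[1]))
    raise ValueError(f'unknown operator {op!r}')

def evaluate(t, env):
    # Interpret a syntax tree given variable bindings in env.
    op = t[0]
    if op == 'const': return t[1]
    if op == 'var':   return env[t[1]]
    if op == '+':     return evaluate(t[1], env) + evaluate(t[2], env)
    if op == '-':     return evaluate(t[1], env) - evaluate(t[2], env)
    if op == '*':     return evaluate(t[1], env) * evaluate(t[2], env)
    if op == '^':     return evaluate(t[1], env) ** t[2]
    return getattr(math, op)(evaluate(t[1], env))  # log, exp, sin, cos

# d/dx (x^2 * sin x) = 2x sin x + x^2 cos x; at x = 1 that is 2 sin 1 + cos 1.
term = ('*', ('^', ('var', 'x'), 2), ('sin', ('var', 'x')))
val = evaluate(d('x', term), {'x': 1.0})
```

Note that `d` happily produces unsimplified trees full of multiplications by 0 and 1; a real symbolic engine would simplify as it goes.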