Derivatives are Linear

Studying artificial neural networks involves math that is way above me, but I’ve been plowing through and learning the prerequisite math as I go. A few days ago, I became really excited when I realized that the linearity of differentiation makes differentiating dot products naturally simple.

Sums

This is one of the first rules we learn about taking derivatives, usually by observing examples and drawing the conclusion ourselves:

$$\frac{\mathrm{d}}{\mathrm{d}x} \sum_i u_i(x) = \sum_i \frac{\mathrm{d}}{\mathrm{d}x} u_i(x)$$

where $$u_i(x)$$ could be any differentiable function. In other words, the derivative of a sum is the sum of the derivatives. Yet for a long time, until that day, I didn’t recognize the rule when I worked on optimizations written this way, with the $$\Sigma$$. I was optimizing the cost function, and I can’t believe it took me so long to see where its derivative comes from.
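A quick numerical check can make the rule concrete. The sketch below is only an illustration: the three terms and the helper `derivative` (a central-difference estimate) are my own choices, not anything from a particular library.

```python
import math

def derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x) (an approximation, not exact)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Hypothetical terms u_i(x); any differentiable functions would do.
terms = [math.sin, math.exp, lambda x: x ** 3]

def total(x):
    """The sum of all the terms, evaluated at x."""
    return sum(u(x) for u in terms)

x0 = 0.7
derivative_of_sum = derivative(total, x0)
sum_of_derivatives = sum(derivative(u, x0) for u in terms)

# The two numbers should agree up to floating-point noise.
print(derivative_of_sum, sum_of_derivatives)
```

Both values should also sit close to the exact answer, $$\cos(0.7) + e^{0.7} + 3 \cdot 0.7^2$$.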

Dot Products

Let’s try to find the derivative of $$\vec{W} \cdot \vec{X}$$ with respect to all the components of $$\vec{X}$$. I did not try to tackle this directly as a whole. I first found the derivative with respect to a single component of $$\vec{X}$$, since this operation generalizes to all of $$\vec{X}$$’s components.

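To continue with the theme of breaking ideas down into more elementary ones, let’s rewrite the original expression in terms of scalars. $$x_i$$ denotes a component of $$\vec{X}$$, and $$w_i$$ the corresponding component of $$\vec{W}$$.

$$\vec{W} \cdot \vec{X} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

and take its derivative with respect to $$x_i$$:[1]

$$\frac{\partial}{\partial x_i} \left( w_1 x_1 + \cdots + w_i x_i + \cdots + w_n x_n \right) = w_i$$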

If we do that for each of the components and assemble the results back into a vector, we get $$\vec{W}$$. In other words, $$\frac{\partial}{\partial \vec{X}} (\vec{W} \cdot \vec{X}) = \vec{W}$$.[2]

What an elegant, simple answer! This is due to the fact that finding the derivative is a linear operation and a dot product is a sum. Each component of $$\vec{X}$$ is only multiplied by its corresponding component in $$\vec{W}$$. For each derivative with respect to $$x_i$$, the linearity of differentiation allows us to calculate the derivative of each term independently and then add them. Since only one term varies with $$x_i$$, the derivatives of the other terms become zero.
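To see this numerically, here is a minimal sketch that estimates each partial derivative of the dot product by nudging one component of $$\vec{X}$$ at a time. The helper names `dot` and `numerical_gradient`, and the example vectors, are made up for illustration.

```python
def dot(w, x):
    """Dot product written out as the sum of componentwise products."""
    return sum(wi * xi for wi, xi in zip(w, x))

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of each partial derivative of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h   # nudge only component i up...
        x_minus[i] -= h  # ...and down, holding the others constant
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

# Hypothetical example vectors.
W = [2.0, -1.0, 0.5]
X = [0.3, 1.2, -4.0]

grad = numerical_gradient(lambda x: dot(W, x), X)
print(grad)  # each entry should be close to the matching entry of W
```

Changing `X` to any other values should leave the result unchanged, because the gradient of a dot product with respect to $$\vec{X}$$ does not depend on $$\vec{X}$$.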

If the derivative of a sum did not always equal the sum of derivatives, finding the derivative of a dot product would probably not be as trivial.

Derivatives of odd and even functions

3Blue1Brown’s brilliant video on abstract vector spaces talks about representing functions as vectors and then drives home the point of this article by showing that differentiation is a linear transformation of those vectors.[3] This section uses the language of vectors to show (not rigorously) that the derivative of an even function is odd and the derivative of an odd function is even.

In 3Blue1Brown’s video we learn that functions can be treated like vectors. He shows that we can represent a polynomial as a vector of its coefficients, starting from the coefficient of $$x^0$$, the constant term. As hinted in the video, the polynomial is a dot product of this vector and $$[1, x, x^2, x^3, x^4, \ldots, x^n ]$$. Many non-polynomial functions are equal to their Taylor series, so we can apply this reasoning to those functions as well, using an infinite list of coefficients. For example, the infinite list $$[0, 1, 0, -\frac{1}{3!}, 0, \frac{1}{5!}, 0, \ldots]$$ represents $$\sin (x)$$.
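As a small sketch of that idea, the code below evaluates a polynomial as the dot product of its coefficient list with the list of powers of $$x$$, and then does the same with a truncated list of Taylor coefficients for $$\sin(x)$$. The function names are my own, and the truncation length of 12 is an arbitrary choice.

```python
import math

def powers(x, n):
    """The list [1, x, x^2, ..., x^(n-1)]."""
    return [x ** k for k in range(n)]

def evaluate(coeffs, x):
    """Dot product of the coefficient list with the list of powers of x."""
    return sum(c * p for c, p in zip(coeffs, powers(x, len(coeffs))))

def sin_coeffs(n):
    """First n Taylor coefficients of sin about 0: [0, 1, 0, -1/3!, 0, 1/5!, ...]."""
    coeffs = []
    for k in range(n):
        if k % 2 == 0:
            coeffs.append(0.0)  # even-indexed entries are zero
        else:
            sign = -1.0 if (k // 2) % 2 else 1.0  # signs alternate: +, -, +, ...
            coeffs.append(sign / math.factorial(k))
    return coeffs

poly = [5, 3, 1]  # represents 5 + 3x + x^2
print(evaluate(poly, 2))  # 5 + 3*2 + 2^2 = 15

# A truncated list only approximates sin(x), but it is very close near 0.
print(evaluate(sin_coeffs(12), 0.5), math.sin(0.5))
```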

Odd polynomials only have odd-degree terms[4] and thus have nonzero coefficients only at odd indices, with zeroes at every even index. The opposite applies to even polynomials.

Let’s go with 3Blue1Brown’s way of taking a derivative by linear transformation of this list. The rule is to multiply each component of the list by its index and shift it one index left. Taking the derivative of a polynomial is equivalent to multiplying its list of coefficients by an infinite matrix.[5] The derivative of an odd polynomial represented by $$[0, c_1, 0, c_3, 0, c_5, 0, \ldots]$$ is the even polynomial represented by $$[d_0, 0, d_2, 0, d_4, 0, d_6,\ldots]$$, where $$c_n$$ and $$d_n$$ are placeholders for any number. The shift by one turns any even polynomial into an odd polynomial and vice versa.
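The multiply-and-shift rule is short enough to write out directly. This is a minimal sketch on finite coefficient lists; `differentiate` and `nonzero_indices` are hypothetical helper names, and the example polynomials are arbitrary.

```python
def differentiate(coeffs):
    """Multiply each coefficient by its index, then shift one index left."""
    scaled = [k * c for k, c in enumerate(coeffs)]
    return scaled[1:]  # dropping index 0 is the shift to the left

def nonzero_indices(coeffs):
    """Indices (degrees) whose coefficients are not zero."""
    return [k for k, c in enumerate(coeffs) if c != 0]

odd_poly = [0, 4, 0, -2, 0, 7]  # 4x - 2x^3 + 7x^5
even_poly = [1, 0, 3, 0, 5]     # 1 + 3x^2 + 5x^4

d_odd = differentiate(odd_poly)    # [4, 0, -6, 0, 35], i.e. 4 - 6x^2 + 35x^4
d_even = differentiate(even_poly)  # [0, 6, 0, 20], i.e. 6x + 20x^3

print(nonzero_indices(d_odd))   # [0, 2, 4]: only even degrees remain
print(nonzero_indices(d_even))  # [1, 3]: only odd degrees remain
```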

We could have reasoned about this without vectors and simply stated that the power of each term decrements by 1, so that any odd term becomes even and vice versa. But reasoning in terms of vectors and linear transforms shows how the linearity of differentiation connects it deeply with linear algebra.

Another Acknowledgement

Sean Don critiqued this essay, pointed out subtle errors, and answered my uncertainties. His past encouragement and advice also set me on a better track to write this essay.

1. We use $$\frac{\partial }{\partial x}$$ instead of $$\frac{\mathrm{d} }{\mathrm{d} x}$$ to indicate partial derivatives of functions of multiple variables, which merely means that we treat the variables other than $$x$$ as constants. This is in contrast to the total derivative of multivariable functions, which accounts for the possibility that the function inputs depend on each other. Since in this article we assume that function inputs are independent of each other (in other words, changing one input won’t affect the others), it’s safe to just use partial derivatives. If this explanation isn’t clear, let me know, and I can try to elaborate more.

2. The derivative of a scalar-valued function of a vector with respect to that vector is also called a gradient, represented by the nabla symbol, $$\nabla$$. Matrix calculus is the general way of representing derivatives involving non-scalar quantities in the numerator and/or denominator.

3. The YouTube series that this video is part of, Essence of Linear Algebra, inspired my fascination with linear algebra.

4. Why?

First, let’s make a function $$f(x) = x^n$$. If $$n$$ is even, $$f(x)$$ is even; if $$n$$ is odd, $$f(x)$$ is odd.

Second, the sum of odd functions is odd and the sum of even functions is even. The sum of a nonzero even function and a nonzero odd function is neither odd nor even. Therefore, an even polynomial cannot contain any odd-degree terms, and an odd polynomial cannot contain any even-degree terms.

Combining the two, an even polynomial is the sum of terms with even degrees and an odd polynomial is the sum of terms with odd degrees.
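The argument in this footnote can be spot-checked numerically. The sketch below tests $$f(-x) = f(x)$$ and $$f(-x) = -f(x)$$ at a few sample points, which is evidence rather than proof; the helper names and the sample points are my own choices.

```python
def evaluate(coeffs, x):
    """Evaluate the polynomial whose coefficient of x^k is coeffs[k]."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def parity(coeffs, samples=(1, 2, 3)):
    """Classify a polynomial by comparing f(-x) with f(x) at sample points."""
    even = all(evaluate(coeffs, -x) == evaluate(coeffs, x) for x in samples)
    odd = all(evaluate(coeffs, -x) == -evaluate(coeffs, x) for x in samples)
    if even and odd:
        return "both"  # only the zero polynomial is both even and odd
    if even:
        return "even"
    if odd:
        return "odd"
    return "neither"

even_poly = [1, 0, 3]     # 1 + 3x^2
odd_poly = [0, 2, 0, 5]   # 2x + 5x^3
mixed = [1, 2, 3, 5]      # the sum of the two polynomials above

print(parity(even_poly), parity(odd_poly), parity(mixed))
```

Integer coefficients and integer sample points keep the arithmetic exact, so the equality tests are not thrown off by floating-point rounding.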

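5. The infinite matrix, whose only nonzero entries are $$1, 2, 3, \ldots$$ just above the main diagonal:

$$\begin{bmatrix} 0 & 1 & 0 & 0 & \cdots \\ 0 & 0 & 2 & 0 & \cdots \\ 0 & 0 & 0 & 3 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$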
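An infinite matrix cannot be stored, but a finite truncation behaves the same way on polynomials of low enough degree. The sketch below assumes the matrix has the entry $$k$$ in row $$k - 1$$, column $$k$$ (counting from zero) and zeroes elsewhere, which is what the multiply-by-index-and-shift-left rule implies; `derivative_matrix` and `matvec` are hypothetical helper names.

```python
def derivative_matrix(n):
    """n x n truncation: row r has the value r + 1 in column r + 1, zeroes elsewhere."""
    return [[(c if c == r + 1 else 0) for c in range(n)] for r in range(n)]

def matvec(matrix, vector):
    """Matrix-vector product, one dot product per row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

coeffs = [5, 3, 1, 4]  # 5 + 3x + x^2 + 4x^3
D = derivative_matrix(4)

for row in D:
    print(row)

print(matvec(D, coeffs))  # [3, 2, 12, 0], i.e. 3 + 2x + 12x^2
```

Because each row of the product is itself a dot product, this is the same linearity argument as in the dot product section, applied once per row.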