
Learning to Find the Derivative of the Quadratic Form

8/26/21

Introduction

One of the many problems I've come across and spent an unhealthy amount of time on is figuring out how to find the derivative of a quadratic form. By definition, a quadratic form is a homogeneous polynomial of degree 2, and the problem that introduced one to me looked something like this:

$$\begin{equation} \begin{aligned} {\bf x^\top Ax}; \quad {\bf x} \in \mathbb{R}^n,\; {\bf A} \in \mathbb{R}^{n \times n} \end{aligned} \end{equation}$$

I figured that since this was in the domain of calculus and linear algebra, any book I had on either topic would discuss the process. Now I don't own a lot of books on these topics (2 calc books, 2 LA books), but I was a little surprised that none of them explicitly went through the process of finding the derivative of a map on a vector space. In retrospect, it's more apparent to me that even though none of the books explained the process, they did provide all the pieces for me to put together - actually putting them together just quickly became overwhelming. Segue into the first strategy for solving this problem: explicitly writing out the inner product.

Rewriting to scalar sums

Rewriting the quadratic form in sigma notation provides, at least for me, a clearer first step where you can make use of the product rule: $\begin{equation} \begin{aligned} \frac{d(uv)}{dx} = u\frac{dv}{dx}+v\frac{du}{dx} \end{aligned} \end{equation}$

$\begin{equation} \begin{aligned} {\bf x^\top Ax} = \sum\limits_{j=1}^{n}\sum\limits_{i=1}^{n}x_jx_iA_{ji} = \sum\limits_{j=1}^{n}x_j\sum\limits_{i=1}^{n}x_iA_{ji} \end{aligned} \end{equation}$
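For concreteness, with $n = 2$ the double sum unpacks to:

$$\begin{equation} \begin{aligned} {\bf x^\top Ax} = x_1x_1A_{11} + x_1x_2A_{12} + x_2x_1A_{21} + x_2x_2A_{22} \end{aligned} \end{equation}$$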

It is still easy to get overwhelmed thinking about the inner machinations and the many ways to "multiply" vectors and matrices. It helps when I consider only two terms in a function, compute them, and then move on to the third...fourth...etc. I had to pause here on differentiating $\bf x^\top Ax$ to go back and fortify my understanding of basic linear algebra. Once I better understood the inner product (and numerous other operations), I moved on to differentiating the right hand side of the equality with the product rule:

$$\begin{equation} \begin{aligned} \frac{d ({\bf x^\top Ax})}{d x_k} & = \sum\limits_{j=1}^{n}\frac{dx_j}{dx_k} \sum\limits_{i=1}^{n}x_iA_{ji} + \sum\limits_{j=1}^{n}x_j\sum\limits_{i=1}^{n} \frac{dx_i}{dx_k}A_{ji} \\ & = \sum\limits_{i=1}^{n}x_iA_{ki} + \sum\limits_{j=1}^{n}x_jA_{jk} \end{aligned} \end{equation}$$

To understand what's going on here and make it more comprehensible, it helps to clarify that everything is taken with respect to $x_k$ when going through the product rule (the middle expression). The purpose of introducing the "x sub k" notation is to give us an index with which to rewrite the function later on, when we recompose it into vector notation; it also reinforces that we are differentiating with respect to a single scalar component rather than the whole vector at once. Looking at $\begin{equation} \begin{aligned} \sum\limits_{j=1}^{n}\frac{dx_j}{dx_k} \sum\limits_{i=1}^{n}x_iA_{ji} \end{aligned} \end{equation}$, it is easy to get stuck trying to "figure out" $\frac {dx_j}{dx_k}$ like I have (many times), but there is nothing deep there: since the components of $\bf x$ are independent, $\frac{dx_j}{dx_k}$ is 1 when $j = k$ and 0 otherwise, which is exactly what collapses each double sum into a single sum. It also helps to look at the relationship of $x_j$ to $A_{ji}$ - the inner sum simply adds up the product of each element $x_i$ with the corresponding element of row $j$ of $\bf A$. Then, in the corresponding $\sum\limits_{i=1}^{n}x_iA_{ki}$ (the first term on the right hand side of the equality), the "sub k" index takes over because $x_k$ is what we are differentiating with respect to. Rinse and repeat for the second term in the middle expression and you have a finalized derivative of $\bf x^\top Ax$ represented in sigma notation/scalar sums.
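As a sanity check on the scalar-sum result, here is a minimal NumPy sketch (the function name `quad_form` and the random test values are my own, not from any reference) that compares $\sum\limits_{i=1}^{n}x_iA_{ki} + \sum\limits_{j=1}^{n}x_jA_{jk}$ against a central finite difference in the $k$-th coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))      # a general, not necessarily symmetric, matrix
x = rng.standard_normal(n)
k = 2                                # the component we differentiate with respect to

def quad_form(v):
    return v @ A @ v                 # x^T A x as a scalar

# Scalar-sum formula: sum_i x_i A_{k,i} + sum_j x_j A_{j,k}
analytic = A[k, :] @ x + A[:, k] @ x

# Central finite difference in the k-th coordinate
eps = 1e-6
e_k = np.zeros(n)
e_k[k] = 1.0
numeric = (quad_form(x + eps * e_k) - quad_form(x - eps * e_k)) / (2 * eps)

print(analytic, numeric)             # the two agree to roughly 1e-9
```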

A lot of concepts in linear algebra still throw me off because there are ubiquitous idiosyncrasies within its domain. I have spent hours looking at the same equation answered in different posts online because one answer is written in numerator layout notation and the other in denominator layout notation (and then there's the occasional Einstein notation). How I currently view it is that as long as you remain consistent with your notation, you're free to use whichever one you see fit. Anecdotally, most Jacobians I see in the domain of machine learning are written in numerator layout notation, so I am partial to that style.

Moving on, we now have to rewrite our differentiated formula back into vector notation. It is very simple if you remember why we introduced the "sub k" notation earlier on. Our answer looks something like $\sum\limits_{i=1}^{n}x_iA_{ki} + \sum\limits_{j=1}^{n}x_jA_{jk}$. All we have to do is populate a vector whose $k$-th entry is this expression. Now the point from the last paragraph becomes more apparent: writing the result as a column or as a row isn't strictly the distinction between numerator and denominator layout, but it is an example of the notational flexibility of linear algebra. Both of the right hand sides of the equalities below are our final answers in vector notation - note that one is the transpose of the other and, if $\bf A$ is symmetric, they simplify further to $2{\bf Ax}$ (or its transpose).

Column Vector Format:

$$\begin{equation} \begin{aligned} \begin{bmatrix} \sum\limits_{i=1}^{n}x_iA_{1i} + \sum\limits_{j=1}^{n}x_jA_{j1} \\ \sum\limits_{i=1}^{n}x_iA_{2i} + \sum\limits_{j=1}^{n}x_jA_{j2} \\ \vdots \\ \sum\limits_{i=1}^{n}x_iA_{ni} + \sum\limits_{j=1}^{n}x_jA_{jn} \\ \end{bmatrix} = {\bf Ax} + ({\bf x}^\top{\bf A})^\top = ({\bf A} + {\bf A}^\top){\bf x} \end{aligned} \end{equation}$$

Row Vector Format:

$$\begin{equation} \begin{aligned} \begin{bmatrix} \sum\limits_{i=1}^{n}x_iA_{1i} + \sum\limits_{j=1}^{n}x_jA_{j1} & \sum\limits_{i=1}^{n}x_iA_{2i} + \sum\limits_{j=1}^{n}x_jA_{j2} & \dots & \sum\limits_{i=1}^{n}x_iA_{ni} + \sum\limits_{j=1}^{n}x_jA_{jn} \end{bmatrix} = \end{aligned} \end{equation} $$ $$ \begin{equation} \begin{aligned}({\bf Ax} + ({\bf x}^\top{\bf A})^\top)^\top = (({\bf A} + {\bf A}^\top){\bf x})^\top = {\bf x}^\top({\bf A} + {\bf A}^\top) \end{aligned} \end{equation} $$
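If it helps to see the whole vector at once, here is a similar sketch (the helper `num_grad` is my own, not a library routine) that checks $({\bf A} + {\bf A}^\top){\bf x}$ against a numerical gradient and confirms the $2{\bf Ax}$ simplification when $\bf A$ is symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient, one coordinate at a time."""
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda v: v @ A @ v
print(np.allclose(num_grad(f, x), (A + A.T) @ x))   # True for a general A

S = A + A.T                                         # any symmetric matrix works here
f_sym = lambda v: v @ S @ v
print(np.allclose(num_grad(f_sym, x), 2 * S @ x))   # True once the matrix is symmetric
```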

The Fréchet

This one is my favorite. The prior explanation, decomposing $\bf x^\top Ax$ into scalar sums, is great as well, but this one is pretty cool because it is much simpler. To be more specific, I'm a fan of the Fréchet derivative written with Landau notation. From Wikipedia: "If there exists such an operator A, it is unique, so we write $Df(x)=A$ and call it the Fréchet derivative of f at x". The goal is now to find A for $\bf x^\top Ax$ when instantiated into $f({\bf x+h})=f({\bf x})+A{\bf h}+o({\bf h})$ (note: A is not referring to $\bf A$ in the quadratic form but to the operator A in the Fréchet definition).

$$\begin{equation} \begin{aligned} f({\bf x}) &= {\bf x^\top Ax} \\ f({\bf x + h}) &= ({\bf x+h})^\top {\bf A}({\bf x+h}) \\ &= ({\bf x^\top +h^\top})({\bf Ax+Ah}) \\ &= {\bf x^\top Ax+x^\top Ah+ h^\top Ax+ h^\top Ah} \\ &= {\bf x^\top Ax+x^\top Ah+x^\top A^\top h+ h^\top Ah} \\ &= {\bf x^\top Ax}+2{\bf x^\top Ah}+{\bf h^\top Ah} \quad \text{if } {\bf A}={\bf A}^\top \end{aligned} \end{equation}$$

Above is simple algebra: matrix multiplication is associative and distributive, so we're just distributing the transpose and $\bf A$. Distributing the transpose over a product has the nuance of reversing the order of that product: $(AB)^\top = B^\top A^\top$. This is how the two middle terms, ${\bf x^\top Ah+h^\top Ax}$, come together to $2{\bf x^\top Ah}$ (one is the transpose of the other, they evaluate to scalar values, and matrix $\bf A$ is symmetric). Although all three of the points in the parentheses are important, it is especially important to make note of the potential symmetry in $\bf A$; without it, the two middle terms only combine to ${\bf x^\top Ah + x^\top A^\top h}$. The formula in the last line now fits into the template of the Fréchet, and when we isolate A (the operator, not the matrix $\bf A$), we get the derivative of $\bf x^\top Ax$.

A is symmetric

$$\begin{equation} \begin{aligned} f({\bf x}) &= {\bf x^\top Ax} \\ A{\bf h} &= 2{\bf x^\top Ah} \\ o({\bf h}) &= {\bf h^\top Ah} \\ A &= 2{\bf x^\top A} \end{aligned} \end{equation}$$

A is not symmetric

$$\begin{equation} \begin{aligned} f({\bf x}) &= {\bf x^\top Ax} \\ A{\bf h} &= {\bf x^\top Ah+x^\top A^\top h} \\ o({\bf h}) &= {\bf h^\top Ah} \\ A &= {\bf x^\top A+x^\top A^\top} = {\bf x^\top(A+A^\top)} \end{aligned} \end{equation}$$

Do these A's look familiar? They're syntactically identical to what we got when we solved the problem using sigma notation - the first A just assumes $\bf A$ to be symmetric. This is nice and all, but what happened to the last summand in our expanded form, $\bf h^\top Ah$? We saw that it is the "little-o" term, but what does that mean? We say that a function $g({\bf h})$ is $o(||{\bf h}||)$ if $$\lim_{{\bf h} \to 0} \frac{g({\bf h})}{||{\bf h}||} = 0$$ This is nice, but why does $\bf h^\top Ah$ satisfy this? Using the Cauchy-Schwarz and matrix norm inequalities we can figure out why. We write $g({\bf h})={\bf h^\top Ah}$, which gets the ball rolling.

$$\begin{equation} \begin{aligned} \lim_{h \to 0} \frac{|{\bf h^\top Ah}|}{||{\bf h}||}\leq \lim_{h\to 0}\frac{||{\bf h}||||{\bf Ah}||}{||{\bf h}||}\leq \lim_{h\to 0}\frac{||{\bf h}||||{\bf A}||||{\bf h}||}{||{\bf h}||}= \lim_{h\to 0}||{\bf h}||||{\bf A}||=0 \end{aligned} \end{equation}$$

There's a pretty neat proof of the Cauchy-Schwarz inequality that uses the quadratic formula; I think it's worth a watch. Writing out the inequality for $\bf h^\top Ah$ this way gives a clear reason why the ratio goes to 0, so we can finally say $g({\bf h})={\bf h^\top Ah} \in o(||{\bf h}||)$.
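To see this numerically, here is a tiny sketch (entirely my own construction) that shrinks $\bf h$ along a fixed direction and watches the ratio $|{\bf h^\top Ah}|/||{\bf h}||$ fall toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
h = rng.standard_normal(n)
h /= np.linalg.norm(h)                      # fixed unit direction

for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    hs = scale * h                          # shrink h toward zero
    ratio = abs(hs @ A @ hs) / np.linalg.norm(hs)
    print(scale, ratio)                     # the ratio falls in proportion to ||h||
```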

The Gateaux

I don't actually have a way to differentiate $\bf x^\top Ax$ using the Gateaux, however I thought it was worth mentioning because, similarly to the Fréchet, my introduction to the Gateaux came from random answers while I was searching how to differentiate $\bf x^\top Ax$. Before I knew it, I was on a week-long journey watching YouTube videos, reading PDFs and tearing through my books. In concept, though, the Gateaux seems to tackle the problem very similarly to the Fréchet. Both derivatives are defined on vector spaces, and just as the Fréchet generalizes the idea of the ordinary univariate derivative, the Gateaux generalizes the directional derivative.
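Although I won't derive anything with it here, a quick numerical sketch (my own illustration, not a Gateaux derivation) shows the directional-derivative limit that the Gateaux formalizes landing on the same answer we already have:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
v = rng.standard_normal(n)                  # direction of differentiation

f = lambda u: u @ A @ u
expected = x @ (A + A.T) @ v                # x^T (A + A^T) v from the earlier sections

for t in [1e-1, 1e-3, 1e-5]:
    print(t, (f(x + t * v) - f(x)) / t - expected)   # the gap shrinks linearly in t
```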

Chain Rule

The last method I want to talk about is using the chain rule for the total derivative. The only notion (excluding the chain rule I just linked) that you need to accept is $\dfrac{\partial ({\bf x^\top y})}{\partial {\bf x}} = {\bf y}$. By now this is not a particularly difficult equality to accept, but it is further illustrated with:

$$\begin{equation} \begin{aligned} \dfrac{\partial ({\bf x^\top y})}{\partial {\bf x}} = \frac{\partial ({\bf y^\top x})}{\partial {\bf x}} = \begin{bmatrix} \frac{\partial}{\partial x_1}\sum_{i=1}^{n} x_i y_i \\ \frac{\partial}{\partial x_2}\sum_{i=1}^{n} x_i y_i \\ \vdots \\ \frac{\partial}{\partial x_n}\sum_{i=1}^{n} x_i y_i \\ \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = {\bf y} = \nabla_{\bf x}f \end{aligned} \end{equation}$$

Because $\bf x^\top y$ evaluates to a scalar, this is an example of a derivative of a scalar by a vector, also known as the gradient, which is denoted with the "nabla" (upside down triangle) on the right-hand side. I also somewhat sneakily tossed it in there, but it is very helpful to note that the partials of $\bf x^\top y$ and $\bf y^\top x$, both with respect to $\bf x$, are the same column vector $\bf y$. Moving on, what was new for me is the chain rule for the total derivative that I linked in the previous paragraph. It looks like:

$$\frac{df(g,h)}{dx} = \frac{d(g(x)^\top)}{dx} \frac{\partial f(g,h)}{\partial g} + \frac{d(h(x)^\top)}{dx} \frac{\partial f(g,h)}{\partial h}$$

The transposes make the dimensions line up for the vector products; they are omitted for scalar values. Ironically, it took some time for me to wrap my head around the idea of the total derivative, because it's very similar to the regular partial derivative. The main difference I found is that the partial automatically assumes there are no intermediate variables, while the total derivative accounts for every variable's dependence on $\bf x$ in case some of them are intermediate variables.

Substitute $g = \bf Ax$ inside $\bf x^\top Ax$. This gives you ${\bf x}^\top g$. Now it's obvious that $g$ is an intermediate variable which is a function of $\bf x$, so to differentiate we need to use the total derivative chain rule. This is important. Again, if we were to use the regular partial derivative we would not account for the change in $g({\bf x})$ as $\bf x$ changes, which is a no go. This is something I got stuck on.

$$\frac{d({\bf x^\top Ax})}{d{\bf x}} = \frac{d({\bf x}^\top g)}{d{\bf x}} = \frac{\partial({\bf x}^\top g)}{\partial {\bf x}} + \frac{d({\bf Ax})^\top}{d{\bf x}}\frac{\partial ({\bf x}^\top g)}{\partial g} = g + \frac{d({\bf x^\top A^\top})}{d{\bf x}}{\bf x} = g + {\bf A^\top x}$$ $$\text{Substituting }{\bf Ax}\text{ back in for }g\text{:}\qquad {\bf Ax+A^\top x} = ({\bf A+A^\top}){\bf x}$$

I purposefully switch between the notations $\bf Ax$ and $g$ (first line, right hand side) to reinforce that $g = \bf Ax$. This procedure is also pretty short (I think it was the shortest), but the trick of substituting, and understanding that the derivatives of $\bf y^\top x$ and $\bf x^\top y$ w.r.t. the same variable evaluate to the same answer, took some time for me to get down.
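As a final check, here is a short sketch (again my own, with the intermediate variable spelled out as `g`) that mirrors the substitution above and confirms the two chain-rule terms add up to the earlier answer:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

g = A @ x                                   # the intermediate variable g = Ax

# Partial of x^T g with respect to x, holding g fixed: just g
term_wrt_x = g
# Contribution through the intermediate variable: d(Ax)^T/dx * d(x^T g)/dg = A^T x
term_wrt_g = A.T @ x

total = term_wrt_x + term_wrt_g
print(np.allclose(total, (A + A.T) @ x))    # True: matches (A + A^T) x
```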

Thoughts

The rigmarole I went through to learn what was discussed here was pretty annoying. What I just typed on a single page took months for me to understand and get down pat. But it was worth it and I'd probably (hopefully) do it again. I've always wanted to "type" or create a document and this accomplishes that. I tried to differentiate vectors and matrices from scalars clearly by bolding the former, however I want to acknowledge the possibility that I may have forgotten to do so on some variables. Feel free to get in touch with me if you notice any inconsistencies, typos/errors or want to talk about food/math/ml/design. Overall, I'd give "learning how to differentiate the quadratic form" a solid 7.3/10 on the fun scale.

- Ryan Lin