
In this post, we use the following convention:

\[\begin{align*} \mathbf{x} = (x_1, \cdots, x_n)^\top, \mathbf{y} = (y_1, \cdots, y_m)^\top \\ \\ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \\ \end{pmatrix} \end{align*}\]

For a function $f(x_1, \cdots, x_n) \in \mathbb{R}$, the gradient $\nabla f$ is

\[\begin{align*} \nabla f = \left( \frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_n} \right)^\top = \left( \frac{\partial f}{\partial \mathbf{x}} \right)^\top \end{align*}\]
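
To make the layout convention concrete, here is a minimal pure-Python sketch that approximates $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ by central differences. The helper `numerical_jacobian` and the map `example_map` are hypothetical names chosen for illustration; the only point is that the result has $m$ rows and $n$ columns.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Approximate dy/dx (m rows, n columns) by central differences."""
    m = len(f(x))
    n = len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_plus[j] += eps
        x_minus = list(x)
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            # Row i, column j holds d y_i / d x_j.
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


def example_map(x):
    # Hypothetical map from R^2 to R^3, used only for illustration.
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]


jacobian = numerical_jacobian(example_map, [2.0, 3.0])
# Analytically, the Jacobian at (2, 3) is [[3, 2], [1, 1], [0, 6]].
for row in jacobian:
    print([round(v, 4) for v in row])
```

For a scalar function ($m = 1$) the same helper returns a single row, and the gradient is that row written as a column.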

Matrix Calculus

Basics

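Suppose $\mathbf{A}$ is an $m \times n$ matrix independent of $\mathbf{x}$ and $\mathbf{y}$, the vectors $\mathbf{x}$ and $\mathbf{y}$ do not depend on each other, and $\mathbf{x} = (x_1 (\mathbf{z}), \cdots, x_n (\mathbf{z}))^\top$ for some $\mathbf{z}$. In the last identity, $\mathbf{A}$ is square ($m = n$).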

\[\begin{align*} &\frac{\partial \mathbf{Ax}}{\partial \mathbf{x}} = \mathbf{A} \\ &\frac{\partial \mathbf{Ax}}{\partial \mathbf{z}} = \mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \\ &\frac{\partial \mathbf{y}^\top \mathbf{Ax}}{\partial \mathbf{x}} = \mathbf{y}^\top \mathbf{A} \\ &\frac{\partial \mathbf{y}^\top \mathbf{Ax}}{\partial \mathbf{y}} = \mathbf{x}^\top \mathbf{A}^\top \\ &\frac{\partial \mathbf{x}^\top \mathbf{Ax}}{\partial \mathbf{x}} = \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top) \end{align*}\]
$\mathbf{Proof.}$ For the first two identities, write $\mathbf{y} = \mathbf{Ax}$, so that $y_i = \sum_{l = 1}^n A_{il} x_l$.
\[\begin{align*} \frac{\partial y_i}{\partial x_j} = \frac{\partial}{\partial x_j} \left( \sum_{l = 1}^n A_{il} x_l \right) = A_{ij}._\blacksquare \end{align*}\] \[\begin{align*} \frac{\partial y_i}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \sum_{l = 1}^n A_{il} x_l (\mathbf{z}) \right) = \sum_{l = 1}^n A_{il} \frac{\partial x_l}{\partial z_j} = \left( \mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \right)_{ij}._\blacksquare \end{align*}\]

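For the third and fourth identities, let $\alpha = \mathbf{y}^\top \mathbf{Ax}$, which is a scalar. Under our convention, $\frac{\partial \alpha}{\partial \mathbf{x}}$ is a $1 \times n$ row vector, so it has the same shape as $\mathbf{x}^\top.$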

\[\begin{align*} &\frac{\partial \alpha}{\partial x_j} = \frac{\partial}{\partial x_j} \left( \sum_{i} \sum_{k} y_i A_{ik} x_k \right) = \sum_i y_i A_{ij} = (\mathbf{y}^\top \mathbf{A})_{j} \\ &\frac{\partial \alpha}{\partial y_j} = \frac{\partial}{\partial y_j} \left( \sum_{i} \sum_{k} y_i A_{ik} x_k \right) = \sum_k A_{jk} x_k = (\mathbf{A} \mathbf{x})_{j} = (\mathbf{x}^\top \mathbf{A}^\top)_{j} \end{align*}\]

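Alternatively, since $\alpha$ is a scalar, $\alpha = \alpha^\top = \mathbf{x}^\top \mathbf{A}^\top \mathbf{y}$, and the fourth identity follows from the third with the roles of $\mathbf{x}$ and $\mathbf{y}$ swapped. $_\blacksquare$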

Lastly, for $\alpha = \mathbf{x}^\top \mathbf{Ax}$, the product rule gives

\[\begin{align*} \frac{\partial \alpha}{\partial x_j} = \frac{\partial}{\partial x_j} \left( \sum_{i} \sum_{k} x_i A_{ik} x_k \right) = \sum_i x_i A_{ij} + \sum_k A_{jk} x_k = (\mathbf{x}^\top \mathbf{A})_j + (\mathbf{x}^\top \mathbf{A}^\top)_j ._\blacksquare \end{align*}\]
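
The identities for $\mathbf{y}^\top \mathbf{Ax}$ and $\mathbf{x}^\top \mathbf{Ax}$ can be sanity-checked numerically. The sketch below is a rough finite-difference check in pure Python, not a proof; `bilinear` and `row_derivative` are hypothetical helper names, and the test values are arbitrary.

```python
def bilinear(y, A, x):
    """Compute the scalar y^T A x."""
    return sum(y[i] * A[i][k] * x[k] for i in range(len(y)) for k in range(len(x)))


def row_derivative(f, x, eps=1e-6):
    """Central-difference estimate of the 1 x n row vector d f / d x."""
    out = []
    for j in range(len(x)):
        x_plus = list(x)
        x_plus[j] += eps
        x_minus = list(x)
        x_minus[j] -= eps
        out.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return out


# Arbitrary test data (assumed values, chosen only for the check).
A = [[1.0, 2.0, 0.5], [-1.0, 0.0, 3.0], [4.0, 1.0, -2.0]]
x = [0.5, -1.0, 2.0]
y = [1.0, 2.0, -1.0]
n = len(x)

# d(y^T A x)/dx should match y^T A.
bilinear_numeric = row_derivative(lambda v: bilinear(y, A, v), x)
bilinear_formula = [sum(y[i] * A[i][j] for i in range(n)) for j in range(n)]

# d(x^T A x)/dx should match x^T (A + A^T).
quadratic_numeric = row_derivative(lambda v: bilinear(v, A, v), x)
quadratic_formula = [sum(x[i] * (A[i][j] + A[j][i]) for i in range(n)) for j in range(n)]

print(bilinear_numeric, bilinear_formula)
print(quadratic_numeric, quadratic_formula)
```

Both functions are at most quadratic in $\mathbf{x}$, so central differences agree with the formulas up to floating-point rounding.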


Additionally, suppose $\mathbf{y} = (y_1 (\mathbf{z}), \cdots, y_m (\mathbf{z}))^\top$ also depends on $\mathbf{z}$. Then, by the product rule and the chain rule (with $m = n$ in the first identity),

\[\begin{align*} &\frac{\partial \mathbf{y}^\top \mathbf{x}}{\partial \mathbf{z}} = \mathbf{x}^\top \frac{\partial \mathbf{y}} {\partial \mathbf{z}} + \mathbf{y}^\top \frac{\partial \mathbf{x}} {\partial \mathbf{z}} \\ &\frac{\partial \mathbf{y}^\top \mathbf{Ax}}{\partial \mathbf{z}} = \mathbf{y}^\top \mathbf{A} \frac{\partial \mathbf{x}} {\partial \mathbf{z}} + \mathbf{x}^\top \mathbf{A}^\top \frac{\partial \mathbf{y}} {\partial \mathbf{z}} \\ &\frac{\partial \mathbf{x}^\top \mathbf{Ax}}{\partial \mathbf{z}} = \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top) \frac{\partial \mathbf{x}} {\partial \mathbf{z}} \end{align*}\]
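
As a rough check of the second chain-rule identity, the sketch below compares the formula with a finite-difference derivative in pure Python. The maps `x_of` and `y_of`, their hand-written Jacobians, and the matrix `A` are all assumptions made up for this example.

```python
A = [[1.0, 2.0], [0.0, -1.0]]


def x_of(z):
    return [z[0] * z[1], z[1] ** 2]


def y_of(z):
    return [z[0] + z[1], z[0] ** 2]


def dx_dz(z):
    # Jacobian of x_of: row i, column j holds d x_i / d z_j.
    return [[z[1], z[0]], [0.0, 2 * z[1]]]


def dy_dz(z):
    # Jacobian of y_of, same layout.
    return [[1.0, 1.0], [2 * z[0], 0.0]]


def alpha(z):
    """The scalar y(z)^T A x(z)."""
    x, y = x_of(z), y_of(z)
    return sum(y[i] * A[i][k] * x[k] for i in range(2) for k in range(2))


def row_times_matrix(row, M):
    return [sum(row[i] * M[i][j] for i in range(len(row))) for j in range(len(M[0]))]


def transpose(M):
    return [list(col) for col in zip(*M)]


z0 = [1.5, -0.5]
x0, y0 = x_of(z0), y_of(z0)

# Formula: y^T A (dx/dz) + x^T A^T (dy/dz).
term_x = row_times_matrix(row_times_matrix(y0, A), dx_dz(z0))
term_y = row_times_matrix(row_times_matrix(x0, transpose(A)), dy_dz(z0))
chain_formula = [p + q for p, q in zip(term_x, term_y)]

# Central-difference estimate of d alpha / d z.
eps = 1e-6
chain_numeric = []
for j in range(2):
    z_plus = list(z0)
    z_plus[j] += eps
    z_minus = list(z0)
    z_minus[j] -= eps
    chain_numeric.append((alpha(z_plus) - alpha(z_minus)) / (2 * eps))

print(chain_formula, chain_numeric)
```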





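Now, let’s extend this to differentiation with respect to a matrix. For a scalar function $f$ of a matrix $\mathbf{X}$, we write $\frac{\partial f}{\partial \mathbf{X}}$ for the matrix whose $(i, j)$ entry is $\frac{\partial f}{\partial X_{ij}}$, so it has the same shape as $\mathbf{X}$. Differentiating entry by entry, one can easily prove: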

\[\begin{align*} &\frac{\partial}{\partial \mathbf{X}} (\mathbf{a}^\top \mathbf{Xb}) = \mathbf{ab}^\top \\ &\frac{\partial}{\partial \mathbf{X}} (\mathbf{a}^\top \mathbf{X}^\top \mathbf{b}) = \mathbf{ba}^\top \\ \end{align*}\]
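
These two identities can be checked entry by entry. The following pure-Python sketch assumes the derivative with respect to $\mathbf{X}$ has the same shape as $\mathbf{X}$; `matrix_derivative` is a hypothetical helper, and the vectors and matrix are arbitrary.

```python
def matrix_derivative(f, X, eps=1e-6):
    """Entry (i, j) approximates d f / d X_ij by central differences."""
    rows, cols = len(X), len(X[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            X_plus = [row[:] for row in X]
            X_plus[i][j] += eps
            X_minus = [row[:] for row in X]
            X_minus[i][j] -= eps
            grad[i][j] = (f(X_plus) - f(X_minus)) / (2 * eps)
    return grad


X = [[0.2, -1.0, 0.7], [1.5, 0.3, -0.4]]  # 2 x 3
u = [1.0, -2.0]                            # length 2 (rows of X)
v = [3.0, 0.5, 2.0]                        # length 3 (columns of X)

# First identity with a = u, b = v: d(a^T X b)/dX = a b^T.
first_numeric = matrix_derivative(
    lambda M: sum(u[i] * M[i][j] * v[j] for i in range(2) for j in range(3)), X
)
first_formula = [[u[i] * v[j] for j in range(3)] for i in range(2)]

# Second identity with a = v, b = u: d(a^T X^T b)/dX = b a^T.
second_numeric = matrix_derivative(
    lambda M: sum(v[j] * M[i][j] * u[i] for i in range(2) for j in range(3)), X
)
second_formula = [[u[i] * v[j] for j in range(3)] for i in range(2)]

print(first_numeric)
print(second_numeric)
```

With this choice of vectors the two scalars coincide, since $\mathbf{a}^\top \mathbf{X}^\top \mathbf{b} = \mathbf{b}^\top \mathbf{X} \mathbf{a}$, which is why both checks yield the same matrix.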

Based on these, we derive some useful derivatives of common matrix functions.




Determinant

\[\begin{align*} &\frac {\partial} {\partial \mathbf{A}} \text{det} (\mathbf{A}) = \text{det} (\mathbf{A}) (\mathbf{A}^{-\top}) \\ &\frac {\partial} {\partial \mathbf{A}} \text{det} (\mathbf{XAY}) = \text{det} (\mathbf{XAY}) (\mathbf{A}^{-\top}) \end{align*}\]
$\mathbf{Proof.}$

Let $\text{det}(\mathbf{A})$ be the determinant of an invertible matrix $\mathbf{A} = (a_{ij})$ of size $n \times n$, and $\mathbf{C} = (c_{ij})$ be the cofactor matrix of $\mathbf{A}$.

From the cofactor expansions $\text{det}(\mathbf{A}) = \sum_{i=1}^n a_{ij} c_{ij}$ for any fixed $j$, or $\text{det}(\mathbf{A})= \sum_{j=1}^n a_{ij} c_{ij}$ for any fixed $i$, and $\mathbf{A}^{-1} = \frac{1}{\text{det}(\mathbf{A})} \text{adj} (\mathbf{A})$, where $\text{adj} (\mathbf{A}) = \mathbf{C}^\top$, and noting that none of the cofactors $c_{i1}, \cdots, c_{in}$ depends on the entries of row $i$, we obtain

\[\begin{align*} &\frac {\partial} {\partial a_{ij}} \text{det} (\mathbf{A}) = c_{ij} = \text{det} (\mathbf{A}) (\mathbf{A}^{-1})_{ji}. \\ \Rightarrow &\frac {\partial} {\partial \mathbf{A}} \text{det} (\mathbf{A}) = \text{det} (\mathbf{A}) (\mathbf{A}^{-\top}) \end{align*}\]

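For the second identity, assume $\mathbf{X}$ and $\mathbf{Y}$ are square and independent of $\mathbf{A}$. From $\text{det}(\mathbf{AB}) = \text{det}(\mathbf{A}) \text{det}(\mathbf{B})$,

\[\begin{align*} \frac {\partial} {\partial \mathbf{A}} \text{det} (\mathbf{XAY}) = \text{det} (\mathbf{X}) \text{det} (\mathbf{Y}) \frac {\partial} {\partial \mathbf{A}} \text{det} (\mathbf{A}) = \text{det} (\mathbf{X}) \text{det} (\mathbf{Y}) \text{det} (\mathbf{A}) (\mathbf{A}^{-\top}) = \text{det} (\mathbf{XAY}) (\mathbf{A}^{-\top}) ._\blacksquare \end{align*}\]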
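
Both determinant identities can be checked numerically on a small example. The sketch below is pure Python and assumes square $3 \times 3$ matrices with arbitrary entries; it builds $\mathbf{A}^{-\top}$ from the cofactor matrix, as in the proof, and all helper names are hypothetical.

```python
def minor(M, i, j):
    """M with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(M) if k != i]


def det(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det(minor(M, 0, j)) for j in range(len(M)))


def inverse_transpose(M):
    """A^{-T} = C / det(A), where C is the cofactor matrix."""
    n, d = len(M), det(M)
    return [[(-1) ** (i + j) * det(minor(M, i, j)) / d for j in range(n)] for i in range(n)]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def matrix_derivative(f, M, eps=1e-6):
    """Entry (i, j) approximates d f / d M_ij by central differences."""
    n = len(M)
    grad = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            M_plus = [row[:] for row in M]
            M_plus[i][j] += eps
            M_minus = [row[:] for row in M]
            M_minus[i][j] -= eps
            grad[i][j] = (f(M_plus) - f(M_minus)) / (2 * eps)
    return grad


A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]]
Y = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

A_inv_T = inverse_transpose(A)

det_numeric = matrix_derivative(det, A)
det_formula = [[det(A) * A_inv_T[i][j] for j in range(3)] for i in range(3)]

det_xay = det(matmul(matmul(X, A), Y))
xay_numeric = matrix_derivative(lambda M: det(matmul(matmul(X, M), Y)), A)
xay_formula = [[det_xay * A_inv_T[i][j] for j in range(3)] for i in range(3)]

print(det_numeric)
print(xay_numeric)
```

Since the determinant is linear in each single entry, the central differences here match the formulas up to rounding error.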





Trace

\[\begin{align*} &\frac {\partial} {\partial \mathbf{A}} \text{tr} (\mathbf{A}) = \mathbf{I} \\ &\frac {\partial} {\partial \mathbf{A}} \text{tr} (\mathbf{XAY}) = \mathbf{X}^\top \mathbf{Y}^\top \\ \end{align*}\]

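$\mathbf{Proof.}$ Let $\text{tr}(\mathbf{A})$ be the trace of matrix $\mathbf{A} = (a_{ij})$ of size $n \times n$. Note that $\text{tr}(\mathbf{A}) = \sum_{i=1}^n a_{ii}$, so $\frac{\partial}{\partial a_{ij}} \text{tr}(\mathbf{A})$ is $1$ if $i = j$ and $0$ otherwise, which gives the first identity. For the second, $\text{tr}(\mathbf{XAY}) = \sum_i \sum_k \sum_l X_{ik} a_{kl} Y_{li}$, so $\frac{\partial}{\partial a_{kl}} \text{tr}(\mathbf{XAY}) = \sum_i X_{ik} Y_{li} = (\mathbf{X}^\top \mathbf{Y}^\top)_{kl}. _\blacksquare$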
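
The trace identities can be checked the same way. This is a minimal pure-Python sketch with arbitrary $2 \times 2$ matrices; the helper names are hypothetical and the check is a finite-difference approximation only.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(M):
    return [list(col) for col in zip(*M)]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


def matrix_derivative(f, M, eps=1e-6):
    """Entry (i, j) approximates d f / d M_ij by central differences."""
    rows, cols = len(M), len(M[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            M_plus = [row[:] for row in M]
            M_plus[i][j] += eps
            M_minus = [row[:] for row in M]
            M_minus[i][j] -= eps
            grad[i][j] = (f(M_plus) - f(M_minus)) / (2 * eps)
    return grad


X = [[1.0, 2.0], [0.0, -1.0]]
A = [[0.5, 1.0], [-2.0, 3.0]]
Y = [[2.0, 1.0], [1.0, 4.0]]

# d tr(A) / dA should be the identity matrix.
trace_numeric = matrix_derivative(trace, A)
identity = [[1.0, 0.0], [0.0, 1.0]]

# d tr(XAY) / dA should be X^T Y^T.
xay_trace_numeric = matrix_derivative(lambda M: trace(matmul(matmul(X, M), Y)), A)
xay_trace_formula = matmul(transpose(X), transpose(Y))

print(trace_numeric)
print(xay_trace_numeric, xay_trace_formula)
```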
