
Motivation

Consider a 1-D image that contains an object spanning 2 consecutive pixels. The goal is to classify the object's color (red, blue, or yellow). During training, objects appear at only one position.

image


Would this work? What if, at test time, an object appears at another place?

image


Also, what if the input is a variable-sized image? This is why we need a new kind of model for vision tasks: the convolutional neural network (CNN). As we will see, a CNN can be viewed as learning a model similar to the visual cortex: it scans the whole image and extracts information from it, much like the human eye.

Convolution

The convolution between two functions $f, g: \mathbb{R}^D \to \mathbb{R}$ is defined as

\[\begin{align*} [f \circledast g] (\mathbf{z}) = \int_{\mathbb{R}^D} f(\mathbf{u}) g(\mathbf{z} - \mathbf{u}) d \mathbf{u} \end{align*}\]

To build intuition, let’s replace the functions with finite-length vectors and think of them as functions defined on a finite set of points. For instance, we may regard a vector $\mathbf{x} = (0, 1, \cdots, L-1) \in \mathbb{R}^L$ as a function $\mathbf{x}(k) = k$ for $k = 0, 1, 2, \cdots, L-1$. Then the convolution operation becomes

\[\begin{align*} [\mathbf{w} \circledast \mathbf{x}] (i) = w_0 x_{i} + w_1 x_{i - 1} + \cdots + w_{L-1} x_{i - L + 1} \end{align*}\]


We see that it “flips” the weight vector $\mathbf{w}$ and then “drags” it over the vector $\mathbf{x}$, summing up the local windows at each point. There is a closely related operation in which we don’t flip $\mathbf{w}$ first, called cross-correlation:

\[\begin{align*} [\mathbf{w} \circledast \mathbf{x}] (i) = w_0 x_{i} + w_1 x_{i + 1} + \cdots + w_{L-1} x_{i + L - 1} \end{align*}\]

In deep learning, the term “convolution” is usually used to mean cross correlation.
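As a quick sanity check, here is a minimal NumPy sketch (my own example, not from the references) showing that the two operations differ only by a flip of the kernel:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])

# True convolution flips the kernel before sliding it over the input.
conv = np.convolve(x, w, mode="valid")

# Cross-correlation slides the kernel without flipping it;
# this is the operation deep learning libraries call "convolution".
xcorr = np.correlate(x, w, mode="valid")

print(conv)   # [2. 2. 2.]
print(xcorr)  # [-2. -2. -2.]
```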


Implementation

So far we have only seen 1D inputs, but images are 2D (or 3D once we include color channels). To handle image data, we extend this operation to 2D:

\[\begin{align*} [\mathbf{W} \circledast \mathbf{X}] (i, j) = \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} w_{u, v} x_{i+u, j+v} \end{align*}\]

where the 2D weight $\mathbf{W}$ has size $H \times W$. For instance, convolving a $3 \times 3$ input $\mathbf{X}$ with a $2 \times 2$ weight $\mathbf{W}$ produces a $2 \times 2$ output $\mathbf{O}$:

\[\begin{align*} \mathbf{O} &= \begin{pmatrix} w_1 & w_2 \\ w_3 & w_4 \\ \end{pmatrix} \circledast \begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \\ \end{pmatrix} \\ &= \begin{pmatrix} w_1 x_1 + w_2 x_2 + w_3 x_4 + w_4 x_5 & w_1 x_2 + w_2 x_3 + w_3 x_5 + w_4 x_6 \\ w_1 x_4 + w_2 x_5 + w_3 x_7 + w_4 x_8 & w_1 x_5 + w_2 x_6 + w_3 x_8 + w_4 x_9 \\ \end{pmatrix} \end{align*}\]
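Here is a minimal NumPy sketch (the helper `conv2d` and the concrete values $x_k = k$, $w_k = k$ are my own) that reproduces this $2 \times 2$ output:

```python
import numpy as np

def conv2d(X, W):
    """Naive 2D 'convolution' (i.e. cross-correlation, as used in deep learning)."""
    kh, kw = W.shape
    out_h, out_w = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    O = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            O[i, j] = np.sum(X[i:i + kh, j:j + kw] * W)
    return O

X = np.arange(1, 10).reshape(3, 3)   # x_1, ..., x_9
W = np.array([[1, 2], [3, 4]])       # w_1, ..., w_4
print(conv2d(X, W))
# [[37. 47.]
#  [67. 77.]]
```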


In general, an image may have multiple channels (e.g., RGB), so we have to extend the definition of convolution further. In this case, we define a kernel for each input channel: if the number of input channels is $C > 1$, the kernel is a tensor of shape $C \times H \times W$.

image


Mathematically, we compute the feature map by convolving channel $c$ of the input with kernel $\mathbf{W}_{:, :, c}$ and summing over channels:

\[\begin{align*} o_{i, j} = b + \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} \sum_{c=0}^{C-1} x_{i + u, j + v, c} w_{u, v, c} \end{align*}\]

where $b$ is the bias term. The following figure visualizes the operation:

image
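To see these shapes concretely, here is a hedged PyTorch sketch (assuming PyTorch is available; the variable names are mine): a single output channel over a $C$-channel input uses one kernel of shape $C \times H \times W$ plus one bias $b$.

```python
import torch
import torch.nn as nn

C, H_in, W_in = 3, 8, 8                    # e.g. a small RGB input
conv = nn.Conv2d(in_channels=C, out_channels=1, kernel_size=3, bias=True)

x = torch.randn(1, C, H_in, W_in)          # (batch, channels, height, width)
o = conv(x)

print(conv.weight.shape)  # torch.Size([1, 3, 3, 3]) -> one C x H x W kernel
print(conv.bias.shape)    # torch.Size([1])          -> one bias b
print(o.shape)            # torch.Size([1, 1, 6, 6]) -> one feature map
```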


Because the convolution operation is a cascade of inner products, the output at location $(i, j)$ will be large if the image patch centered on $(i, j)$ is similar to $\mathbf{W}$. We can therefore view the operation as template matching.

If the template $\mathbf{W}$ corresponds to an oriented edge, then convolving with it will cause the output heat map to “light up” in regions that contain edges matching that orientation, as shown in the following figure: convolving a 2D image (left) with a $3 \times 3$ filter (middle) produces a 2D response map (right). The bright spots of the response map correspond to locations in the image containing diagonal lines sloping down and to the right.

image


So convolution performs a form of feature detection, and that is why the resulting output $\mathbf{O} = \mathbf{W} \circledast \mathbf{X}$ is called a feature map.

Designing a filter amounts to specifying what we want to detect, and many kernels have been designed to detect various kinds of features [3]. In traditional computer vision, people designed these filters by hand, which is neither optimal nor scalable. In the era of deep learning, meaningful filters instead emerge by learning from a large number of images so as to optimize the objective. For example, the figure below shows filters learned by Krizhevsky et al.

image


In many applications, we stack convolution layers deeply, producing feature maps of feature maps. Each feature map acts like a latent, abstract image to which another convolution layer can be applied. In later posts we will see some standard deep architectures of convolutional neural networks that gave rise to major breakthroughs in computer vision.

image



Stride

Stride is the number of pixels (or features) we move to reach the next receptive field. It has the effect of dimensionality reduction: a larger stride produces a smaller feature map. Moreover, neighboring outputs tend to be very similar (highly correlated) in value, since their receptive fields overlap; strided convolution also reduces this redundancy.

image


Padding

To preserve information at the corners and edges, or to let a layer have the same height and width as the previous layer (together with a proper stride), it is common to add zeros around the inputs. This procedure is called zero padding.

image


There are many different types of padding:

Valid padding doesn’t use zero padding and ignores some rows and columns at the bottom and right of the input image, depending on the stride. This means that every neuron’s receptive field lies strictly within “valid” positions inside the input (it doesn’t go out of bounds), hence the name valid.

image


Same padding uses zero padding if necessary. The output size is set to the number of input neurons divided by the stride, rounded up.
For example, if the input size is $13$ and the stride is $5$ (see the figure below), then the output size is $3$ (i.e., $13 / 5 = 2.6$, rounded up to $3$.) Then zeros are added as evenly as possible around the inputs, as needed.

image


When the stride is $1$, the layer’s outputs will have the same spatial dimensions (width and height) as its inputs, hence the name same.
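As a small Python sketch of this output-size arithmetic (my own helpers; the kernel size $6$ below is just an illustrative value, since the figure above only fixes the input size $13$ and stride $5$):

```python
import math

def valid_output_size(n_in, kernel, stride):
    # "Valid" padding: no zeros added; trailing inputs may be ignored.
    return (n_in - kernel) // stride + 1

def same_output_size_and_padding(n_in, kernel, stride):
    # "Same" padding: output size = ceil(n_in / stride),
    # with zeros split as evenly as possible between the two sides.
    n_out = math.ceil(n_in / stride)
    total_pad = max((n_out - 1) * stride + kernel - n_in, 0)
    return n_out, total_pad // 2, total_pad - total_pad // 2

print(valid_output_size(13, kernel=6, stride=5))             # 2
print(same_output_size_and_padding(13, kernel=6, stride=5))  # (3, 1, 2)
```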


Pooling Layer

Consider a convolutional layer with $5 \times 5$ filters, outputting $200$ feature maps of size $150 \times 100$, with stride $1$ and “same” padding.

image


If the input is a $150 \times 100$ RGB image ($3$ channels), then the number of parameters is $(5 \times 5 \times 3 + 1) \times 200 = 15200$ (the $+1$ corresponds to the bias). This is fairly small compared to a fully connected layer, which would need $150^2 \times 100^2 \times 3 = 675\text{M}$ parameters per feature map.

Computation
Each of the $200$ feature maps contains $150 \times 100$ neurons, and each of these neurons needs to compute a weighted sum of its $5 \times 5 \times 3 = 75$ inputs: that’s a total of $225 \text{M}$ float multiplications.

Memory
If using 32-bit floats, then the convolution layer’s output will occupy $200 \times 150 \times 100 \times 32 = 96 \text{M}$ bits ($12 \text{MB}$). And that’s just for one instance: if a training batch contains 100 instances, then this layer will use up $1.2 \text{GB}$ of memory!
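The arithmetic above, reproduced as a quick plain-Python check (my own sketch):

```python
filters, fh, fw, in_ch = 200, 5, 5, 3   # 200 filters of size 5x5 over 3 channels
out_h, out_w = 150, 100                 # "same" padding, stride 1

params = (fh * fw * in_ch + 1) * filters             # weights + biases
mults  = out_h * out_w * filters * (fh * fw * in_ch)
bits   = filters * out_h * out_w * 32                # 32-bit floats

print(params)                # 15200
print(mults / 1e6)           # 225.0 (million multiplications)
print(bits / 8 / 1e6)        # 12.0  (MB per instance)
print(100 * bits / 8 / 1e9)  # 1.2   (GB for a batch of 100 instances)
```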

In order to reduce the computational load, the memory usage, and the number of parameters, we can subsample the inputs. Max pooling and average pooling are the most popular ways to do this. Pooling also filters out noisy activations in a lower layer by abstracting the activations in a receptive field into a single representative value.

image


  • Max Pooling
    • Extract the maximum value in the receptive field of the previous layer
    • Noise suppression
  • Average Pooling
    • Extract the average value of the receptive field.

image


In practice, max pooling tends to work better than average pooling. One of the most important properties of max pooling is translation invariance. Invariance means that the output stays the same even when the input changes slightly. Max pooling provides invariance to small translations (shifts).
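A tiny 1-D sketch of this invariance (my own illustration, with a hypothetical `max_pool_1d` helper): shifting the activations by one pixel, while they stay inside their pooling windows, leaves the pooled output unchanged.

```python
import numpy as np

def max_pool_1d(x, size=2, stride=2):
    """Max pooling over non-overlapping windows (size = stride = 2)."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

x         = np.array([0, 3, 0, 0, 0, 5, 0, 0])
x_shifted = np.array([3, 0, 0, 0, 5, 0, 0, 0])  # same pattern shifted by one pixel

print(max_pool_1d(x))          # [3 0 5 0]
print(max_pool_1d(x_shifted))  # [3 0 5 0] -> identical despite the shift
```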

image


Receptive Field

Note that each image patch generates a single feature in the output of a convolution. The receptive field of a feature, in the deep learning context, is defined as the area in the input that produces that feature (where the “input” can mean either the previous layer or the original image). In other words, it measures the spatial extent of local connectivity, which for a single layer equals the filter size.

image


Then why should we care about the receptive field? Today, receptive field computations are required in a variety of applications. Take dense prediction tasks as an example: in semantic segmentation, the model should predict a label for each pixel of the input image. Ideally, we want each output unit of the model to have a large enough receptive field to ensure that no relevant context is missed; a small receptive field may fail to recognize large objects.

To calculate the size of the receptive field, assume for simplicity a single-path convolutional neural network (with no padding) that has no skip connections (as in ResNet) and no multiple branches (as in an Inception module). Let $r_\ell$ be the receptive field size of the final output feature map $f_L$ with respect to feature map $f_\ell$. Then $r_L = 1$.

We will derive a recurrence formula between the $(\ell - 1)$-th and $\ell$-th layers. Let $k_\ell$ and $s_\ell$ be the kernel size and stride of the $\ell$-th convolution operation, respectively. First, $r_{L - 1} = k_L$. More generally, the receptive field grows according to

\[\begin{aligned} r_{\ell - 1} = s_\ell \cdot r_\ell + (k_\ell - s_\ell) \end{aligned}\]

as the figure below illustrates.

image


For more detail, please refer to this post: https://www.baeldung.com/cs/cnn-receptive-field-size
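A minimal sketch of this recurrence (my own code, assuming a single-path network described by a list of `(kernel_size, stride)` pairs):

```python
def receptive_field(layers):
    """layers: [(kernel_size, stride), ...] from the first to the last conv layer."""
    r = 1                        # r_L = 1 at the final feature map
    for k, s in reversed(layers):
        r = s * r + (k - s)      # r_{l-1} = s_l * r_l + (k_l - s_l)
    return r

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3x3 convs, stride 1
print(receptive_field([(7, 2), (3, 2)]))          # 11: 7x7/s2 conv then 3x3/s2 conv
```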

Summary

In summary,

  • Input size $W_1 \times H_1 \times D_1$
  • Four hyperparameters for convolutional layer:
    • Number of filters $K$
    • Filter size $F$
    • Stride $S$
    • Padding $P$
  • Produces an output of size $W_2 \times H_2 \times D_2$ (see the code sketch below):
    • $W_2 = (W_1 - F + 2P)/S + 1$
    • $H_2 = (H_1 - F + 2P)/S + 1$
    • $D_2 = K$
  • With parameter sharing, a convolution layer has $F \times F \times D_1$ weights per filter:
    • $(F \times F \times D_1) \times K$ weights
    • $K$ biases
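A compact sketch of these formulas (plain Python; the function name and the example values are mine):

```python
def conv_layer_summary(W1, H1, D1, K, F, S, P):
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    n_params = (F * F * D1) * K + K   # weights + biases
    return (W2, H2, D2), n_params

# Example: 32x32x3 input, 10 filters of size 5, stride 1, padding 2 ("same")
print(conv_layer_summary(32, 32, 3, K=10, F=5, S=1, P=2))
# ((32, 32, 10), 760)
```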

From MLP to CNN

To summarize how a CNN differs from an MLP:

  • Locality: there is no full connection to the input (an MLP uses all input dimensions to compute one hidden unit). Instead, there is a local 2D parameter matrix, called a filter or a kernel (the weights).

  • Parameter sharing: hidden units are computed by applying the same filter across different local input regions.

image


Backpropagation of CNN

In order to add convolution layers to our deep learning framework, it is crucial to implement the backward pass of the CNN.

Convolution as matrix-vector multiplication

Since convolution is a linear operator, it can be represented as a matrix multiplication. Take the following convolution operation as an example:

\[\begin{align*} \mathbf{O} &= \begin{pmatrix} w_1 & w_2 \\ w_3 & w_4 \\ \end{pmatrix} \circledast \begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \\ \end{pmatrix} \\ &= \begin{pmatrix} w_1 x_1 + w_2 x_2 + w_3 x_4 + w_4 x_5 & w_1 x_2 + w_2 x_3 + w_3 x_5 + w_4 x_6 \\ w_1 x_4 + w_2 x_5 + w_3 x_7 + w_4 x_8 & w_1 x_5 + w_2 x_6 + w_3 x_8 + w_4 x_9 \\ \end{pmatrix} \end{align*}\]


Then, we may rewrite this to matrix multiplication by flattening the input matrix into vector $\mathbf{x}$ and multiplying by modified filter matrix $\mathbf{C}$:

\[\begin{aligned} \boldsymbol{O} & =\mathbf{C} \boldsymbol{x}=\left(\begin{array}{ccc|ccc|ccc} w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 & 0 \\ 0 & w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 \\ 0 & 0 & 0 & w_1 & w_2 & 0 & w_3 & w_4 & 0 \\ 0 & 0 & 0 & 0 & w_1 & w_2 & 0 & w_3 & w_4 \end{array}\right)\left(\begin{array}{l} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{array}\right) \\ & =\left(\begin{array}{l} w_1 x_1+w_2 x_2+w_3 x_4+w_4 x_5 \\ w_1 x_2+w_2 x_3+w_3 x_5+w_4 x_6 \\ w_1 x_4+w_2 x_5+w_3 x_7+w_4 x_8 \\ w_1 x_5+w_2 x_6+w_3 x_8+w_4 x_9 \end{array}\right) \end{aligned}\]

Finally, we can reshape this output back into the $2 \times 2$ feature map matrix.
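A hedged NumPy sketch (the helper `conv_matrix` is mine) that builds this matrix $\mathbf{C}$ for an arbitrary kernel and checks it against the sliding-window result from earlier:

```python
import numpy as np

def conv_matrix(W, in_h, in_w):
    """Build C such that C @ x.flatten() equals the (stride-1, unpadded) convolution."""
    kh, kw = W.shape
    out_h, out_w = in_h - kh + 1, in_w - kw + 1
    C = np.zeros((out_h * out_w, in_h * in_w))
    for i in range(out_h):
        for j in range(out_w):
            for u in range(kh):
                for v in range(kw):
                    C[i * out_w + j, (i + u) * in_w + (j + v)] = W[u, v]
    return C

X = np.arange(1, 10, dtype=float).reshape(3, 3)
W = np.array([[1.0, 2.0], [3.0, 4.0]])
C = conv_matrix(W, 3, 3)

print((C @ X.flatten()).reshape(2, 2))   # recover the 2 x 2 feature map
# [[37. 47.]
#  [67. 77.]]
```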

Such sparsity and weight sharing in the kernel matrix explain why a CNN is translation equivariant and far more efficient than an MLP in terms of the number of parameters.

Backward pass

Denote the loss function by $L$. Then our interest is in $\frac{\partial L}{\partial \mathbf{W}}$ and $\frac{\partial L}{\partial \mathbf{X}}$. Note that $\frac{\partial L}{\partial \mathbf{O}} \in \mathbb{R}^{H^\prime \times W^\prime}$ if $\mathbf{O} \in \mathbb{R}^{H^\prime \times W^\prime}$.

First, the backward pass $\frac{\partial L}{\partial \mathbf{X}}$ follows directly from matrix calculus: $\frac{\partial L}{\partial \mathbf{X}} = \mathbf{C}^\top \frac{\partial L}{\partial \mathbf{O}}$ (with $\mathbf{X}$ and $\mathbf{O}$ flattened into vectors). In other words,

\[\begin{aligned} \frac{\partial L}{\partial \mathbf{X}} = \text{TransposedConvolution}(\mathbf{W}, \frac{\partial L}{\partial \mathbf{O}}) \end{aligned}\]

where $\mathbf{W}$ is the kernel of the operation. For the other gradient, we first apply the chain rule:

\[\begin{aligned} \frac{\partial L}{\partial W_{k \ell}} = \sum_{i=0}^{H^\prime - 1} \sum_{j=0}^{W^\prime - 1} \frac{\partial L}{\partial O_{ij}} \cdot \frac{\partial O_{ij}}{\partial W_{k \ell}} \end{aligned}\]

for each pixel of weight $W_{k \ell}$. And from the formula of convolution layer:

\[\begin{aligned} O_{ij} = \sum_{k=0}^{H_\text{kernel} - 1} \sum_{\ell=0}^{W_\text{kernel} - 1} X_{si + k, sj + \ell} \cdot W_{k \ell} \end{aligned}\]

where $s$ is the stride, which we assume is the same for height and width for simplicity; the bias term is omitted, but the derivation extends to the general case trivially. Then it is clear that $\frac{\partial O_{ij}}{\partial W_{k \ell}} = X_{si + k, sj + \ell}$. From this,

\[\begin{aligned} \frac{\partial L}{\partial W_{k \ell}} &= \sum_{i=0}^{H^\prime - 1} \sum_{j=0}^{W^\prime - 1} X_{si + k, sj + \ell} \cdot \frac{\partial L}{\partial O_{ij}} \\ &= \sum_{i=0}^{H^\prime - 1} \sum_{j=0}^{W^\prime - 1} X_{k + si, \ell + sj} \cdot \frac{\partial L}{\partial O_{ij}} \end{aligned}\]

Can you see the similarity to the convolution formula? Indeed,

\[\begin{aligned} \frac{\partial L}{\partial \mathbf{W}} = \text{Convolution}(\frac{\partial L}{\partial \mathbf{O}}, \mathbf{X}, \text{stride} = 1, \text{dilation} = s) \end{aligned}\]

where $s$ is the stride of the original convolution and $\frac{\partial L}{\partial \mathbf{O}}$ serves as the kernel of the operation. Although we didn’t consider the padded case, the logic remains valid. Note that the output $y_i$ of the atrous (dilated) convolution of a 1D input $\mathbf{x}$ with a filter $\mathbf{w}$ of length $K$ is defined as

\[\begin{aligned} \mathbf{y}_i = \sum_{k=1}^K \mathbf{x}_{i + s \cdot k} \cdot \mathbf{w}_k \end{aligned}\]

where the parameter $s$ corresponds to the stride with which we sample the input signal $\mathbf{x}$. (Standard convolution is a special case for rate $s = 1$.)
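Both backward-pass identities can be sanity-checked numerically against autograd. This is a hedged PyTorch sketch (assuming PyTorch; one input channel, one output channel, stride $2$, no padding, and an input size chosen so no output padding is needed):

```python
import torch
import torch.nn.functional as F

s = 2                                         # stride of the forward convolution
X = torch.randn(1, 1, 7, 7, requires_grad=True)
W = torch.randn(1, 1, 3, 3, requires_grad=True)

O = F.conv2d(X, W, stride=s)                  # forward pass: O has shape (1, 1, 3, 3)
G = torch.randn_like(O)                       # stand-in for dL/dO
O.backward(G)                                 # autograd fills X.grad and W.grad

# dL/dX = transposed convolution of dL/dO with the same kernel W
dX = F.conv_transpose2d(G, W, stride=s)
print(torch.allclose(dX, X.grad, atol=1e-5))  # True

# dL/dW = convolution of X with kernel dL/dO, stride 1, dilation s
dW = F.conv2d(X, G, stride=1, dilation=s)
print(torch.allclose(dW, W.grad, atol=1e-5))  # True
```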

Reference

[1] Kevin P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press 2022.
[2] CS231n Convolutional Neural Networks for Visual Recognition
[3] Wikipedia, Kernel (image processing)
[4] Understanding the receptive field of deep convolutional networks by Nikolas Adaloglou
