
In today’s post, we’re going to discuss some convolutional neural network (CNN) architectures that have been used for image classification. Through these architectures we want to learn how to improve the efficiency of a CNN and which key factors enhance its performance. At the very end, we’ll also cover some training tips for CNNs.

  • Case study: CNN architectures for image classification

This figure shows classification performance on the ImageNet benchmark, one of the most standard and popular image classification datasets, whose challenge covers roughly 1000 classes. Here we plot the error rates of this challenge over the years. Since 2012 there has been a breakthrough, and performance has kept improving since then.

image

You can see that the first breakthrough happened in 2012, when the first convolutional neural network (AlexNet) was applied to this benchmark. After that, more and more architectures came out. One notable change across these architectures is the depth, i.e., the number of layers. AlexNet started with 8 layers, and over the years people tried deeper architectures with more parameters, mainly thanks to accumulated know-how on how to train and design neural networks, as well as increased computational power. We’re going to learn these architectures one by one today.

AlexNet [Krizhevsky et al. 2012]

We’re going to start with AlexNet, which was the first CNN that accelerated the deep learning era in vision and also in other domains.

It’s basically an 8-layer CNN where the first 5 layers are convolutional layers and the last 3 layers are fully connected layers. We usually visualize AlexNet as follows.

image

One thing I want to mention about the figure is that it can be a little confusing at first glance because there seem to be two branches. This is because GPUs at that time had very limited memory compared to today’s. To overcome this computing limitation, the authors decomposed the network into two parts and trained it on two GPUs. Conceptually, however, it acts as a single branch.

For instance, the first layer seems to have two convolution layers with 48 channels each, but it’s actually just one convolution layer with 96 channels.

Architecture

The rough specification of the AlexNet is as follows.

  • $[227 \times 227 \times 3]$ $\text{Input}$
  • $[55 \times 55 \times 96]$ $\text{Conv}1(96, \text{ kernel }= 11 \times 11, \text{ stride } = 4, \text{ pad } = 0) + \text{ ReLU }$ $\to 35\text{K}$
  • $[27 \times 27 \times 96]$ $\text{MaxPool}1(\text{ kernel }= 3 \times 3, \text{ stride } = 2)$
  • $[27 \times 27 \times 96]$ $\text{Norm}1$
  • $[27 \times 27 \times 256]$ $\text{Conv}2(256, \text{ kernel }= 5 \times 5, \text{ stride } = 1, \text{ pad } = 2) + \text{ ReLU }$ $\to 614\text{K}$
  • $[13 \times 13 \times 256]$ $\text{MaxPool}2(\text{ kernel }= 3 \times 3, \text{ stride } = 2)$
  • $[13 \times 13 \times 256]$ $\text{Norm}2$
  • $[13 \times 13 \times 384]$ $\text{Conv}3(384, \text{ kernel }= 3 \times 3, \text{ stride } = 1, \text{ pad } = 1) + \text{ ReLU }$ $\to 885\text{K}$
  • $[13 \times 13 \times 384]$ $\text{Conv}4(384, \text{ kernel }= 3 \times 3, \text{ stride } = 1, \text{ pad } = 1) + \text{ ReLU }$ $\to 1327\text{K}$
  • $[13 \times 13 \times 256]$ $\text{Conv}5(256, \text{ kernel }= 3 \times 3, \text{ stride } = 1, \text{ pad } = 1) + \text{ ReLU }$ $\to 885\text{K}$
  • $[6 \times 6 \times 256]$ $\text{MaxPool}3(\text{ kernel }= 3 \times 3, \text{ stride } = 2)$
  • $[4096]$ $\text{FC}6(9216 \times 4096)$ $\to 37749\text{K}$
  • $[4096]$ $\text{Dropout}(p = 0.5)$
  • $[4096]$ $\text{FC}7(4096 \times 4096)$ $\to 16777\text{K}$
  • $[4096]$ $\text{Dropout}(p = 0.5)$
  • $[1000]$ $\text{FC}8(4096 \times 1000)$ $\to 4096\text{K}$
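
To make the specification concrete, here is a minimal PyTorch sketch of the single-branch version of AlexNet following the layer list above. It is only a sketch: the local response normalization layers are omitted, and the class and variable names are illustrative, not the authors’ code.

```python
import torch
import torch.nn as nn

# Minimal single-branch AlexNet sketch following the specification above
# (local response normalization omitted for brevity).
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0), nn.ReLU(inplace=True),   # 227 -> 55
            nn.MaxPool2d(kernel_size=3, stride=2),                                          # 55 -> 27
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(inplace=True),  # 27 -> 27
            nn.MaxPool2d(kernel_size=3, stride=2),                                          # 27 -> 13
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                                          # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Linear(6 * 6 * 256, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# Sanity check on the expected input size.
out = AlexNet()(torch.randn(1, 3, 227, 227))
print(out.shape)  # torch.Size([1, 1000])
```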

It was the first CNN that applied the ReLU function as a nonlinearity. Before that, the more popular choices were the sigmoid or tanh, which are saturating functions that cause slow optimization. It turned out that choosing ReLU was one of the key decisions for designing this large, deep convolutional neural network, and it was one of the main contributions of this work.

Another noticeable design choice is that it applies a convolution with a very large stride and kernel size at the very first layer, for practical reasons. To fit the large network into the weak GPUs of the time, they had to reduce the feature dimension dramatically at the very first layer. Along with the large stride, they also had to increase the kernel size in order to cover all the pixels of the image.

They also applied normalization layers (local response normalization) to make the optimization more efficient, in other words, to make the gradients more stable. We will skip the details of this normalization layer because it is not used anymore.

Lastly, the reason they applied dropout only at the last fully connected layers is that most of the network’s parameters are concentrated in these layers, so they can easily overfit the data.

ZFNet

The authors spent several years designing the AlexNet architecture and tuning its hyperparameters. After that, people figured out that CNNs are a very promising tool for this kind of large image dataset and tried to tune the architecture further. The winner of the challenge in the next year (2013) was also based on a CNN, called ZFNet.

ZFNet is very similar to AlexNet: it is usually referred to as a better-tuned version of AlexNet’s hyperparameters. This is just a rough illustration of the network.

image

You can see that it is essentially AlexNet with slightly different hyperparameters. In the first convolution layer of AlexNet, the kernel size was 11 by 11 with stride 4, but the authors changed this to a 7 by 7 kernel with stride 2. Hence, ZFNet looks at pixels in a more fine-grained manner, dropping less information at the first layer. They also changed the specification of the later convolution layers (384, 384, 256 filters to 512, 1024, 512).

All they did was hyperparameter tuning, but the improvement was significant: almost 5% in terms of error rate.

VGGNet

The next winner was VGGNet, one of the most popular and standard CNN architectures. The main idea behind VGGNet was to build a deeper network, in other words to stack more layers.

Stacking more layers, however, increases the number of parameters, which usually causes overfitting and inefficiency relative to the given dataset. Hence, we need to be a bit careful about how to design a deeper network without a redundant number of parameters.

The authors instead found a simple principle for designing the convolutional network. They unified the kernel size of all convolutional layers to 3 by 3 and simply doubled the channel size at every convolutional block, with blocks separated by pooling layers. With this principle they proposed two variants of the architecture: VGG16 and VGG19.

image
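
As a rough sketch of this design rule (all convolutions 3 × 3, channels doubled after each pooling stage), here is how the convolutional part of VGG16 could be built in PyTorch. The config list and helper name are illustrative, not the authors’ code.

```python
import torch.nn as nn

# Sketch of the VGG design rule: only 3x3 convolutions, channels doubled
# after each max-pooling stage ('M'). The config below is the VGG16 layout.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_vgg_features(VGG16_CFG)  # 224x224x3 -> 7x7x512
```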

Then why do we get improved performance with deeper networks? Having a deeper network means that we are injecting more nonlinearities, because every nonlinear activation applies a nonlinear transformation in feature space.

This is critical for learning a linearly separable feature space. At the last layer, we essentially apply a linear classifier, which means that the features at the second-to-last layer should be linearly separable across classes. To turn highly complex pixels into vectors that are linearly separable over different classes, we need a highly nonlinear transformation. That’s why having more nonlinear functions was critical to improving performance.

Another contribution was that it made the principle of designing deeper convolutional networks much simpler: it shows that we don’t have to play with the kernel size too much.

GoogLeNet [Szegedy et al. 2014]

After the deeper VGGNet, many researchers tried to stack even more layers more efficiently in terms of parametrization. The main motivation behind GoogLeNet was to design a deeper and more efficient architecture that is also a bit more accurate. GoogLeNet actually uses $12 \times$ fewer parameters than the winning architecture of Krizhevsky et al. from two years earlier, while being significantly more accurate.

This is a rough figure of GoogLeNet. It looks much deeper than VGGNet. At the same time, one may notice something unusual in this architecture: it has parallel branches even within a single layer.

image

Motivation

The most straightforward way of improving the performance of deep neural networks is to increase their size, i.e., the depth and the width. But this simple solution comes with two major drawbacks.

  1. prone to overfitting
  2. increased need for computational resources $\to$ If the added capacity is used inefficiently, a lot of computation is wasted.

One possible solution is replacing fully connected architectures with sparsely connected ones. Not only can this mimic biological systems, it can also take advantage of earlier theoretical work such as Arora et al.

The problem setting of that paper is as follows: draw a random ground-truth network. By choosing some random input samples and feeding them to this model, we obtain outputs for them. Given these outputs, can we efficiently learn a network that is probabilistically and statistically equivalent to the ground-truth network? In short, the paper says YES: if the distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed and learned layer by layer by leveraging the structure of the random graphs between layers of the network (analyzing the correlation statistics and clustering neurons with highly correlated outputs, which corresponds to Hebbian theory: cells that fire together wire together). For more detail, this article summarizes the paper well.

But the computing resources of the day were inefficient on non-uniform sparse structures, and exploiting them would require more sophisticated engineering. Hence, we need an architecture that makes use of the extra sparsity but is still able to exploit current hardware.

Network In Network (NIN)

Convolutional neural networks (CNNs) usually alternate convolutional and pooling operations. A convolutional layer takes the inner product of a linear filter and the underlying receptive field, followed by a nonlinear activation function, at every local patch of the input. Thus, the convolution filter in a CNN is a generalized linear model (GLM) for the underlying image patch: the filter performs a linear dot product with the data patch, and the nonlinear activation corresponds to the link function of the GLM.

image

But the level of abstraction achievable with a GLM is low. A linear model works well when the latent space is linearly separable, and a GLM inherits this limitation since its link function is restricted to be monotone (so that it is invertible). In practice, however, the data often lie on a complex, nonlinear manifold.

To replace the GLM, the authors proposed a general nonlinear function approximator called a micro network on top of the convolutional layer, shared across all receptive fields. In the paper, an MLP is selected as the approximator, which is a universal approximator and trainable via backpropagation. Hence, the MLP nonlinearly maps each receptive field to an output feature, and the feature map is obtained by sliding this micro network over the input in a convolutional manner.

  • Why is the MLP chosen as the micro network?
    • It is compatible with the structure of CNNs (trainable by backpropagation).
    • It can be a deep model itself, which is consistent with the spirit of feature re-use.

image image

image

Note that the operation of the MLP micro network is equivalent to cascaded cross-channel parametric pooling (CCCP) on a normal convolution layer. The usual pooling layers we use, like max pooling and global average pooling, preserve the number of channels between input and output since they operate within each channel. In contrast, cross-channel pooling operates across channels, reducing the number of channels to 1. Additionally, each layer of the MLP can be interpreted as a parametrized pooling: it performs a weighted linear recombination of the input feature maps, followed by ReLU, and the cross-channel pooled feature maps are cross-channel pooled again in the next layer. In other words, the micro network is interpretable as cascaded cross-channel parametric pooling (CCCP), which allows complex and learnable interactions of cross-channel information: it performs linear combinations of the values at the same spatial position in different channels to integrate the information of different channels.

And cross-channel parametric pooling, as in the figure below, is in turn equivalent to a $1 \times 1$ convolutional layer. This fact leads to the concept of the $1 \times 1$ convolution kernel and motivated various later studies, including GoogLeNet.

image
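
Putting the pieces together, an NIN “mlpconv” layer can be sketched as a normal convolution followed by $1 \times 1$ convolutions, which play the role of the shared micro MLP / CCCP described above. This is only a minimal sketch with illustrative channel sizes, not the paper’s exact configuration.

```python
import torch.nn as nn

# Sketch of an NIN "mlpconv" layer: a spatial convolution followed by
# 1x1 convolutions, which implement the shared micro MLP (CCCP) per location.
def mlpconv(in_ch, mid_ch, out_ch, kernel_size, padding=0):
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size, padding=padding), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),  # cross-channel pooling 1
        nn.Conv2d(mid_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),  # cross-channel pooling 2
    )
```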

Also, the authors employed a global average pooling layer to output classification confidences from the final feature map, instead of vectorizing it and passing it through a traditional fully connected classifier. The fully connected layers at the end act as black boxes, making it hard to interpret how category-level information is passed back to the convolutional features, and they are prone to overfitting, which hurts performance. In contrast, global average pooling is much more intuitive and interpretable since it enforces correspondence between feature maps and categories, which is possible thanks to the stronger local modeling of the micro network. Furthermore, global average pooling is itself a structural regularizer that explicitly enforces feature maps to be confidence maps of categories, which natively prevents overfitting for the overall structure.

In summary, the overall structure is as follows:

image

Inception module

Surprisingly, GoogLeNet is deeper than VGGNet but has far fewer parameters, even compared to AlexNet. The key to GoogLeNet is the Inception module, inspired by the NIN (Network In Network) structure. The main flow of the Inception idea is this: following the work of Arora et al., we want to find an optimal local sparse structure in a CNN architecture. However, as mentioned, current hardware is optimized for dense computation. So the authors try to find an optimal local sparse construction that can be approximated and covered by readily available dense components, and repeat it spatially, on the basis of Arora et al.

As a result, GoogLeNet stacks blocks called Inception modules, each composed of parallel convolution branches. From the input, it applies convolutions with different kernel sizes at the same time, and then concatenates the outputs of all these individual convolutional layers.

image

It seems they were motivated by NIN to make the inside of the module dense. Furthermore, following the idea of Arora et al., they assumed that each group of highly correlated units from the earlier layer corresponds to some region of the input image. In the lower layers close to the input image, most units would concentrate in small local regions, and it is sufficient to cover them with $1 \times 1$ convolutions as in NIN; but there will also be larger features (clusters) spatially spread out that require larger patches to cover them.

Hence, the Inception module combines filters of different sizes: $1 \times 1$, $3 \times 3$, $5 \times 5$ (this decision was based more on convenience than necessity), plus max pooling with stride 1 (since previous successes suggested that pooling is useful). With the $1 \times 1$ convolution it basically processes each pixel independently, and with the $5 \times 5$ convolution it observes a local neighborhood and the patterns inside it.

A possible practical intuition behind these parallel branches is that they simulate multiscale reasoning. Intuitively, humans look at an image at multiple scales. A professor in class sees the classroom as a whole and also focuses on the faces of individual students: at a larger scale he can grasp the structure of the classroom or count the participants, while with the same observation he can focus on local areas and understand local patterns more closely.
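
A minimal PyTorch sketch of such a naive Inception module follows. The filter counts are chosen to match the $28 \times 28 \times 256$ example used in the closer inspection below, not the exact GoogLeNet configuration.

```python
import torch
import torch.nn as nn

# Sketch of a *naive* Inception module: parallel 1x1 / 3x3 / 5x5 convolutions
# and a stride-1 max pooling, concatenated along the channel dimension.
class NaiveInception(nn.Module):
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # All branches keep the spatial size, so they can be concatenated.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

y = NaiveInception(256, 128, 192, 96)(torch.randn(1, 256, 28, 28))
print(y.shape)  # torch.Size([1, 672, 28, 28]) = 128 + 192 + 96 + 256 channels
```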

  • $1 \times 1$ convolution

Consider a $W \times H \times C$ feature and $K$ convolution filters of size $1 \times 1 \times C$. The point of applying a $1 \times 1$ convolution is projection. Although all pixels are processed independently since the kernel size is $1$, it projects the $C$-dimensional feature of each input pixel to a $K$-dimensional feature in the output. In other words, it projects each pixel’s features into another feature dimension. If $K$ is much smaller than $C$, it can be used as a feature bottleneck that reduces the feature size dramatically.

image
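
A tiny sketch of this projection view; the sizes 256 and 64 are only illustrative.

```python
import torch
import torch.nn as nn

# A 1x1 convolution projects each pixel's C-dim feature to K dims,
# leaving the spatial size untouched.
x = torch.randn(1, 256, 28, 28)           # H x W feature map with C = 256 channels
proj = nn.Conv2d(256, 64, kernel_size=1)  # K = 64 < C: feature bottleneck
print(proj(x).shape)                      # torch.Size([1, 64, 28, 28])
```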

  • Closer inspection

This is the specification of one block of the Inception module.

image

Note that all the branches, including max pooling, keep the spatial feature size unchanged, so their outputs can be concatenated along the channel dimension. For example, if we feed a $28 \times 28 \times 256$ input feature into this Inception module, the output size of the module is computed as follows.

image

The number of parameters (without bias parameters) can be computed as follows.

image

One problem with this module is that it applies multiple convolutions, so even for a single layer it produces a very large feature dimension. This increases the parameter count in the upper layers and the memory requirements. Hence, to control this explosion of parameters, we have to inject a bottleneck. If we apply a bottleneck before the expensive convolutions in the module, we can reduce the number of parameters dramatically.

The bottleneck can be implemented with a $1 \times 1$ convolution for dimensionality reduction, as mentioned earlier. This is based on the success of embeddings: even low-dimensional embeddings may contain a lot of information about a relatively large image patch. A $1 \times 1$ convolution keeps the representation sparse and does not change the receptive field, but it changes the number of channels. It simply factorizes the weight matrix, e.g. $256 \times 192$ into $256 \times 64$ and $64 \times 192$.

image

The reduced feature sizes and parameter counts can then be computed as follows. Thanks to this factorization, the Inception module with bottlenecks has a significantly reduced parameter count of about 330K, while the naive module has about 1090K parameters.

image
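
As a sanity check of these figures, the bias-free parameter counts can be reproduced with a few lines of arithmetic, assuming 64-dimensional $1 \times 1$ bottlenecks in front of the $3 \times 3$ and $5 \times 5$ branches and ignoring any pool-projection parameters (an assumption of this sketch, since the exact figure configuration is not spelled out here).

```python
# Rough parameter count (no biases) for the 28x28x256 example above,
# assuming 64-dim 1x1 bottlenecks before the 3x3 and 5x5 branches
# and not counting a pool-projection layer.
C = 256
naive = 1*1*C*128 + 3*3*C*192 + 5*5*C*96
bottle = 1*1*C*128 + (1*1*C*64 + 3*3*64*192) + (1*1*C*64 + 5*5*64*96)
print(naive)   # 1089536  ~ 1090K
print(bottle)  #  329728  ~  330K
```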

Architecture

image

In GoogLeNet, the standard convolution layers come first and then the Inception modules are stacked. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in a traditional convolutional fashion. Between Inception modules, max pooling can be applied to reduce the spatial size of the features.

After the last convolution layer, instead of vectorizing the convolutional feature map as the input to a fully connected layer, it applies global average pooling to reduce the number of parameters. If the final feature map has size $S \times S \times C$, global average pooling reduces it to $1 \times 1 \times C$, so a following fully connected layer of dimension $d$ needs only $C \times d$ parameters instead of $S^2 \cdot C \cdot d$.
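
A small comparison of the two heads, using GoogLeNet-like sizes (a $7 \times 7 \times 1024$ final feature map and 1000 classes) purely as an illustration:

```python
import torch
import torch.nn as nn

# Global average pooling vs. flattening into an FC layer.
feat = torch.randn(1, 1024, 7, 7)

gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1024, 1000))
fc_head  = nn.Sequential(nn.Flatten(), nn.Linear(7 * 7 * 1024, 1000))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(gap_head))  # ~1.0M parameters
print(count(fc_head))   # ~50.2M parameters
```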

The last trick they used to train this model is an auxiliary loss that provides strong gradient signals. As the network gets deeper and deeper, it faces optimization problems such as vanishing gradients. What the authors did was attach auxiliary classifiers: they take an intermediate feature and stack another linear classifier on top of it. Obviously this feature will be less discriminative than features after deeper layers, but with a classifier on the intermediate features we can inject gradients into the intermediate layers from a second source. Thus, even if the gradient vanishes before reaching these layers, they have another source of training signal.
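
A sketch of how such auxiliary losses are typically combined with the main loss during training. The 0.3 weighting follows the GoogLeNet paper, while the function and argument names below are only illustrative (the model is assumed to also return the two auxiliary logits during training).

```python
import torch.nn.functional as F

# Total training loss: main classifier loss plus down-weighted auxiliary losses.
# GoogLeNet weights the auxiliary losses by 0.3 during training.
def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets):
    loss = F.cross_entropy(main_logits, targets)
    loss = loss + 0.3 * F.cross_entropy(aux1_logits, targets)
    loss = loss + 0.3 * F.cross_entropy(aux2_logits, targets)
    return loss
```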

Inception-V2 and -V3 [Szegedy et al. 2015]

Thanks to its efficiency and its parameter count, roughly $12 \times$ smaller than AlexNet’s and far below VGGNet’s, GoogLeNet (Inception-V1) is practical to use in a wide range of constrained situations such as limited computing resources or big-data scenarios, including mobile vision settings. However, the complexity of the Inception module makes it hard to adapt the module to new use-cases without losing its efficiency, and even the original paper (Szegedy et al. 2014) did not clearly explain the design decisions behind the GoogLeNet architecture.

To mitigate these issues, Szegedy et al. 2015 explored ways to scale up the network while spending the added computation as efficiently as possible, mainly by factorizing convolutions and applying stronger regularization, guided by the general design principles below.

General Design Principles

Before we start, we should note a few general design principles and optimization ideas (still somewhat speculative) that have proved useful for scaling up CNNs efficiently, based on large-scale experimentation with various architectural choices.

(1) Avoid representational bottlenecks, especially early in the network. In general, the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand.

(2) Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features.

(3) Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power. Here, you can think of “spatial aggregation” as a convolutional layer operation. For example, before performing a more spread-out convolution like $3 \times 3$, one can reduce the dimension of the input with a $1 \times 1$ convolution without expecting adverse side effects. The authors hypothesized that this is owing to the strong correlation between adjacent units: it helps maintain the information during dimension reduction and makes the signal easily compressible before spatial aggregation.

(4) Balance the width and depth of the network. Optimal performance of the network can be reached by balancing the number of filters per stage and the depth of the network.

On the basis of these principles, they explored other options to improve the original inception module proposed in GoogLeNet so as to reduce the computation and increase the efficiency.

Factorizing Convolutions in Inception module

Factorization into smaller convolutions

First of all, they factorized convolutions with large filters into smaller ones. One caution: although a large spatial filter has more parameters and a higher computational cost than a smaller one, it is also more expressive. It has the potential to capture dependencies between signals that are further away in the earlier layers.

To exploit this while reducing cost, the intuition is to view the convolution as a small fully connected network sliding over its patch. For example, consider a $5 \times 5$ convolution applied to a $5 \times 5$ data patch. By factorizing it into a $3 \times 3$ convolution with another $3 \times 3$ layer stacked on top, as in the figure below (so we replace one $5 \times 5$ convolution with two $3 \times 3$ convolution layers), the cost is clearly reduced.

image
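
A quick, bias-free parameter check of this factorization, with an illustrative channel count $C$; it matches the roughly 28% saving noted in the paper.

```python
import torch.nn as nn

# Parameter check (no biases): one 5x5 convolution vs. two stacked 3x3
# convolutions with the same number of channels C.
C = 192  # illustrative channel count
five  = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
three = nn.Sequential(nn.Conv2d(C, C, 3, padding=1, bias=False),
                      nn.Conv2d(C, C, 3, padding=1, bias=False))
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(five), count(three))  # 25*C*C vs 18*C*C -> ~28% fewer parameters
```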

Still, there are some doubts about this replacement.

  1. Does this replacement result in any loss of expressiveness?
  2. Would it not suggest to keep linear activations after the first layer?

For the second question, they found after several experiments that using linear activations was always inferior to using ReLU at all stages of the factorization.

Anyway, the following figure shows the first modification of Inception module.

image

Spatial Factorization into Asymmetric Convolutions

After factorizing $5 \times 5$ convolutions into $3 \times 3$ convolution layers, the next question is straightforward: can we reduce the $3 \times 3$ convolution into even smaller filters? What the authors found is that an $n \times n$ convolution layer can be factorized into a combination of $n \times 1$ and $1 \times n$ convolutions, called asymmetric convolutions.

image

For instance, applying a $3 \times 3$ convolution to a patch can be replaced by first applying a $3 \times 1$ convolution and then a $1 \times 3$ convolution to its output.

But in practice, this does not work well in early layers. It works well on medium grid sizes: on $m \times m$ feature maps with $m = 12, \cdots, 20$. At that level the best experimental results were achieved with $n = 7$. For better understanding, here is a figure of the Inception module after factorization of the $n \times n$ convolutions.

image
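
A minimal sketch of this asymmetric factorization in PyTorch, with $n = 7$ as suggested for medium-sized grids; the channel count is illustrative.

```python
import torch.nn as nn

# Asymmetric factorization: an n x n convolution replaced by an n x 1
# convolution followed by a 1 x n convolution (here n = 7).
C, n = 192, 7
asym = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(n, 1), padding=(n // 2, 0)),
    nn.Conv2d(C, C, kernel_size=(1, n), padding=(0, n // 2)),
)
# Parameter cost scales as 2*n*C*C instead of n*n*C*C for the full n x n filter.
```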

Utility of Auxiliary Classifiers

GoogLeNet introduced two auxiliary classifiers during training, after the Inception 4a and 4d blocks, to push useful gradients to the lower layers, avoid vanishing gradients, and improve convergence. The hypothesis behind auxiliary classifiers is that they help low-level features evolve.

But their experimental results imply that this hypothesis is misplaced:

  1. Training progression of network with and without auxiliary classifiers are nearly identical.
  2. The removal of the lower auxiliary classifier (after Inception 4a block) did not have any side effect on the final result of the network.

Hence, to reduce the extra computation and parameters during training, they removed the auxiliary classifier at the lower layer in the new versions. The authors also claimed that the remaining auxiliary classifier acts as a regularizer, supported by the experimental observation that the main classifier performs better if the auxiliary classifier is batch-normalized or has a dropout layer.

Grid Size Reduction

To reduce the grid size of feature maps, we traditionally perform pooling before convolution. The problem is the trade-off: this introduces a representational bottleneck. One way to mitigate it is to expand the number of filters (channels) by swapping the order of pooling and convolution, but that in turn increases the computational cost of the convolution.

image

The authors proposed another way to kill two birds with one stone: run the two alternatives in parallel and combine them.

image

By applying this idea to the Inception module, it can be modified as follows:

image
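
A rough sketch of such a reduction block: a stride-2 convolution branch and a stride-2 pooling branch run in parallel and are concatenated, so the grid is halved while the channels expand. All sizes below are illustrative, not the exact Inception-v3 numbers.

```python
import torch
import torch.nn as nn

# Grid-size reduction sketch: parallel stride-2 convolution and pooling
# branches, concatenated along the channel dimension.
class ReductionBlock(nn.Module):
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, conv_ch, kernel_size=1),
            nn.Conv2d(conv_ch, conv_ch, kernel_size=3, stride=2),
        )
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x):
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

y = ReductionBlock(320, 320)(torch.randn(1, 320, 35, 35))
print(y.shape)  # torch.Size([1, 640, 17, 17])
```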

Architecture

Finally, we can construct an architecture by connecting the dots from above:

image

The authors set $n = 7$ for the $17 \times 17$ grid input. A detailed model architecture follows:

image

(credit: https://hackmd.io/@machine-learning/SkD5Xd4DL)

ResNet

When going deeper with neural networks, there is a great chance of adverse effects that hamper training, such as vanishing gradients. These problems triggered new ideas that led to various training techniques and networks. For instance, the main motivation of the auxiliary classifiers in the Inception architectures is to push extra gradient to the lower layers and avoid the vanishing gradient problem.

In He et al. 2015, the authors focused on the degradation problem. Given the trend that deeper architectures bring better improvements, one might think that simply stacking more layers is the best way to improve. But it turned out that stacking layers is not always beneficial if the architecture is not designed carefully. The authors of ResNet ran a preliminary experiment in which they trained two different networks, one with 20 layers and one with 56 layers. What they found was that the 20-layer network actually performs better than the deeper one, in terms of both training and test error, which implies this is not a problem of overfitting.

image

In theory, a deeper network should be at least as good as a shallow one at fitting the training data. Consider a 20-layer network and a 56-layer network obtained by stacking 36 additional layers on top of the 20-layer one. The deeper network can express the shallower one simply by setting those 36 layers to identity mappings. But the degradation problem contradicts this seemingly self-evident argument.

Hence, the reason why deeper networks perform worse in practice may be that larger networks are much more difficult to optimize. There are many potential causes, and one candidate is vanishing gradients. However, their experiments suggest that this optimization difficulty is unlikely to be caused by vanishing gradients.

Residual Learning

The authors of this paper proposed a very simple way to achieve this. They proposed to add a shortcut connection, which simply adds the input $\mathbf{x}$ itself to the output of the convolution blocks. Let $\mathcal{H}$ be the desired mapping that the layers should approximate. The deep residual learning framework they proposed forces the layers to learn the residual mapping $\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}$, and they hypothesized that this residual mapping $\mathcal{F}$ is easier to optimize (learn) than the original mapping $\mathcal{H}$. For instance, when $\mathcal{H}$ is the identity mapping, it is much easier to push the residual output to zero than to fit a stack of nonlinear layers to the identity. This is exactly what the degradation problem suggests: the solvers may have difficulty approximating identity mappings with multiple nonlinear layers.

The formulation $\mathcal{F}(\mathbf{x}) + \mathbf{x}$ can be realized by feedforward neural networks with shortcut connections:

image

In this figure, the shortcut connection is valid only if the dimensions of $\mathbf{x}$ and $\mathcal{F}(\mathbf{x})$ are equal. If they are not, we can perform a linear projection $W_s$ on the shortcut connection to match the dimensions: $\mathbf{y} = \mathcal{F}(\mathbf{x}) + W_s \mathbf{x}$. We could also use a square matrix $W_s$ in the general case, but experiments showed that the identity mapping is sufficient for addressing the degradation problem and is more economical. Thus, $W_s$ is used only when matching dimensions.

$\mathbf{Remark.}$ (1) It introduces neither extra parameters (when the dimensions of $\mathbf{x}$ and the output of $\mathcal{F}$ match) nor extra computational complexity. (2) You can put any number of layers inside the shortcut connection, but it is not meaningful when $\mathcal{F}$ has only one layer: it becomes just a simple linear function $\mathbf{y} = W \mathbf{x} + \mathbf{x}$. (3) This residual learning framework can also be applied to convolutional layers. In that case, the element-wise addition is performed on two feature maps, channel by channel.
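
A minimal sketch of a residual block with an optional projection shortcut; this uses the common conv-BN arrangement with illustrative channel arguments, not the authors’ exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal residual block sketch: y = F(x) + x, with an optional 1x1
# projection W_s on the shortcut when the dimensions change.
class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            # Projection shortcut to match dimensions.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # element-wise addition, then ReLU
```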

If you take the partial derivative of this layer with respect to $\mathbf{x}$, you get $1 + \frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}}$, so the gradient always has a direct path through the identity term: gradients from later layers can flow back to earlier layers via the identity mapping. Intuitively, the residual connection also makes the network learn residual knowledge: instead of learning the entire transformation, it only learns the residual of the transformation.

Architecture

The 34-layer residual network is based on a VGG-19-inspired plain architecture, with shortcut connections added every two layers:

image

The solid lines denote identity shortcuts, while the dotted lines express two options: zero-padding the input or a projection shortcut by $1 \times 1$ convolution. Zero padding doesn’t introduce any extra parameters, but the padded dimensions carry no residual learning. In their experiments, the authors compared 3 options on ImageNet validation error: $A$ - zero-padding shortcuts for increasing dimensions / $B$ - projection shortcuts for increasing dimensions, identity otherwise / $C$ - all shortcuts are projections.

image

The results ranked $C > B > A$, but the small differences between them imply that projection shortcuts are not essential for addressing the degradation problem. So although C performed best, projections everywhere are not used, in order to keep the model simple and economical.

Similar to GoogLeNet, they also exploited $1 \times 1$ convolutions to make the parameterization efficient in much deeper networks. Instead of applying the plain convolution blocks directly, each block applies $1 \times 1$ convolutions with a smaller channel size as a bottleneck, a form of efficient parameterization: it first reduces and then restores the dimensions, leaving the $3 \times 3$ convolution a bottleneck with smaller input/output dimensions. The following figure shows the two versions of ResNet building blocks, which have similar time complexity:

image

Here, a projection shortcut would double the time complexity and model size of the bottleneck block, since the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to much more efficient models for the bottleneck design.
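
A sketch of the residual branch of such a bottleneck block, using the 256 → 64 → 256 channel sizes of the standard example; the identity shortcut is then added to this branch’s output as before.

```python
import torch.nn as nn

# Bottleneck residual branch sketch: 1x1 reduce -> 3x3 -> 1x1 restore.
def bottleneck_branch(channels=256, reduced=64):
    return nn.Sequential(
        nn.Conv2d(channels, reduced, kernel_size=1, bias=False),   # reduce dimensions
        nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
        nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
        nn.Conv2d(reduced, channels, kernel_size=1, bias=False),   # restore dimensions
        nn.BatchNorm2d(channels),
    )
```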

See the next table for detailed architectures.

image

DenseNet [G. Huang et al. 2017]

The emergence of ResNet gave rise to a new approach for stacking deeper networks: create short paths from early layers to later layers. DenseNet was also proposed on the basis of this idea, but it connects all layers with each other to achieve maximum information flow between layers, whereas ResNet connects layers only intermittently.

Dense connectivity

Let $\mathbf{x}_0$ be an input image of a convolutional neural network with $L$ layers. Let $H_\ell (\cdot)$ be the non-linear transformation of each layer $\ell$. It can be a composite function of operations such as batch normalization, activation, pooling, or convolution. In the paper, $H_\ell$ is implemented as a composite of three consecutive operations: BN, ReLU, and $3 \times 3$ conv, motivated by Identity Mappings in Deep Residual Networks.

Then, let $\mathbf{x}_\ell$ be the output of layer $\ell$. ResNet implements a skip connection that bypasses the non-linear transformation with an identity function: $\mathbf{x}_\ell = H_\ell (\mathbf{x}_{\ell - 1}) + \mathbf{x}_{\ell - 1}$. Hence, gradients can flow directly from later layers back to earlier layers via the identity mapping. However, the information flow is not entirely straightforward, as the two signals are combined by summation.

image

Instead of skip connections, DenseNet employs a different connectivity pattern that connects each layer to all subsequent layers by concatenation, not summation.

\[\begin{aligned} \mathbf{x}_\ell = H_\ell ( [\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_{\ell - 1}] ) \end{aligned}\]

image

Hence, a convolutional neural network with $L$ layers will have a total of $\frac{L(L+1)}{2}$ connections.
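
A minimal sketch of a dense block built on this connectivity pattern, with $H_\ell$ implemented as BN → ReLU → 3×3 conv and an illustrative growth rate of $k = 12$; the layer count and naming are assumptions of this sketch.

```python
import torch
import torch.nn as nn

# Sketch of a dense block: each layer H_l (BN -> ReLU -> 3x3 conv) receives the
# concatenation of all preceding feature maps and adds k new ones (growth rate).
class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # x_l = H_l([x_0, ..., x_{l-1}])
        return torch.cat(features, dim=1)
```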

Architecture

To enable downsampling in the network, DenseNet is divided into dense blocks separated by transition layers. A transition layer consists of a batch normalization layer and a $1 \times 1$ convolutional layer, followed by a $2 \times 2$ average pooling that downsamples the feature map (i.e., it changes the feature map size via convolution and pooling).

image

DenseNet-B

As in ResNet, the authors also introduced a bottleneck layer, a $1 \times 1$ convolution before each $3 \times 3$ convolution in a dense block, to reduce the number of input feature maps and improve computational efficiency. (The authors refer to DenseNet with bottleneck layers as DenseNet-B.)

DenseNet-C

They also reduce the number of feature maps at the transition layers to improve model compactness. For instance, if a dense block produces $m$ feature maps, the following transition layer generates $\lfloor \theta m \rfloor$ output feature maps, where $0 < \theta \leq 1$ is called the compression factor. (If $\theta < 1$, the DenseNet is referred to as DenseNet-C; if it also has bottleneck layers, it is called DenseNet-BC.)

Advantages

Parameter efficiency (feature reuse)

Suppose each $H_\ell$ produces $k$ feature maps. If $k_0$ is the number of channels of the input layer, the $\ell$-th layer will have $k_0 + k \times (\ell - 1)$ input feature maps. The authors refer to $k$ as the growth rate. A notable difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, in other words a small $k$, for instance $k = 12$.

One explanation is that each layer has access to all the preceding feature maps, the network’s “collective knowledge”, which can be viewed as the global state of the network. Each layer adds $k$ feature maps of its own to this state. The growth rate regulates how much new information each layer contributes to the global state. The global state, once written, can be accessed from everywhere within the network and, unlike in traditional architectures, there is no need to replicate it from layer to layer.

Improved flow of information
