10 Days Of Grad: Deep Learning From The First Principles

Day 5: Convolutional Neural Networks Tutorial

Today we have finally arrived at the "master algorithm" in computer vision. That is what François Chollet calls convolutional neural networks (CNNs). First, we are going to build an intuition behind those algorithms. Then, we will take a look at a basic CNN architecture. After discussing the differences between convolutional layer types, we are going to implement them in Haskell.

Convolution operator

Previously, we learned about fully-connected neural networks. Although in theory those can approximate any reasonable function, they have certain limitations. One of the challenges is to address translation symmetry. To explain this, let us take a look at the two cat pictures below.

Translation symmetry: Same object in different locations.

For us humans, it does not matter whether a cat is in the lower right corner or somewhere in the top part of an image. In both cases we find a cat. So we can say that our human cat detector is translation invariant.

However, if we look at the architecture of a typical fully-connected network, we may realize that there is actually nothing that prevents this network from working correctly only on some part of an image. The question here is: is there any way to make a neural network translation invariant?

Fully-connected neural network with two hidden layers. Image credit: Wikimedia.

Let us take a closer look at the cat image. Soon we realize that pixels representing the cat's head are more contextually related to each other than they are to pixels representing the cat's tail. Therefore, we would also like to make our neural network sparse, so that neurons in the next layer are connected only to the relevant neighboring pixels. This way, each neuron in the next layer would be responsible for a small feature in the original image. The area that a neuron "sees" is called a receptive field.

Zoom into the cats figure.

Neighboring pixels give more relevant information than distant ones.

Convolutional neural networks (CNNs) or simply ConvNets were designed to address those two issues: translation symmetry and image locality. First, let us give an intuitive explanation of a convolution operator1.

You have very likely encountered convolution filters before. Recall when you first played with a (raster) graphics editor like GIMP or Photoshop. You were probably delighted to obtain effects such as sharpening, blur, or edge detection. If you haven't, then you probably should :). The secret of all those filters is the convolutional application of an image kernel. The image kernel is typically a $3 \times 3$ matrix such as the one below.

Dot product between pixel values and a kernel. Image credit: [GIMP](https://docs.gimp.org/2.8/en/images/filters/examples/convolution-calculate.png).

A single convolution step:

The figure shows a single convolution step. This step is a dot product between the kernel and the pixel values. Since all the kernel values except the second one in the first row are zeros, the result is equal to the second value in the first row of the green frame, i.e. $40 \cdot 0 + 42 \cdot 1 + 46 \cdot 0 + \dotsc + 58 \cdot 0 = 42$. The convolution operator takes an image and acts within the green "sliding window" to perform a dot product over every part of that image. The result is a new, filtered image.

Mathematically, the (discrete) convolution operator $(*)$ between an image $A \in \mathbb{R}^{D_{F_1} \times D_{F_2}}$ and a kernel $K \in \mathbb{R}^{D_{K_1} \times D_{K_2}}$ can be formalized as $$(A * K)_{i, j} = \sum_{m=0}^{D_{K_1} - 1} \sum_{n=0}^{D_{K_2} - 1} K_{m, n} \cdot A_{i-m, j-n},$$ where $0 \le i < D_{K_1} + D_{F_1} - 1$ and $0 \le j < D_{K_2} + D_{F_2} - 1$, and entries of $A$ with out-of-range indices are taken to be zero. To better understand how convolution with a kernel changes the original image, you can play with different image kernels.
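The formula above can be transcribed almost literally into Haskell. Below is a minimal sketch on plain lists, not an optimized implementation. It assumes that pixels outside the image count as zeros; the names `Matrix`, `at`, and `conv2dFull` are made up for this illustration.

```haskell
type Matrix = [[Double]]

-- | Pixel lookup that treats out-of-range indices as zero
-- (an assumption of this sketch: implicit zero padding).
at :: Matrix -> Int -> Int -> Double
at a i j
  | i < 0 || j < 0       = 0
  | i >= length a        = 0
  | j >= length (a !! i) = 0
  | otherwise            = a !! i !! j

-- | Discrete 2D convolution of an image `a` with a kernel `k`:
-- result(i, j) = sum over m, n of K(m, n) * A(i - m, j - n).
conv2dFull :: Matrix -> Matrix -> Matrix
conv2dFull a k =
  [ [ sum [ (k !! m !! n) * at a (i - m) (j - n)
          | m <- [0 .. dk1 - 1], n <- [0 .. dk2 - 1] ]
    | j <- [0 .. dk2 + df2 - 2] ]
  | i <- [0 .. dk1 + df1 - 2] ]
  where
    dk1 = length k
    dk2 = if null k then 0 else length (head k)
    df1 = length a
    df2 = if null a then 0 else length (head a)
```

For example, the $1 \times 2$ kernel `[[0, 1]]` shifts every row one pixel to the right: `conv2dFull [[1,2],[3,4]] [[0,1]]` evaluates to `[[0.0,1.0,2.0],[0.0,3.0,4.0]]`.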

What is the motivation behind this sliding window/convolution operator approach? It has a biological background. In fact, the human eye has a relatively narrow visual field. We perceive objects as a whole by constantly moving our eyes around them. These rapid eye movements are called saccades. Therefore, the convolution operator may be regarded as a simplified model of the image scanning that occurs naturally. The important point is that, thanks to the sliding window, convolutions achieve translation invariance. Moreover, since every dot product result is connected, through the kernel, only to a very limited number of pixels in the initial image, convolution connections are very sparse. Therefore, by using convolutions in neural networks we achieve both translation invariance and connection sparsity.

Convolutional Neural Network Architecture

An interesting property of convolutional layers is that if the input image is shifted, the feature map output will be shifted by the same amount, but it will be left unchanged otherwise. This property is at the basis of the robustness of convolutional networks to shifts and distortions of the input.

Once a feature has been detected, its exact location becomes less important. Only its approximate position relative to other features is relevant.

Lecun et al. Gradient-based learning applied to document recognition (1998)
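The shift property from the quote can be checked on a toy example. The sketch below uses a "valid" sliding-window dot product (the window never leaves the image) on plain lists. All names (`windows`, `correlate2d`, `shiftRight`, `blob`, `blobDetector`) and the toy data are hypothetical, chosen only for this illustration.

```haskell
import Data.List (tails, transpose)

type Matrix = [[Double]]

-- | All contiguous windows of length n in a list.
windows :: Int -> [a] -> [[a]]
windows n = takeWhile ((== n) . length) . map (take n) . tails

-- | "Valid" sliding-window dot product of a non-empty kernel with an image.
correlate2d :: Matrix -> Matrix -> Matrix
correlate2d k img =
  [ [ sum (zipWith (*) (concat k) (concat patch))
    | patch <- transpose (map (windows kw) rows) ]
  | rows <- windows kh img ]
  where
    kh = length k
    kw = length (head k)

-- | Shift an image one pixel to the right, filling the gap with zeros.
shiftRight :: Matrix -> Matrix
shiftRight = map (\row -> take (length row) (0 : row))

-- | A toy image with a 2 x 2 bright blob and a kernel that responds to it.
blob, blobDetector :: Matrix
blob =
  [ [0, 0, 0, 0, 0]
  , [0, 1, 1, 0, 0]
  , [0, 1, 1, 0, 0]
  , [0, 0, 0, 0, 0] ]
blobDetector = [[1, 1], [1, 1]]
```

Here `correlate2d blobDetector blob` gives `[[1.0,2.0,1.0,0.0],[2.0,4.0,2.0,0.0],[1.0,2.0,1.0,0.0]]`, while `correlate2d blobDetector (shiftRight blob)` gives `[[0.0,1.0,2.0,1.0],[0.0,2.0,4.0,2.0],[0.0,1.0,2.0,1.0]]`. The peak response of 4 moves one column to the right together with the blob, and the feature map is otherwise unchanged apart from the border column.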

The prototype of what we today call convolutional neural networks was first proposed back in the 1980s by Fukushima. Many unsupervised and supervised training methods have been proposed, but today CNNs are trained with backpropagation. Let us take a look at one of the famous ConvNet architectures, known as LeNet-5.

LeNet-5 architecture from [Lecun _et al._ Gradient-based learning applied to document recognition](http://doi.org/10.1109/5.726791).

The architecture is very close to modern CNNs. LeNet-5 was designed to perform handwritten digit recognition from $32 \times 32$ black and white images. The two main building blocks, as we call them now, are a feature extractor and a classifier.

With local receptive fields neurons can extract elementary visual features such as oriented edges, endpoints, corners...

Lecun et al. Gradient-based learning applied to document recognition (1998)

The feature extractor consists of two convolutional layers. The first convolutional layer has six convolutional filters with $5 \times 5$ kernels. Application of those filters with subsequent bias additions and hyperbolic tangent activations2 produces feature maps, essentially new, slightly smaller ($28 \times 28$) images. By convention, we describe the result as a volume of $28 \times 28 \times 6$. To reduce the spatial resolution, a subsampling is then performed3. That outputs $14 \times 14 \times 6$ feature maps.

All the units in a feature map share the same set of 25 weights and the same bias, so they detect the same feature at all possible locations on the input.

Lecun et al. Gradient-based learning applied to document recognition (1998)

The next round of convolutions results in $10 \times 10 \times 16$ feature maps. Note that, unlike in the first convolutional layer, we apply $5 \times 5 \times 6$ kernels. That means that each of the sixteen convolutions simultaneously processes all six feature maps obtained from the previous step. After subsampling we obtain a resulting volume of $5 \times 5 \times 16$.
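The volume sizes quoted above follow from simple arithmetic, sketched below. The sketch assumes stride-1 convolutions without padding, non-overlapping $2 \times 2$ subsampling, and a single-channel input image; the function names are illustrative.

```haskell
-- | Spatial size after a convolution with a k x k kernel
-- (stride 1, no padding).
convSize :: Int -> Int -> Int
convSize k n = n - k + 1

-- | Spatial size after non-overlapping s x s subsampling.
subsampleSize :: Int -> Int -> Int
subsampleSize s n = n `div` s

-- | (height, width, channels) of the volumes flowing through the
-- LeNet-5 feature extractor, starting from the 32 x 32 input.
lenetShapes :: [(Int, Int, Int)]
lenetShapes = [ (s, s, c) | (s, c) <- zip sizes channels ]
  where
    -- Apply the layers one after another, keeping every intermediate size.
    sizes = scanl (flip ($)) 32
              [convSize 5, subsampleSize 2, convSize 5, subsampleSize 2]
    channels = [1, 6, 6, 16, 16]
```

Evaluating `lenetShapes` yields `[(32,32,1),(28,28,6),(14,14,6),(10,10,16),(5,5,16)]`, matching the volumes described in the text.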

The classifier consists of three densely connected layers with 120, 84, and 10 neurons, respectively. The last layer provides a one-hot-encoded4 answer. The slight difference from modern architectures is in the final layer, which consists of ten Euclidean radial basis function units, whereas today this would be a normal fully-connected layer followed by a softmax layer.
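For reference, here is a minimal sketch of the modern variant mentioned above (softmax over the ten output scores), not of the original radial basis function units. The names `softmax` and `predictedClass` are illustrative, and a non-empty list of scores is assumed.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- | Turn raw scores into probabilities that sum to one.
-- Subtracting the maximum first keeps `exp` from overflowing.
softmax :: [Double] -> [Double]
softmax xs = map (/ total) exps
  where
    m     = maximum xs
    exps  = [ exp (x - m) | x <- xs ]
    total = sum exps

-- | Index of the most probable class (0 to 9 for digit recognition).
predictedClass :: [Double] -> Int
predictedClass scores =
  fst (maximumBy (comparing snd) (zip [0 ..] (softmax scores)))
```

For instance, `softmax [0, 0]` evaluates to `[0.5,0.5]`, and `predictedClass [0.1, 2.5, -1]` evaluates to `1`.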

It is important to understand that a single convolution filter is able to detect only a single feature. For instance, it may be able to detect horizontal edges. Therefore, we use several more filters with different kernels to get features such as vertical edges, simple textures, or corners. As you have seen, the number of filters is typically represented in ConvNet diagrams as volume depth. Interestingly, layers deeper in the network will combine the most basic features detected in the first layers into more abstract representations such as eyes, ears, or even complete figures. To better understand this mechanism, let us inspect the receptive field visualization below.

Receptive field visualization derived from Arun Mallya. Hover the mouse cursor over any neuron in the top layers to see how its receptive field extends in the previous (bottom) layers.

As we can see by checking neurons in the last layers, even a small $3 \times 3$ receptive field grows as one moves towards the first layers. Indeed, we may anticipate that "deeper" neurons will have a better overall view of what happens in the image.

The beauty and the biggest achievement of deep learning is that filter kernels are learned automatically with backpropagation5. A peculiarity of convolutional layers is that the result is obtained by repeated application of a small number of weights, as defined by a kernel. Thanks to this weight sharing, convolutional layers have a drastically reduced number of trainable parameters6 compared to fully-connected layers.
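To put a number on this reduction, the following sketch counts trainable parameters for a convolutional layer and for a fully-connected layer producing the same output volume. The function names are illustrative, and the count assumes every filter sees all input channels.

```haskell
-- | Trainable parameters of a convolutional layer: each of the `nOut`
-- filters has kh * kw * nIn shared weights plus one bias.
convParams :: Int -> Int -> Int -> Int -> Int
convParams kh kw nIn nOut = nOut * (kh * kw * nIn + 1)

-- | Trainable parameters of a fully-connected layer: one weight per
-- input-output pair plus one bias per output.
denseParams :: Int -> Int -> Int
denseParams nIn nOut = nOut * (nIn + 1)
```

For the first LeNet-5 layer, `convParams 5 5 1 6` gives `156`. A fully-connected layer mapping the same $32 \times 32$ input to a $28 \times 28 \times 6$ output would need `denseParams (32 * 32) (28 * 28 * 6)`, that is `4821600` parameters.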

Convolution Types

I decided to include this section for curious readers. If this is the first time you encounter CNNs, feel free to skip the section and revisit it later.

There is a lot of hype around convolutions nowadays. However, it is rarely made clear that low-level convolutions for computer vision are often different from those exploited by ConvNets. Moreover, even in ConvNets there is a variety of convolutional layers, inspired by the Inception ConvNet architecture and shaped by the Xception and Mobilenet works. I believe that you deserve to know that there exist multiple kinds of convolutions applied in different contexts, and here I shall provide a general roadmap7.

1. Computer Vision-Style Convolution

Low-level computer vision (CV) software, for instance a graphics editor, typically operates on one-, three-, or four-channel images (e.g. red, green, blue, and transparency). An individual kernel is typically applied to each channel. In this case, there are usually as many resulting channels as there are channels in the input image. A special case of this style of convolution is when the same kernel is applied to each channel (e.g. blurring). Sometimes, the resulting channels are summed, producing a one-channel image (e.g. edge detection).

2. LeNet-like Convolution

A pure CV-style convolution is different from those in ConvNets for two reasons: (1) in CV, kernels are manually defined, whereas the power of neural networks comes from training, and (2) in neural networks we build a deep structure by stacking multiple convolutions on top of each other. Therefore, we need to recombine the information coming from previous layers. That allows us to train higher-level feature detectors8. Finally, convolutions in neural networks may contain bias terms, i.e. constants added to the results of each convolution.

Recombination of features coming from earlier layers was previously illustrated in the LeNet-5 example. As you remember, in the second convolutional layer we applied 3D kernels of size $5 \times 5 \times 6$, computing dot products simultaneously on all six feature maps from the first layer. There were sixteen different kernels, thus producing sixteen new channels.

Each pixel in the feature map is obtained as a dot product between the RGB color channels and the sliding kernel. Image credit: Wikimedia.

Convolution filter with three input channels.

To summarize, a single LeNet-like convolution operates simultaneously on all input channels and produces a single channel. By having an arbitrary number of kernels, any number of output channels is obtained. It is not uncommon to operate on volumes of 512 channels! The computational cost of such a convolution is $D_K \times D_K \times M \times N \times D_F \times D_F$, where $M$ is the number of input channels, $N$ is the number of output channels, $D_K \times D_K$ is the kernel size, and $D_F \times D_F$ is the feature map size9.
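A LeNet-like convolution can be sketched on plain lists as follows. This is an illustration under simplifying assumptions (stride 1, no padding, at least one input channel, non-empty kernels), and all type and function names are hypothetical.

```haskell
import Data.List (tails, transpose)

type Channel = [[Double]]    -- height x width
type Volume  = [Channel]     -- a stack of M channels
type Kernel  = [[[Double]]]  -- one 2D slice per input channel

-- | All contiguous windows of length n in a list.
windows :: Int -> [a] -> [[a]]
windows n = takeWhile ((== n) . length) . map (take n) . tails

-- | Sliding-window dot product of one kernel slice with one channel.
correlate2d :: [[Double]] -> Channel -> Channel
correlate2d k img =
  [ [ sum (zipWith (*) (concat k) (concat patch))
    | patch <- transpose (map (windows kw) rows) ]
  | rows <- windows kh img ]
  where
    kh = length k
    kw = length (head k)

-- | One filter: process every input channel with its own kernel slice,
-- sum the results pixel-wise into a single channel, and add the bias.
convFilter :: Kernel -> Double -> Volume -> Channel
convFilter kernel bias volume =
  map (map (+ bias))
      (foldr1 (zipWith (zipWith (+))) (zipWith correlate2d kernel volume))

-- | A layer with N filters produces N output channels.
convLayer :: [(Kernel, Double)] -> Volume -> Volume
convLayer filters volume = [ convFilter k b volume | (k, b) <- filters ]
```

Each `(Kernel, Double)` pair plays the role of one trained filter with its bias, so the length of the `filters` list corresponds to $N$ and the length of `volume` to $M$.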

3. Depthwise Separable Convolution

LeNet-style convolution requires a large number of operations. But do we really need all of them? For instance, can spatial and cross-channel correlations be somehow decoupled? The Xception paper, largely inspired by the Inception architecture, shows that one can indeed build more efficient convolutions by assuming that spatial correlations and cross-channel correlations can be mapped independently. This principle was also applied in the Mobilenet architectures.

The depthwise separable convolution works in the following way. First, like in low-level computer vision, individual kernels are applied to each individual channel. Then, after an optional activation10, there is another convolution, but this time exclusively between channels. That is typically achieved by applying a $1 \times 1$ convolution kernel. Finally, there is a (ReLU) activation.
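The two steps can be sketched as follows. This is a minimal illustration on plain lists that leaves out biases, batch normalization, and the optional intermediate activation; all names are hypothetical.

```haskell
import Data.List (tails, transpose)

type Channel = [[Double]]
type Volume  = [Channel]

-- | All contiguous windows of length n in a list.
windows :: Int -> [a] -> [[a]]
windows n = takeWhile ((== n) . length) . map (take n) . tails

-- | Sliding-window dot product of one non-empty kernel with one channel.
correlate2d :: [[Double]] -> Channel -> Channel
correlate2d k img =
  [ [ sum (zipWith (*) (concat k) (concat patch))
    | patch <- transpose (map (windows kw) rows) ]
  | rows <- windows kh img ]
  where
    kh = length k
    kw = length (head k)

-- | Step 1 (space only): one 2D kernel per input channel.
depthwise :: [[[Double]]] -> Volume -> Volume
depthwise = zipWith correlate2d

-- | Step 2 (channels only): a 1 x 1 convolution. Each output channel is
-- a weighted sum of all input channels; `weights` has N rows of M values.
pointwise :: [[Double]] -> Volume -> Volume
pointwise weights volume =
  [ foldr1 (zipWith (zipWith (+)))
           [ map (map (* w)) ch | (w, ch) <- zip ws volume ]
  | ws <- weights ]

relu :: Double -> Double
relu = max 0

-- | Depthwise step, then pointwise step, then the final ReLU.
depthwiseSeparable :: [[[Double]]] -> [[Double]] -> Volume -> Volume
depthwiseSeparable kernels weights =
  map (map (map relu)) . pointwise weights . depthwise kernels
```

Note that `depthwise` keeps the number of channels unchanged, whereas `pointwise` maps $M$ channels to $N$ channels without looking at neighboring pixels.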

This way, depthwise separable convolution has two distinct steps: a space-only convolution and a channel recombination. This reduces the number of operations to $D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$9.
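To see how much is saved, we can divide this cost by the cost of a LeNet-like convolution with the same $D_K$, $M$, $N$, and $D_F$: $$\frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2}.$$ As a worked example, assuming $3 \times 3$ kernels and $N = 512$ output channels, the ratio is $\frac{1}{512} + \frac{1}{9} \approx 0.113$, i.e. roughly 8.8 times fewer operations.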

To be continued...


Convolutional neural networks, aka ConvNets, achieve translation invariance and connection sparsity. Thanks to weight sharing, ConvNets dramatically reduce the number of trainable parameters. The power of ConvNets comes from training convolution filters, in contrast to manual feature engineering.

In the next post we will apply CNNs to MNIST and CIFAR-10 image challenges. Stay tuned!

  1. In fact, cross-correlation, not convolution. Still, for neural networks that would be practically the same operation.
  2. The actual "squashing" activation was $f(x) = 1.7159 \tanh(\frac{2}{3} x)$.
  3. By subsampling, the LeNet authors mean local averaging in $2 \times 2$ squares with subsequent scaling by a constant, bias addition, and sigmoid activation. In modern CNNs, a simple max-pooling is performed instead.
  4. We have discussed one-hot encoding on Day 1.
  5. To refresh your memory about the backprop algorithm, check out previous days.
  6. Typically, for each convolutional layer the number of parameters is equal to the number of kernel weights plus a trainable bias. For instance, for a $5 \times 5 \times 6$ kernel, this number is equal to 151.
  7. Whereas I discuss only 2D convolutions that are useful for image-like objects, those convolutions can be generalized to 1D and 3D.
  8. By higher-level features I mean features detected by layers deeper in the network, such as geometric shapes, eyes, ears, or even complete figures.
  9. For the details, see the Mobilenet paper.
  10. In the Xception architecture a better result was achieved without an intermediate activation (on the ImageNet classification challenge). In addition, activations, when they are present, are preceded by batch normalization. Batch normalization results in no added bias term after convolutions.