Today we have finally arrived at the "master algorithm" of computer vision. That is what François Chollet calls convolutional neural networks (CNNs). First, we will build an intuition behind these algorithms. Then, we will take a look at the basic CNN architecture. After discussing the differences between convolutional layer types, we will implement them in Haskell.

**Previous posts**

- Day 1: Learning Neural Networks The Hard Way
- Day 2: What Do Hidden Layers Do?
- Day 3: Haskell Guide To Neural Networks
- Day 4: The Importance Of Batch Normalization

## Convolution operator

Previously, we learned about fully-connected neural networks. Although those can theoretically approximate any reasonable function, they have certain limitations. One of the challenges is addressing translation symmetry. To explain this, let us take a look at the two cat pictures below.

For us humans, it does not matter whether a cat is in the lower right corner or somewhere in the top part of an image. In both cases we find a cat. So we can say that our human cat detector is *translation invariant*.

However, if we look at the architecture of a typical fully-connected network, we may realize that nothing actually prevents this network from working correctly only on some part of an image. The question is: is there any way to make a neural network translation invariant?

Let us take a closer look at the cat image.
Soon we realize that the pixels representing the cat's head are more contextually related to each other
than they are to the pixels representing the cat's tail. Therefore, we would also like to make
our neural network *sparse*, so that neurons in the next layer are
connected only to the relevant neighboring pixels. This way, each neuron
in the next layer would be responsible for a small *feature* of the original image.
The area that a neuron "sees" is called its *receptive field*.

Convolutional neural networks (CNNs), or simply *ConvNets*, were
designed to address those two issues: translation symmetry
and image locality. First, let us give an intuitive explanation
of the convolution operator^{1}.

You have very likely encountered convolution filters before.
Recall when you first played with a (raster) graphics editor such as GIMP or Photoshop.
You were probably delighted to obtain effects such as sharpening, blur, or
edge detection. If you haven't, then you probably should :).
The secret behind all those filters is the convolutional application of an *image kernel*.
An image kernel is typically a $3 \times 3$ matrix such as the one below.

Here a single convolution step is shown. This step is a dot product between the kernel and the pixel values under the green frame. Since all the kernel values except the second one in the first row are zeros, the result is equal to the second value in the first row of the green frame, i.e. $40 \cdot 0 + 42 \cdot 1 + 46 \cdot 0 + \dotsc + 58 \cdot 0 = 42$. The convolution operator takes an image and slides the green window over it, performing this dot product at every position. The result is a new, filtered image. Mathematically, the (discrete) convolution operator $(*)$ between an image $A \in \mathbb{R}^{D_{F_1} \times D_{F_2}}$ and a kernel $K \in \mathbb{R}^{D_{K_1} \times D_{K_2}}$ can be formalized as $$(A * K)_{i,j} = \sum_{m=0}^{D_{K_1} - 1} \sum_{n=0}^{D_{K_2} - 1} K_{m, n} \cdot A_{i-m, j-n},$$ where $0 \le i < D_{K_1} + D_{F_1} - 1$ and $0 \le j < D_{K_2} + D_{F_2} - 1$. To better understand how convolution with a kernel changes the original image, you can play with different image kernels.
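To make this concrete, here is a minimal Haskell sketch of the sliding-window computation. Following footnote 1, it computes a "valid" cross-correlation (the kernel is not flipped, and no padding is added); the `Image`, `window`, `dot`, and `convolve` names are ad hoc, not a library API, and production code would use an array library such as repa or massiv.

```haskell
-- A minimal sketch of a "valid" 2D cross-correlation over plain lists.
type Image = [[Double]]

-- The dk x dk window of the image whose top-left corner is at (i, j).
window :: Int -> Int -> Int -> Image -> Image
window dk i j = map (take dk . drop j) . take dk . drop i

-- Dot product between the kernel and one window of the image.
dot :: Image -> Image -> Double
dot k w = sum (zipWith (*) (concat k) (concat w))

-- Slide the kernel over every valid position of the image.
convolve :: Image -> Image -> Image
convolve kernel img =
  [ [ dot kernel (window dk i j img) | j <- [0 .. cols - dk] ]
  | i <- [0 .. rows - dk] ]
  where
    dk   = length kernel
    rows = length img
    cols = length (head img)

-- The worked example from the text: a kernel that is all zeros except
-- the second entry of the first row simply picks out that pixel.
main :: IO ()
main = print (convolve [[0,1,0],[0,0,0],[0,0,0]]
                       [[40,42,46],[50,52,56],[60,62,66]])
-- prints [[42.0]]
```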

What is the motivation behind this sliding window/convolution operator approach?
It has a biological background.
In fact, the human eye has a relatively narrow *visual field*.
We perceive objects as a whole by constantly moving our eyes around them.
These rapid eye movements
are called *saccades*.
Therefore, the convolution operator may be regarded as a simplified model of
the image scanning that occurs naturally. The important point
is that, thanks to the sliding window, convolutions achieve translation invariance. Moreover,
since every dot product result is connected, through the kernel, only to a very
limited number of pixels in the initial image, convolution connections
are very sparse. Therefore, by using convolutions
in neural networks we achieve both translation invariance and connection sparsity.

## Convolutional Neural Network Architecture

> An interesting property of convolutional layers is that if the input image is shifted, the feature map output will be shifted by the same amount, but it will be left unchanged otherwise. This property is at the basis of the robustness of convolutional networks to shifts and distortions of the input.
>
> Once a feature has been detected, its exact location becomes less important. Only its approximate position relative to other features is relevant.
>
> (LeCun et al., *Gradient-based learning applied to document recognition*, 1998)

The prototype of what we today call convolutional neural networks was first proposed back in the 1980s by Fukushima (the Neocognitron). Many unsupervised and supervised training methods have been proposed, but today CNNs are trained with backpropagation. Let us take a look at one of the famous ConvNet architectures, known as LeNet-5.

The architecture is very close to modern CNNs.
LeNet-5 was designed to perform handwritten digit recognition
from $32 \times 32$ black and white images.
The two main building blocks, as we call them now,
are a *feature extractor* and a *classifier*.

> With local receptive fields neurons can extract elementary visual features such as oriented edges, endpoints, corners...
>
> (LeCun et al., *Gradient-based learning applied to document recognition*, 1998)

The *feature extractor* consists of
two convolutional layers. The first convolutional
layer has six convolutional filters with $5 \times 5$ kernels.
Applying those filters, with subsequent bias additions and
hyperbolic tangent activations^{2}, produces *feature maps*:
essentially new, slightly smaller ($28 \times 28$) images.
By convention, we describe the result as a volume of
$28 \times 28 \times 6$.
To reduce the spatial resolution,
*subsampling* is then performed^{3}.
That outputs $14 \times 14 \times 6$ feature maps.
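As a quick sanity check of these sizes: a "valid" convolution shrinks a $D_F$-wide map to $D_F - D_K + 1$, and $2 \times 2$ subsampling halves it. The helpers below are my own, just a worked version of that arithmetic:

```haskell
-- Spatial size after a "valid" convolution with a dK x dK kernel.
validSize :: Int -> Int -> Int
validSize dF dK = dF - dK + 1

-- Spatial size after 2x2 subsampling.
poolSize :: Int -> Int
poolSize dF = dF `div` 2

-- LeNet-5 feature extractor: 32 -> 28 -> 14 -> 10 -> 5
-- validSize 32 5 == 28,  poolSize 28 == 14
-- validSize 14 5 == 10,  poolSize 10 == 5
```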

> All the units in a feature map share the same set of 25 weights and the same bias, so they detect the same feature at all possible locations on the input.
>
> (LeCun et al., *Gradient-based learning applied to document recognition*, 1998)

The next round of convolutions already results in $10 \times 10 \times 16$ feature maps. Note that, unlike in the first convolutional layer, we apply $5 \times 5 \times 6$ kernels. That means that each of the sixteen convolutions simultaneously processes all six feature maps obtained in the previous step. After subsampling we obtain a resulting volume of $5 \times 5 \times 16$.

The *classifier* consists of three densely connected layers
with 120, 84, and 10 neurons, respectively. The last layer
provides a one-hot-encoded^{4} answer. A slight difference
from modern architectures lies in the
final layer, which consists of ten Euclidean radial basis function
units, whereas today this would be a normal
fully-connected layer followed by a softmax layer.

It is important to understand that a single convolution filter is able to detect only a single feature. For instance, it may detect horizontal edges. Therefore, we use several more filters with different kernels to get features such as vertical edges, simple textures, or corners. As you have seen, the number of filters is typically represented in ConvNet diagrams as a volume. Interestingly, layers deeper in the network combine the most basic features detected in the first layers into more abstract representations such as eyes, ears, or even complete figures. To better understand this mechanism, let us inspect the receptive field visualization below.

Receptive field visualization derived from Arun Mallya. Hover the mouse cursor over any neuron in the top layers to see how its receptive field extends into the previous (bottom) layers.

As we can see by checking neurons in the last layers, even a small $3 \times 3$ receptive field grows as one moves back towards the first layers. Indeed, we may anticipate that "deeper" neurons have a better overall view of what happens in the image.

The beauty, and the biggest achievement, of deep learning
is that filter kernels are learned automatically with
backpropagation^{5}.
A peculiarity of convolutional layers is that
the result is obtained by repetitive application of
a small number of weights, as defined by a kernel.
Thanks to this *weight sharing*, convolutional layers
have a drastically reduced number of trainable parameters^{6}
compared to fully-connected layers.
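To see the effect of weight sharing in numbers, here is a tiny helper following the simplified counting of footnote 6 (kernel weights plus one bias per filter; the name `convParams` is mine):

```haskell
-- Trainable parameters of a convolutional layer with n kernels
-- of size dK x dK x m, each kernel carrying one bias (footnote 6).
convParams :: Int -> Int -> Int -> Int
convParams dK m n = n * (dK * dK * m + 1)

-- First LeNet-5 convolutional layer:  convParams 5 1 6  == 156
-- Second LeNet-5 convolutional layer: convParams 5 6 16 == 2416
```

Compare this with a fully-connected layer mapping the same $32 \times 32$ input onto a $28 \times 28 \times 6$ volume, which would need $1024 \cdot 4704 \approx 4.8$ million weights.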

## Convolution Types

I decided to include this section for curious readers. If this is the first time you have encountered CNNs, feel free to skip this section and revisit it later.

There is a lot of hype around convolutions nowadays. However, it is often not
made clear that low-level convolutions in computer vision differ from
those exploited by ConvNets. Moreover, even within ConvNets there is a variety of
convolutional layers, inspired by the Inception ConvNet architecture
and shaped by the Xception and Mobilenet works. I
believe you deserve to know that there exist multiple kinds of
convolutions applied in different contexts, so here I shall provide a general
roadmap^{7}.

### 1. Computer Vision-Style Convolution

Low-level computer vision (CV), for instance in graphics editors, typically operates
on one-, three-, or four-channel images (e.g. red, green, blue, and transparency).
An individual kernel is typically applied to each channel, so there are
usually as many resulting channels as there are channels in the input
image. A special case of this convolution style is when *the same* kernel
is applied to each channel (e.g. blurring). Sometimes the resulting
channels are summed, producing a one-channel image (e.g. edge detection).
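Reusing the `convolve` sketch from above (again, ad hoc helpers, not a library API), this style of convolution and its two special cases look as follows:

```haskell
-- CV-style convolution: one kernel per channel, applied independently,
-- so the output has as many channels as the input.
convolvePerChannel :: [Image] -> [Image] -> [Image]
convolvePerChannel = zipWith convolve

-- Special case 1 (e.g. blurring): the same kernel on every channel.
convolveSame :: Image -> [Image] -> [Image]
convolveSame kernel = map (convolve kernel)

-- Special case 2 (e.g. edge detection): sum the per-channel results
-- into a single-channel image.
sumChannels :: [Image] -> Image
sumChannels = foldr1 (zipWith (zipWith (+)))
```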

### 2. LeNet-like Convolution

A pure CV-style convolution differs from those in ConvNets for two
reasons: (1) in CV, kernels are manually defined, whereas the power of neural
networks comes from training, and (2) in neural networks we build
a deep structure by stacking multiple convolutions on top of each other.
Therefore, we need to recombine the information coming from previous
layers. That allows us to train higher-level feature detectors^{8}.
Finally, convolutions in neural networks may contain bias terms, i.e.
constants added to the result of each convolution.

Recombination of features coming from earlier layers was illustrated above in the LeNet-5 example. As you remember, in the second convolutional layer we applied 3D kernels of size $5 \times 5 \times 6$, computing dot products simultaneously on all six feature maps from the first layer. There were sixteen different kernels, thus producing sixteen new channels.

To summarize, a single LeNet-like convolution operates simultaneously on all
input channels and produces a single channel. By having an arbitrary number of
kernels, any number of output channels can be obtained. It is not uncommon to operate
on volumes of 512 channels! The computational cost of such a convolution is
$D_K \times D_K \times M \times N \times D_F \times D_F$,
where $M$ is the number of input channels,
$N$ is the number of output channels,
$D_K \times D_K$ is the kernel size, and $D_F \times D_F$ is
the feature map size^{9}.
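In code, this cost formula is a one-liner (a direct transcription of the formula above; the helper name is mine):

```haskell
-- Multiply-accumulate count of a standard LeNet-like convolution layer.
standardCost :: Int -> Int -> Int -> Int -> Int
standardCost dK m n dF = dK * dK * m * n * dF * dF

-- e.g. 3x3 kernels, 512 input and 512 output channels, 14x14 maps:
-- standardCost 3 512 512 14 == 462422016, i.e. ~4.6 * 10^8 operations
```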

### 3. Depthwise Separable Convolution

LeNet-style convolution requires a large number of operations. But do we really need all of them? For instance, can spatial and cross-channel correlations somehow be decoupled? The Xception paper, largely inspired by the Inception architecture, shows that one can indeed build more efficient convolutions by assuming that spatial correlations and cross-channel correlations can be mapped independently. This principle was also applied in the Mobilenet architectures.

The depthwise separable convolution works as follows. First, as in
low-level computer vision, individual kernels are applied to each individual channel.
Then, after an optional activation^{10},
there is another convolution, but this time
exclusively in-between channels. That is typically achieved by applying
a $1 \times 1$ convolution kernel. Finally, there is a (ReLU) activation.
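Building on the earlier sketches (`convolve` and `sumChannels`), here is a toy version of the two steps, with activations and biases omitted; the function name and argument layout are my own:

```haskell
-- Depthwise separable convolution: a space-only step followed by
-- a 1x1 cross-channel recombination. Activations omitted for brevity.
depthwiseSeparable
  :: [Image]     -- one dK x dK spatial kernel per input channel (M of them)
  -> [[Double]]  -- pointwise 1x1 weights: N rows of M coefficients
  -> [Image]     -- M input channels
  -> [Image]     -- N output channels
depthwiseSeparable spatialKs pointwise channels =
  [ sumChannels (zipWith scale row depthwise) | row <- pointwise ]
  where
    depthwise = zipWith convolve spatialKs channels  -- space-only step
    scale a   = map (map (a *))                      -- a 1x1 convolution is just
                                                     -- a weighted sum of channels
```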

This way, a depthwise separable convolution has two distinct steps: a space-only
convolution and a channel recombination. This reduces the number of operations
to
$D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$^{9}.
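Plugging the same numbers as before into both cost formulas shows the saving; the ratio to the standard convolution is $1/N + 1/D_K^2$:

```haskell
-- Multiply-accumulate count of a depthwise separable convolution.
separableCost :: Int -> Int -> Int -> Int -> Int
separableCost dK m n dF = dK * dK * m * dF * dF  -- depthwise step
                        + m * n * dF * dF        -- 1x1 pointwise step

-- 3x3 kernels, 512 input and 512 output channels, 14x14 maps:
-- separableCost 3 512 512 14 ==  52283392
-- standardCost  3 512 512 14 == 462422016  (roughly 8.8x more work)
```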

To be continued...

## Summary

Convolutional neural networks, *aka* ConvNets, achieve translation invariance
and connection sparsity. Thanks to weight sharing, ConvNets dramatically
reduce the number of trainable parameters. The power of ConvNets comes from
training convolution filters, in contrast to manual feature engineering.

In the next post we will apply CNNs to MNIST and CIFAR-10 image challenges. Stay tuned!

## Further reading

Learned today:

- Interactive Image Kernels
- The Ancient Secrets of Computer Vision (online course)
- Why I prefer functional programming
- Efficient Parallel Stencil Convolution in Haskell

Deeper into neural networks:

**Footnotes**

1. In fact, this is cross-correlation, not convolution. Still, for neural networks it is practically the same operation.
2. The actual "squashing" activation was $f(x) = 1.7159 \tanh(\frac{2}{3} x)$.
3. By subsampling the LeNet authors mean local averaging in $2 \times 2$ squares with subsequent scaling by a constant, bias addition, and sigmoid activation. In modern CNNs, a simple max-pooling is performed instead.
4. We discussed one-hot encoding on Day 1.
5. To refresh your memory of the backprop algorithm, check out the previous days.
6. Typically, for each convolutional layer the number of parameters equals the number of kernel weights plus a trainable bias. For instance, for a $5 \times 5 \times 6$ kernel this number equals $5 \cdot 5 \cdot 6 + 1 = 151$.
7. Whereas I discuss only 2D convolutions, which are useful for image-like objects, convolutions can be generalized to 1D and 3D.
8. By higher-level features I mean features detected by layers deeper in the network, such as geometric shapes, eyes, ears, or even complete figures.
9. For the details, see the Mobilenet paper.
10. In the Xception architecture, a better result (on the ImageNet classification challenge) was achieved without an intermediate activation. In addition, activations, when present, are preceded by batch normalization, which results in no added bias term after convolutions.