Image credit

10 Days Of Grad: Deep Learning From The First Principles

Day 1: Learning Neural Networks The Hard Way

Neural networks is a topic that recurrently appears throughout my life. Once, when I was a BSc student, I got obsessed with the idea to build an "intelligent" machine1. I spent a couple of sleepless nights thinking. I read a few essays shedding some light on this philosophical subject, among which the most prominent, perhaps, stand Marvin Minsky's writings2. As a result, I came across neural networks idea. It was 2010, and deep learning was not nearly as popular as it is now3. Moreover, no one made much effort linking to neural networks in calculus or linear algebra curricula. Even folks doing classical optimization and statistics sometimes seemed puzzled when hearing about neural nets. Some year later, I have implemented a simple neural net with sigmoid activations as a part of a decision-making course. At the time I realized that existing state of our knowledge was still lacking to really build "thinking computers"4.

It was 2012, the conference in Crimea, Ukraine where I attended a brilliant talk by Prof. Laurent Larger. He explained how to build a high-speed hardware for speech recognition using a laser. The talk really inspired me, and a year later I have started a PhD with an aim to develop reservoir computing, recurrent neural networks implemented directly in hardware. Finally, now I am using deep neural networks as a part of my job.

In this series of posts I will highlight some curious details of problem solving with neural networks. I am pretty much against duplicate efforts, therefore I would avoid repeating aspects described many times somewhere else. For those who are completely new to the subject, instead of introducing neural networks all over again I would refer to Chapter 1.2 of my PhD thesis. Here I will only summarize that neural networks are to some extent inspired by biological neurons. Similarly to its biological counterpart, artificial neuron receives many inputs, performs a nonlinear transformation, and produces an output. The equation below formalizes this kind of behavior:

$$ \begin{equation} y = f(w_1 x_1 + w_2 x_2 + \dots + w_N x_N) = f(\sum_i w_i x_i), \end{equation} $$

where $N$ is the number of inputs $x_i$, $w_i$ are synaptic weights5, and $y$ is the result. Surprising as it may appear, in a modern neural network $f$ can be practically any nonlinear function. This nonlinear function is often called an activation function. Congratulations, we have arrived at a neural network from 1950-s.

To continue reading this article, some mathematical background is useful, but not mandatory. Anyone can develop an intuition about neural networks!

A Word On Haskell

This series of posts practically illustrate all the concepts in Haskell programming language. To motivate you, here are some Q & A you always wanted to know.

Q: Is there anything that makes Haskell particularly good for Neural Networks or are you simply doing it because you prefer to use Haskell?

A: Neural networks are very "function objects". A network is just a big composition of functions. These kinds of things are very natural in a functional language.

Q: As I am completely new to Haskell, would like to know what are the benefits of using Haskell vs python or other languages?

A: The benefits of using Haskell:

  1. It is much easier to reason about what your program is doing.
  2. Great when refactoring existing code base.
  3. Haskell shapes your thinking towards problem solving in a pragmatic way.
  4. Haskell programs are fast.

Q: I figured my lack of Haskell knowledge would make it hard to read, but your code examples still make sense to me.

A: Thank you. I find Haskell to be very intuitive when explaining neural nets.

Gradient Descent: A CS Freshman Year

Gradient Descent. Image credit

The very basic idea behind neural networks training and deep learning is a local optimization method known as gradient descent. For those who have hard time remembering their freshman year, just watch an introductory video explaining the idea.

How does a concept as simple as gradient descent work for a neural network? Well, a neural network is only a function6, mapping an input to some output. By comparing the neural network's output to some desired output, one can obtain another function known as an error function. This error function has a certain error landscape, like in the mountains. By using gradient descent, we modify our neural network in such a way that we are descending this landscape. Therefore, we aim at finding an error minimum. The key concept of the optimization method is that the error gradients give us a direction in which to change the neural network. In a similar way one would be able to descend a hill covered with a thick fog. In both cases only a local gradient (slope) is available.

Gradient descent can be described by a formula:

$$\begin{equation} x_{n+1} = x_n - \gamma \cdot \nabla F(x_n), \end{equation} $$

where constant $\gamma$ is what is referred to in deep learning as the learning rate, i.e. the amount of learning per iteration $n$. In the simplest case, $x$ is a scalar variable. The gradient descent method can be implemented in few lines of code. The Haskell code snippet below is interactive. Press the green triangle button to Run the code.

That outputs the following sequence: $0.0,2.4,2.88,2.976,2.9952,2.99904,2.999808,2.9999616,\dots$ Indeed, the value minimizing function $f(x)$ is $3$, i.e. $\min f(x) = f(3) = (x-3)^2 = 0$. And the sequence is gradually converging towards that number. Let us look more carefully what does the code above do.

Lines 1-3: We define the gradient descent method, which iteratively applies the function step implementing equation (2). We provide the intermediate results taking the first iterN values.

Line 5: Suppose, we would like to optimize a function $f(x) = (x-3)^2$. Its gradient gradF_test is then $\nabla f(x) = 2\cdot(x-3)$.

Lines 7-10: Finally, we run our gradient descent using learning rate $\gamma = 0.4$.

It is crucial to realize that the value of $\gamma$ affects the convergence. When $\gamma$ is too small, the algorithm will take many more iterations to converge, however, when $\gamma$ is too large the algorithm will never converge. At the moment, I am not aware of a good way how to determine the best $\gamma$ for a given problem. Therefore, often different $\gamma$ values have to be tried. Feel free to modify the code above and see what comes out! For example, you can try out different values of gamma, such as gamma = 0.01, gamma = 0.1, gamma = 1.1. The method is generalizable to $N$ dimensions. Essentially, we would replace the gradF_test function with the one operating on vectors rather than scalars.

Neural Network Ingredients For Classification

Iris plant. Image credit David Iliff (CC-BY-SA 3.0)

Now that we realize how the gradient descent works, we may want to train a moderately useful network. Let's say we want to perform Iris flower classification using four distinctive features7: sepal length, sepal width, petal length, petal width. There are three classes of flowers we want to be able to recognize: Setosa, Versicolour, and Virginica. Now, there is a problem: how do we encode those three classes so that our neural network can handle them?

Naive Solution And Why It Does Not Work

The most simple solution to indicate each species would be using natural numbers. For instance, Iris Setosa can be encoded as 1, Versicolour, as 2, and Virginica, as 3. There is, however, a problem with this kind of encoding: we impose a bias. First, by encoding those classes as numbers, we impose a linear order over those three classes. It means, we start our count with Setosa, then, Versicolour, and then we arrive at Virginica. However, in reality it doesn't really matter if we end with Virginica or Vernicolour. Second, we also assume that the distance between Virginica and Setosa 3 - 1 = 2 is larger than between Virginica and Versicolor 3 - 2 = 1, which is a priory wrong.

One-Hot Encoding

So which kind of encoding do we need? First, we want not to impose any restriction on ordering and second, we want the distances between classes to be equal. Therefore, we would prefer encoding each class to be orthogonal, i.e. independent from the other two. That becomes possible if we use vectors of three dimensions (as there are three classes). Therefore, now Setosa class is encoded as $[1, 0, 0]$, Versicolour, as $[0, 1, 0]$, and Virginica as $[0, 0, 1]$. The Euclidean distance between any pair of classes is equal to $\sqrt 2$. Update8: For example, the distance between Setosa and Versicolour is computed as

$$\sqrt{(1-0)^2 + (0-1)^2 + (0-0)^2}=\sqrt{2}.$$

Putting It All Together

Now that we are familiar with basic neural networks and gradient descent and also have some data to play with7, let the fun begin!

First, we create a network of three neurons. To do that, we generalize formula (1):

$$\begin{equation} y_{i}=f (\sum_k w_{ik} x_k), \end{equation} $$ where $(x_1, x_2, x_3, x_4) = \mathbf{x}$ is a 4D input vector, $w_{ik} \in \mathbf{W}, i=1 \dots 3, k=1 \dots 4$ is synaptic weights matrix, and result $(y_1, y_2, y_3) = \mathbf{y}$ is a 3D vector. Generally speaking, we perform a matrix-vector multiplication with a subsequent element-wise activation:

$$\begin{equation} \mathbf y = f (\mathbf {W x}). \end{equation} $$

As a nonlinear activation function $f$ we will use the sigmoid function9 $\sigma(x) = [1 + e^{-x}]^{-1}$. We will exploit hmatrix Haskell library for linear algebra operations such as matrix multiplication. With hmatrix, Equation (4) can be written as:

import Numeric.LinearAlgebra as LA

sigmoid = cmap f
    f x = recip $ 1.0 + exp (-x)

forward x w =
  let h = x LA.<> w
      y = sigmoid h
  in [h, y]

where <> denotes the matrix product function from LA module. Note that x can be a vector, but it can be also a dataset matrix. In the latter case, forward will transform our entire dataset. Notice that we provide not only the result of our computation y, but also an intermediate step h since it will be later reused for w gradient computation.

Each neuron $y_i$ is supposed to fire when it 'thinks' that it has detected one of the three species. E.g. when we have an output [0.89, 0.1, 0.2], we would assume that the fist neuron is the most 'confident', i.e. we interpret the result as Setosa. In other words, this output is treated as similar to [1, 0, 0]. As you can see, the maximal element was set to one and others, to zero. This is a so-called 'winner takes all' rule.

Before training the neural network, we need some measure of error or loss function to minimize. For instance, we can use the Euclidean loss $\text{loss} = \sum_i (\hat y_i - y_i)^2$ where $\hat y_i$ is a prediction and $y_i$ is a real answer from our dataset:

loss y tgt =
  let diff = y - tgt
  in sumElements $ cmap (^2) diff

For the sake of gradient descend illustration, we will reuse the descend function defined above. Now, we have to specify the gradient function for our neural network equation (4). We use what is called a backpropagation or shortly backprop method, which is essentially a result of the chain rule and is illustrated for an individual neuron in the figure below10.

First, in the forward pass, initial output $y$ is calculated. Then, this output is compared to some desired output and the error gradient $dy$ is passed back. Afterwards, the activation function gradient $df$ is obtained using $dy$. That ultimately leads to the remaining gradients $dw_1$, $dx_1$, $dw_2$, $dx_2, \dots .$

Backpropation for a single neuron.

First, in the forward pass, initial output $y$ is calculated. Then, this output is compared to some desired output and the error gradient $dy$ is passed back. Afterwards, the activation function gradient $df$ is obtained using $dy$. That ultimately leads to the remaining gradients $dw_1$, $dx_1$, $dw_2$, $dx_2, \dots .$

Now we can calculate the weights gradient dW using the backprop method from above:

grad (x, y) w = dW
    [h, y_pred] = forward x w
    dE = loss' y_pred y
    dY = sigmoid' h dE
    dW = linear' x dY

Here linear', sigmoid', loss' are gradients of linear operation (multiplication), sigmoid activation $\sigma(x)$, and the loss function. Note that by operating on matrices rather than scalar values we calculate the gradients vector dW denoting every synaptic weight gradient $dw_i$. Below are those "vectorized" functions definitions in Haskell using hmatrix library11:

linear' x dy = cmap (/ m) (tr' x LA.<> dy)
    m = fromIntegral $ rows x

sigmoid' x dY = dY * y * (ones - y)
    y = sigmoid x
    ones = (rows y) >< (cols y) $ repeat 1.0

loss' y tgt =
  let diff = y - tgt
  in cmap (* 2) diff

To test our network, we download the dataset here (there are two files: x.dat and y.dat) and the code here. As instructed in the comments, we run our program:

$ stack --resolver lts-10.6 --install-ghc runghc --package hmatrix- Iris.hs
Initial loss 169.33744797846379
Loss after training 61.41242708538934
Some predictions by an untrained network:
[ 8.797633210095851e-2, 0.15127581829026382, 0.9482351750129188
,  0.11279346747947296,  0.1733431584272155, 0.9502442520696124
,  0.10592462402394615, 0.17057190568339017, 0.9367875655363787
,  0.10167941966201806, 0.20651101803783944, 0.9300343579182122
, 8.328154248684484e-2, 0.15568011758813116,  0.940816298954776 ]
Some predictions by a trained network:
[ 0.6989749292681016, 0.14916793398555747,  0.1442697900857393
,  0.678406436711954,  0.1691062984304366,  0.2052955124240905
, 0.6842327447503195, 0.16782087736820395, 0.16721778476233148
, 0.6262988163006756, 0.19656943129188192, 0.17521133197774072
, 0.6905553549763312, 0.15299944611286123, 0.12910826989854146 ]
[ 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0 ]

That's all for today. Please feel free to play with the code. Hint: you may have noticed that grad calls sigmoid twice on the same data: once in forward and once in sigmoid'. Try optimizing the code to avoid this redundancy.

As soon as you understand the basics of neural networks, make sure you continue to Day 2. In the next post you will learn how to make your neural network operational. First of all, we will highlight the importance of multilayer structure. We will also show that nonlinear activations are crucial. Finally, we will improve neural network training and discuss weights initialization.

  1. There even exists a term describing exactly what I dreamed to achieve. ^
  2. For instance, check Why people think computers can't. ^
  3. With Google Trends in hand, we can witness the raise of global deep learning interest. ^
  4. Despite people have fantasized about it long before Alan Turing. ^
  5. Essentially, a synaptic weight $w_i$ determines the strength of connection to the $i$-th input. ^
  6. With recurrent neural networks it is not true. Those have an internal state, making such networks equivalent to computer programs, i.e. potentially more complex than maps. ^
  7. Here we refer to the classical Iris dataset. If you prefer another light dataset please let me know. ^
  8. Kudos to Peter Harpending who spotted a typo in Euclidean distance. ^
  9. In this example nonlinear activation is not essential. However, as we will see in future posts, in multilayer neural networks nonlinear activations are strongly required. ^
  10. We look closer at the backpropagation mechanics here. ^
  11. And here are their mathematical derivations. ^
Next: Day 2: What Do Hidden Layers Do?