Neural Nets - Backpropagation

Back to talking about NNs a bit more! When we left off last time, we had talked about linear and nonlinear decision boundaries. That’s why multi-layered NNs apply at least one nonlinear function to the input (often many, and sometimes many different ones): it’s what lets them learn nonlinear decision boundaries. So how do we even start using a Neural Network? It’s beneficial to talk about the input before we talk about all the things we do to that input.

Input for Neural Networks

Let’s say we have the sentence, “Neural networks rock”. We want to take every sequence of 5 characters and turn it into an input feature for our Neural Network. How do we make this feature? We create something called a one-hot encoding. For each character, this is a binary vector whose length equals the number of possible characters in our vocabulary: the position corresponding to that character is a 1, and every other position is a 0. If our vocabulary were just the English alphabet and the space character, we’d have a vector of length 27, one for each letter and one for the space.

So now let’s take the first 5 characters of our text, “Neura”, and transform it. We create a vector for each character. The first vector would be 0 in all 27 positions except the 14th (for ‘n’), the second would be 0 everywhere except the 5th (for ‘e’), and so on. We take these 5 vectors of length 27 and concatenate them into one vector of length 5*27. From now on, we will let the variable \(c:=27\), so that we have a vector of length 5c. We then make another 5c-length vector for ‘eural’, one for ‘ural ’, and we keep sliding one character at a time until we reach the last window, ‘ rock’. This gives 16 vectors of length 5c, and each 5c vector will be a single input for our Neural Network.
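To make this concrete, here is a rough Python sketch of how those window vectors could be built with NumPy. The vocabulary string and function names are just made up for this illustration, not anything fixed by the post:

```python
import numpy as np

# Hypothetical 27-character vocabulary: lowercase English letters plus the space.
VOCAB = "abcdefghijklmnopqrstuvwxyz "
c = len(VOCAB)  # 27

def one_hot(char):
    """Length-c binary vector with a 1 at the character's position in VOCAB."""
    vec = np.zeros(c)
    vec[VOCAB.index(char.lower())] = 1.0
    return vec

def window_vectors(text, window=5):
    """Concatenate one-hot vectors for every window of `window` consecutive characters."""
    return [np.concatenate([one_hot(ch) for ch in text[i:i + window]])
            for i in range(len(text) - window + 1)]

inputs = window_vectors("Neural networks rock")
print(len(inputs), inputs[0].shape)  # 16 windows, each a vector of length 5 * 27 = 135
```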

Architecture for Neural Networks

So we have our input. It’s a vector of length 5c that is 0 everywhere except in 5 positions. What do we do with this? Well, in the first post, we multiplied weights with our data (e.g., the 5c-length vector) and added a bias to make a unit. Here, we usually have a weight matrix, so that we can do a single matrix multiplication and get many units at once. Then we apply a nonlinear transformation to the result, and so on and so forth. For example, let’s use the following weight matrices and bias vectors to predict three languages: English, Italian, and French.

  • \(W^{1}\) is a weight matrix of dimension \(d, 5c\) (so that \(W^{1}x\) is a vector of length \(d\))
  • \(b^{1}\) is a bias vector of dimension \(d, 1\)
  • \(W^{2}\) is a weight matrix of dimension \(3,d\)
  • \(b^{2}\) is a bias vector of dimension \(3,1\)
  • \(x\) is our input vector of length 5c

Note that we need to figure out which matrices and vectors best transform the input such that we get the right output. To start, we just assign the values randomly. If our output is super far off, we should change the values of our matrices. If our output is spot on, then our matrices are probably pretty good as is.
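For concreteness, a random initialization could look something like this NumPy sketch. The hidden size d and the small 0.01 scale are arbitrary choices for the illustration, not values from the post:

```python
import numpy as np

c, d = 27, 100                   # vocabulary size, and a hidden size we pick ourselves
rng = np.random.default_rng(0)   # seeded so the "random" starting point is reproducible

W1 = 0.01 * rng.standard_normal((d, 5 * c))  # maps the 5c-length input to d hidden units
b1 = 0.01 * rng.standard_normal((d, 1))
W2 = 0.01 * rng.standard_normal((3, d))      # maps the d hidden units to 3 languages
b2 = 0.01 * rng.standard_normal((3, 1))
```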

So what do we do with this? Do some matrix multiplication (a linear transformation), transform the result (nonlinearly), do more matrix multiplication (another linear transformation), etc. In this case, we have the following transformations (a small NumPy sketch of the full forward pass follows the list):

  • \(h_{linear}:= W^{1}x + b^{1}\)
  • \(h = \sigma(h_{linear}); \sigma(z) = \frac{1}{1+e^{-z}}\)
  • \(y_{linear} = W^{2}h + b^{2}\)
  • \(\hat{y} = softmax(y_{linear}); softmax(y)[i] = \frac{e^{y[i]}}{\sum_{j}{e^{y[j]}}}\)
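Putting those four steps together, a forward pass could be sketched in NumPy roughly like this, assuming the shapes listed above and \(x\) as a column vector of shape (5c, 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(y):
    e = np.exp(y - np.max(y))      # subtracting the max avoids overflow
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Two linear transformations, each followed by a nonlinearity."""
    h_linear = W1 @ x + b1         # first linear transformation
    h = sigmoid(h_linear)          # elementwise sigmoid
    y_linear = W2 @ h + b2         # second linear transformation
    return softmax(y_linear)       # probability distribution over the 3 languages
```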

The softmax function makes a “probability distribution” vector, so that we can see which language has the maximum probability. We do this for every input vector in the sentence. Now, we have 16 different probability distributions. How should we predict one language? You can either predict for each of the 16 vectors and then take the majority vote, or maybe take the average of them all and then take the maximum. Or you can do something else, as long as it makes sense :)
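With some made-up per-window distributions, both strategies are only a couple of lines (just a sketch, and certainly not the only sensible choices):

```python
import numpy as np

# Pretend these are the per-window probability vectors from the forward pass
# (in our running example there would be 16 of them, one per 5-character window).
probs = [np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.7, 0.2]), np.array([0.6, 0.3, 0.1])]

# Option 1: take each window's most likely language, then take the majority vote.
votes = [int(np.argmax(p)) for p in probs]
majority = max(set(votes), key=votes.count)

# Option 2: average the distributions, then take the most likely language overall.
averaged = np.mean(probs, axis=0)
print(majority, int(np.argmax(averaged)))
```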

Improving our Neural Network

So, we have our averaged probability distribution, \(\hat{y}\). We can compare it to the ground truth, a vector of length 3 with a 1 in the position of the correct language, e.g., French could be [1, 0, 0], English [0, 1, 0], and Italian [0, 0, 1]. To see how incorrect we are, we need a loss function. A loss function takes our prediction and our true label, and gives a numerical sense of how far away we were. Let’s choose the squared-error function \(l(\hat{y}, y) = \frac{1}{2} \sum_{i} (\hat{y}[i] - y[i])^{2}\). If our \(\hat{y}\) was something like [.25, .25, .5] and our label was [1, 0, 0], we’d be pretty far off from the truth. So what do we do? As mentioned before, we assigned the values of our weight matrices and bias vectors randomly. Since we were so far off, we should probably adjust them and hope our next output is closer to the ground truth. How do we do this? Backpropagationnnnnnn
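As a quick sanity check of that loss on the example above (just a sketch of the formula, with the ground-truth encoding assumed from the post):

```python
import numpy as np

def squared_error(y_hat, y):
    """Half the sum of squared differences between prediction and label."""
    return 0.5 * np.sum((y_hat - y) ** 2)

y_hat = np.array([0.25, 0.25, 0.5])   # our averaged prediction
y = np.array([1.0, 0.0, 0.0])         # ground truth (French, in the encoding above)
print(squared_error(y_hat, y))        # 0.4375 -- pretty far off, as expected
```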

Backpropagation

Note that this part will be pretty technical behind the scenes. I’ve tried to condense it to be as simple as possible, so it may be lacking some information.

We’re currently facing this problem of wanting to adjust our weights and biases. Well, the degree to which we need to change them depends on how incorrect we are. More incorrect implies a more drastic change. So basically, we should take some information from our loss and use that to figure out just how much we should change. This is a key aspect of something called Stochastic Gradient Descent (SGD). Say we’re looking at the function \(f(x) = x^{2}\). The bottom of the parabola is at x = 0. Let’s say we’re at x = -3 and we want to get to the bottom. Well, we should go to the right, correct? If we were at x = 3, we should travel to the left to get to the bottom. If we decide to go left, but take huge steps of length 5, then our next step would lead us to x = -2. So then we’d need to go right again. This isn’t ideal. What if we took steps of length .1? If we started at x = 3 and went left, we’d eventually get to the bottom, but it’d take forever. This is pretty much what we’re doing with stochastic gradient descent.
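Here is a tiny sketch of that idea on \(f(x) = x^{2}\), whose slope at any point is \(2x\). The step size and iteration count are arbitrary choices for the illustration:

```python
def descend(x, step=0.1, iters=50):
    """Repeatedly step against the slope of f(x) = x**2 to approach the minimum at 0."""
    for _ in range(iters):
        x = x - step * (2 * x)   # the slope 2x tells us both direction and steepness
    return x

print(descend(3.0))    # a value very close to 0, approached from the right
print(descend(-3.0))   # also very close to 0, approached from the left
```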

We have a function and we want to find the minimum. We first determine which direction we should travel in and how large our steps should be. We take the step, see how far away we are now, determine the direction we should go (if we happened to overshoot), and try again. Eventually, we (hopefully) get close enough to say we’ve approximated the minimum. This is also what we’re doing with backpropagation. We see how far away we are from the true label. We determine how drastically we will change the weights (this is called the learning rate and is fixed in our case) and take something called the gradient to figure out which direction we should move in, then adjust the weights! In a formula, we’d do something like \(W^{1} = W^{1} - \alpha \nabla_{W^{1}} L\), where \(\alpha\) is the learning rate and \(\nabla_{W^{1}} L\) is the gradient of the loss with respect to \(W^{1}\). We do this for all our weights and biases for each \(\hat{y}\) we get. Then we take our updated weights and try again at predicting our target. If we are a bit closer, then the next time we do SGD, the weights might not change as drastically; the size of the update is influenced by our loss. We continue this process until we go through all of the examples in our data, which is referred to as an epoch. Hopefully, our model has improved such that when you try to predict the language of a sentence, it will be spot on. :)

Neural networks can be used for many things other than identifying languages. They can be used to classify documents in general (tweets, paragraphs, songs, posts, etc.), they’re used in computer vision, and so on and so forth. I hope this helps you understand a bit more about NNs. If you want an example of how to implement one from scratch, check out my repository on github! It even has the derivation that explains how to do backpropagation in the ipynb file, while the finished code is in the py file. Happy machine learning! (insert bicep emoji)
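P.S. To give a rough flavor of what one backpropagation + SGD update looks like for the two-layer network above, here is a condensed NumPy sketch. The gradient algebra is my own abbreviated version for the squared-error loss and softmax output described in this post, not the repository’s exact code; the full derivation lives in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def sgd_step(x, y, W1, b1, W2, b2, alpha=0.1):
    """One forward pass, backpropagation, and gradient step for a single example."""
    # Forward pass (same as earlier in the post).
    h = sigmoid(W1 @ x + b1)                      # hidden units, shape (d, 1)
    y_hat = softmax(W2 @ h + b2)                  # predicted distribution, shape (3, 1)

    # Backward pass for the loss 0.5 * sum((y_hat - y) ** 2).
    d_yhat = y_hat - y                            # dL / d y_hat
    softmax_jac = np.diagflat(y_hat) - y_hat @ y_hat.T
    d_ylin = softmax_jac @ d_yhat                 # dL / d y_linear
    dW2, db2 = d_ylin @ h.T, d_ylin
    d_hlin = (W2.T @ d_ylin) * h * (1 - h)        # back through the sigmoid
    dW1, db1 = d_hlin @ x.T, d_hlin

    # Gradient step: parameter = parameter - alpha * gradient.
    return (W1 - alpha * dW1, b1 - alpha * db1,
            W2 - alpha * dW2, b2 - alpha * db2)
```

In a full training loop you would call something like this for every window vector of every training sentence, and one pass through all of the training examples is the epoch mentioned above.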