Lesson 3: Practical Deep Learning for Coders 2022

[Jeremy Howard]

Lesson 3 Overview

Hi everybody and welcome to Lesson 3 of Practical Deep Learning for Coders.

Course Pace Feedback

We did a quick survey this week to see how people feel the course is tracking. And over half of you think it’s about right pace and of the rest who aren’t, some of you think it’s a bit slow and some of you think it’s a bit fast. So hopefully that’s about the best we can do. Generally speaking, the first two lessons are a little more easy pacing for anybody who’s already familiar with the basic technology pieces. And then the later lessons get more into some of the foundations. Today we’re going to be talking about things like the matrix multiplications and gradients and calculus and stuff like that. So for those of you who are more mathy and less computery, you might find this one more comfortable and vice versa.

[00:01:01]

So remember that there is an official course updates thread where you can see all of the up-to-date info about everything you need to know, and of course the course website as well.

Course Updates and Forum

So by the time you watch the video of the lesson, it’s pretty likely that if you come across a question or an issue, somebody else will have too. So definitely search the forum and check the FAQ first, and then of course feel free to ask a question yourself on the forum if you can’t find your answer. One thing I did want to point out which you’ll see in the lesson thread and the course website is there is also a Lesson Zero.

Lesson Zero and Meta-Learning

Lesson Zero is based heavily on Radek’s book, Meta-Learning, which in turn is based heavily on all the things that I’ve said over the years about how to learn fast. We try to make the course full of tidbits about the science of learning itself and put them into the course.

[00:02:05]

It’s a different course to probably any other you’ve taken, and I strongly recommend watching Lesson Zero as well. The last bit of Lesson Zero is about how to set up a Linux box from scratch, which you can happily skip over unless that’s of interest, but the rest of it is full of juicy information that I think you’ll find useful.

Fast.ai Lesson Approach

So the basic idea of how to do a fast.ai lesson is watch the lecture. I generally recommend watching the video all the way through without stopping once, and then going back and watching it with lots of pauses, running the notebook as you go. Because otherwise you’re kind of like running the notebook without really knowing where it’s heading, if that makes sense. As for running the notebook, there’s a few notebooks you could go through.

[00:03:00]

So obviously there’s the book. So going through Chapter 1 of the book, going through Chapter 2 of the book as notebooks, running every code cell and experimenting with inputs and outputs to try and understand what’s going on. And then trying to reproduce those results. And then trying to repeat the whole thing with a different data set. If you can do that last step, that’s quite a stretch goal, particularly at the start of the course because there’s so many new concepts, but that really shows that you’ve got it sorted. Now for this third bit, Reproduce Results, I recommend using, you’ll find in the fastbook repo, so the repository for the book, there is a special folder called Clean.

Clean Notebooks for Self-Study

And Clean contains all of the same chapters of the book, but with all of the text removed except the headings and all of the outputs removed. And this is a great way for you to test your understanding of the chapter, is before you run each cell, try to say to yourself, okay what’s this for and what’s it going to output, if anything.

[00:04:06]

And if you kind of work through that slowly, that’s a great way, and any time you’re not sure, you can jump back to the version of the notebook with the text to remind yourself and then head back over to the Clean version. So there’s an idea for something which a lot of people find really useful for self-study.

Importance of Social Learning

I say self-study, but of course, as we’ve mentioned before, the best kind of study is study done to some extent with others, for most people. The research shows that you’re more likely to stick with things if you’re doing it as kind of a bit of a social activity. The forums are a great place to find and create study groups, and you’ll also find on the forums a link to our Discord server, where there are some study groups there as well.

[00:05:01]

So in-person study groups, virtual study groups are a great way to really make good progress and find other people at a similar level to you. If there’s not a study group going at your level, in your area, in your time zone, create one. Just post something saying, hey let’s create a study group.

Forum Highlights and Projects

So this week there’s been a lot of fantastic activity. I can’t show all of it, so what I did was I used the summary functionality in the forums to grab all of the things with the highest votes, so I’ll just quickly show a few of those. We have a Marvel detector created this week, identify your favorite Marvel character. I love this, a rock-paper-scissors game where you actually use pictures of the rock-paper-scissors symbols, and apparently the computer always loses, that’s my favorite kind of game. There is a lot of Elon around, so very handy to have an Elon detector to either find more of him, if that’s what you need, or maybe less of him.

[00:06:07]

I thought this one was very interesting. I love these kind of interesting ideas, like gee I wonder if this would work. Can you predict the average temperature of an area based on an aerial photograph? And apparently the answer is yes, you can predict it pretty well. Here in Brisbane, it was predicted I believe to be within 1.5 Celsius. I think this student is actually a genuine meteorologist, if I remember correctly. He built a cloud detector. So then building on top of the what’s your favorite Marvel character, there’s now also an is it a Marvel character. My daughter loves this one, what dinosaur is this? I’m not as good about dinosaurs as I should be. I feel like there’s 10 times more dinosaurs than there was when I was a kid, so I never know their names, so this is very handy.

[00:07:02]

This is cool, choose your own adventure where you choose your path using facial expressions. I think this music genre classification is also really cool. Brian Smith created a Microsoft PowerApp application that actually runs on a mobile phone, so that’s pretty cool. I wouldn’t be surprised to hear that Brian actually works at Microsoft, so also an opportunity to promote his own stuff there. I thought this art movement classifier was interesting in that there’s a really interesting discussion on the forum about what it actually shows, about similarities between different art movements. And I thought this redaction detector project was really cool as well, and there’s a whole tweet thread and blog post and everything about this one, a great piece of work. So I’m going to quickly show you a couple of little tips before we jump into the mechanics of what’s behind a neural network, which is that during the week I was playing a little bit with how do you make your neural network more accurate.

[00:08:15]

Pet Detector Project

And so I created this pet detector, and this pet detector is not just predicting dogs or cats, but what breed is it. That’s obviously a much more difficult exercise. Now because I put this out on Hugging Face Spaces, you can download and look at my code, because if you just click Files and Versions on the space, which you can find a link on the forum and the course website, you can see them all here and you can download it to your own computer. So I’ll show you what I’ve got here. Now one thing I’ll mention is today I’m using a different platform, so in the past I’ve shown you Colab and I’ve shown you Kaggle, and we’ve also looked at doing stuff on your own computer, not so much training models on your computer, but using the models you’ve trained to do applications.

[00:09:16]

PaperSpace Gradient Notebooks

PaperSpace is another website a bit like Kaggle and Google, but in particular they have a product called Gradient Notebooks, which is, at least as I speak, and things change all the time, so check the course website, but as I speak, in my opinion is by far the best platform for running this course and for doing experimentation. I’ll explain why as we go. So why haven’t I been using it the past two weeks? Because I’ve been waiting for them to build some stuff for us to make it particularly good and they’ve just finished. I’ve been using it all week and it’s totally amazing. This is what it looks like.

[00:10:01]

So you’ve got a machine running in the cloud, but the thing that’s very special about it is it’s a real computer you’re using. It’s not like that kind of weird virtual version of things that Kaggle or Colab has. So if you whack on this button down here, you’ll get a full version of JupyterLab, or you can switch over to a full version of classic Jupyter notebooks. I’m actually going to do stuff in JupyterLab today because it’s a pretty good environment for beginners who are not familiar with the terminal, which I know a lot of people in the course are in that situation. You can do really everything graphically, there’s a file browser, so here you can see I’ve got my Pets repo, it’s got a Git repository thing you can pull and push to Git. And then you can also open up a terminal, create new notebooks, and so forth.

[00:11:06]

So what I tend to do with this is I tend to go into a full screen, which is kind of like its own whole IDE. And so you can see I’ve got here my terminal, and here’s my notebook. They have free GPUs, and most importantly there’s two good features. One is that you can pay, I think it’s $8 or $9 a month to get better GPUs, basically as many hours as you want. And they have persistent storage. So with Colab, if you’ve played with it, you might have noticed it’s annoying, you have to muck around with saving things to Google Drive and stuff. On Kaggle there isn’t really a way of having a persistent environment, whereas on PaperSpace you have whatever you save in your storage, it’s going to be there the next time you come back.

[00:12:02]

I’m going to be adding walkthroughs of all of this functionality, so if you’re interested in really taking advantage of this, check those out.

Key Concepts from Lesson 2

So I think the main thing that I wanted you to take away from Lesson 2 isn’t necessarily all the details of how to use a particular platform to train models and deploy them into applications through JavaScript or online platforms. But the key thing I wanted you to understand was the concept. There’s really two pieces. There’s the training piece, and at the end of the training piece you end up with this model.pickle file. And once you’ve got that, that’s now a thing where you feed it inputs and it spits out outputs based on that model that you trained. Because that happens pretty fast, you generally don’t need a GPU once you’ve got that trained.

[00:13:02]

So then there’s a separate step which is deploying. So I’ll show you how I trained my pet classifier.

Pet Classifier Training

So you can see I’ve got two IPython notebooks. One is app, which is the one that’s going to be doing the inference in production, and one is the one where I train the model. So this first bit I’m going to skip over because you’ve seen it before. I create my image data loaders, check that my data looks okay with show_batch, train a ResNet34, and I get a 7% error rate. So that’s pretty good. But check this out. There’s a link here to a notebook I created, actually most of the work was done by Ross Wightman, where we can try to improve this by finding a better architecture.

[00:14:03]

Exploring Architectures with PyTorch Image Models

There are, I think at the moment in the PyTorch image models library, over 500 architectures and we’ll be learning over the course what they are, how they differ. But broadly speaking, they’re all mathematical functions, which are basically matrix multiplications and these nonlinearities such as ReLUs that we’ll talk about today. Most of the time those details don’t matter, what we care about is three things. How fast are they, how much memory do they use, and how accurate are they. What I’ve done here with Ross is we’ve grabbed all of the models from PyTorch image models and you can see all the code we’ve got, there’s very, very little code, to create this plot.

[00:15:00]

And so on this plot, on the x-axis we’ve got seconds per sample, so how fast is it, to the left is better, it’s faster. And on the y-axis is how accurate is it, so how accurate was it on ImageNet in particular. And so generally speaking, you want things that are up towards the top and left. Now we’ve been mainly working with ResNet, and you can see down here, here’s ResNet-18. ResNet-18 is a particularly small and fast version for prototyping. We often use ResNet-34, which is this one here. And you can see this kind of classic model that’s very widely used, actually nowadays isn’t the state of the art anymore. So we can start to look up at these ones up here and find out some of these better models. The ones that seem to be the most accurate and fast are these LeViT models.

[00:16:00]

So I tried them out on my pets and I found that they didn’t work particularly well, so I thought okay, let’s try something else out.

ConvNext Models for Improved Accuracy

So next up I tried these ConvNext models. And this one in here was particularly interesting, it’s kind of like super high accuracy. If you want .001 seconds inference time, it’s the most accurate. So I tried that. So how do we try that? All we do is I can say, so the PyTorch image models is in the timm module, so at the very start I imported that. And we can say list_models and pass in a glob, a match, and so this is going to show all the ConvNext models. And here I can find the ones that I just saw. And all I need to do is when I create the vision_learner, I just put the name of the model in as a string. So you’ll see earlier this one is not a string, that’s because it’s a model that Fast.ai provides, the library.

[00:17:08]

Fast.ai only provides a pretty small number. So if you install timm, so you’ll need to pip install timm or conda install timm, you’ll get hundreds more and you put that in a string. So if I now train that, the time for these epochs goes from 20 seconds to 27 seconds, so it is a little bit slower, but the error rate goes from 7.2% down to 5.5%. So that’s a pretty big relative difference. 7.2 divided by 5.5, so about a 30% improvement. So that’s pretty fantastic. It’s been a few years, honestly, since we’ve seen anything really beat ResNet that’s widely available and usable on regular GPUs.
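To make the arithmetic behind that comparison concrete, here is a small sketch. It treats the 7.2% and 5.5% figures quoted above as error rates (lower is better); the variable names are made up for illustration. It shows the ratio mentioned in the lecture alongside the relative reduction in error, which is another common way to state the same change.

```python
# Error rates quoted in the lecture, in percent (lower is better).
resnet34_error = 7.2
convnext_error = 5.5

# The ratio mentioned in the lecture: old error divided by new error.
ratio = resnet34_error / convnext_error

# An alternative view of the same change: the fraction of errors removed.
relative_reduction = (resnet34_error - convnext_error) / resnet34_error

print(f"ratio of error rates: {ratio:.2f}")
print(f"relative reduction in error: {relative_reduction:.1%}")
```

Either way you look at it, the gap is large compared with the extra seven seconds per epoch.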

[00:18:00]

So this is a big step. There’s a few architectures nowadays that really are probably better choices a lot of the time. So if you are not sure what to use, try these ConvNext architectures. You might wonder what the names are about, obviously tiny, small, large, etc. is how big is the model. So that will be how much memory is it going to take up and how fast is it. And then these ones here, in22ft1k, these ones have been trained on more data. So ImageNet, there’s two different ImageNet datasets, there’s one that’s got 1000 categories of pictures and there’s another one that’s got 22,000 categories of pictures. So this is trained on the one with 22,000 categories of pictures. So these are generally going to be more accurate on kind of standard photos of natural objects. So from there I exported my model and that’s the end.

[00:19:01]

Model Export and Application

So now I’ve trained my model and I’m all done. Other things you could do is add more epochs, for example, add image augmentation, there’s various things you can do. But you know, I found this is actually pretty hard to beat this by much. If any of you find you can do better, I’d love to hear about it. So then to turn that into an application, I just did the same thing that we saw last week, which was to load the learner. Now this is something I did want to show you.

Understanding the Model.pickle File

The learner, once we load it and call predict, spits out a list of 37 numbers. That’s because there are 37 breeds of dog and cat. So these are the probability of each of those breeds. What order are they in? That’s an important question. The answer is that Fast.ai always stores this information about categories, this is a category in this case of dog or cat breed, in something called the vocab object, and it’s inside the data loaders.

[00:20:03]

So we can grab those categories, and that’s just a list of strings, it just tells us the order. So if we now zip together the categories and the probabilities, we’ll get back a dictionary that tells you, like so. So here’s that list of categories, and here’s the probability of each one. And this was a Basset hound, so there you can see, almost certainly a Basset hound. So from there, just like last week, we can go and create our interface and then launch it, and there we go. So what did we just do really? What is this magic model.pickle file? So we can take a look at the model.pickle file. It’s an object called a learner, and a learner has two main things in it. The first is the list of preprocessing steps that you did to turn your images into things for the model.
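The zip step described above can be sketched in plain Python. The category names and probabilities below are made-up stand-ins (the real list has 37 breeds and comes from the data loaders' vocab), so treat this as an illustration of the pattern rather than the app's actual code.

```python
# Hypothetical stand-ins: three breed names in vocab order, and the
# probabilities a model might return in that same order.
categories = ["Abyssinian", "basset_hound", "beagle"]
probs = [0.001, 0.997, 0.002]

# zip pairs each category with the probability at the same position,
# and dict turns those pairs into a lookup table.
result = dict(zip(categories, map(float, probs)))

# The most likely category is the key with the largest value.
best = max(result, key=result.get)
print(result)
print(best)
```

The pairing only works because both lists are in the same order, which is why knowing the order of the vocab matters.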

[00:21:06]

And that’s basically this information here. So it’s your data blocks or your image data loaders or whatever.

Exploring Model Layers and Parameters

And then the second thing, most importantly, is the trained model. And so you can actually grab the trained model by just grabbing the .model attribute, so I’m just going to call that m. And then if I type m, I can look at the model. And so here it is, lots of stuff. So what is this stuff? Well we’ll learn about it all over time, but basically what you’ll find is it contains lots of layers, because this is a deep learning model. And you can see it’s kind of like a tree, that’s because lots of the layers themselves consist of layers. So there’s a whole layer called the TimmBody, which is most of it.

[00:22:00]

And then right at the end there’s a second layer called Sequential. And then the TimmBody contains something called model, and then it contains something called stem, and something called stages, and then stages contain 0, 1, 2, etc. So what is all this stuff? Well let’s take a look at one of them. So to take a look at one of them, there’s a really convenient method in PyTorch called get_submodule, where we can pass in a dotted string navigating through this hierarchy. So 0, model, stem, 1, goes 0, model, stem, 1. So this is going to return this LayerNorm2d thing. So what is this LayerNorm2d thing? Well the key thing is it’s got some code, it’s the mathematical function that we talked about. And then the other thing that we learned about is it has parameters. So we can list its parameters and look at this. It’s just lots and lots and lots of numbers.

[00:23:03]

Let’s grab another example. We could have a look at 0.model.stages.0.blocks.1.mlp.fc1 and parameters, another big bunch of numbers. So what’s going on here? What are these numbers and where on earth did they come from and how come these numbers can figure out whether something is a basset hound or not.
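Here is a minimal sketch of that kind of navigation, using a tiny made-up model rather than the real ConvNext learner. The layer sizes and nesting are assumptions chosen only to keep the example small; `get_submodule` and `parameters` are the standard PyTorch `nn.Module` methods.

```python
import torch
from torch import nn

# A tiny stand-in model: layers that themselves contain layers,
# loosely mirroring the body/head nesting described above.
model = nn.Sequential(
    nn.Sequential(nn.Linear(4, 8), nn.ReLU()),  # plays the role of the body
    nn.Sequential(nn.Linear(8, 2)),             # plays the role of the head
)

# Walk the hierarchy with a dotted path: child "0", then its child "0".
layer = model.get_submodule("0.0")

# Each parameter is just a tensor of numbers; look at shapes and counts.
shapes = [tuple(p.shape) for p in layer.parameters()]
n_params = sum(p.numel() for p in layer.parameters())

print(layer)
print(shapes, n_params)
```

On a real model the dotted path would look like the ones in the lecture, and the parameter tensors would be much larger.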

How Does a Neural Network Really Work?

So to answer that question, we’re going to have a look at a Kaggle notebook, how does a neural network really work. I’ve got a local version of it here which I’m going to take you through.

Machine Learning as Function Fitting

The basic idea is machine learning models are things that fit functions to data. So we start out with a very, very flexible, in fact an infinitely flexible as we’ve discussed function, a neural network, and we get it to do a particular thing, which is to recognize the patterns in the data examples we give it.

[00:24:13]

So let’s do a much simpler example than a neural network.

Quadratic Function Example

Let’s do a quadratic. So let’s create a function f which is 3x² plus 2x plus 1. So it’s a quadratic with coefficients 3, 2 and 1. So we can plot that function f and give it a title. If you haven’t seen this before, things between dollar signs is what’s called LaTeX, it’s basically how we can create typeset mathematical equations. So let’s run that. So here you can see the function, here you can see the title, I passed it, and here is our quadratic. So what we’re going to do is we’re going to imagine that we don’t know that’s the true mathematical function we’re trying to find.

[00:25:04]

It’s obviously much simpler than the function that figures out whether an image is a basset hound or not. We’re just going to start super simple. So this is the real function and we’re going to try to recreate it from some data.

Reconstructing the Quadratic Function

Now it’s going to be very helpful if we have an easier way of creating different quadratics. So I’ve defined a general form of a quadratic here with coefficients a, b and c and at some particular point x, it’s going to be ax² plus bx plus c. So let’s test that. So that’s for x equals 1.5, that’s 3x² plus 2x plus 1, which is the quadratic we did before. Now we’re going to want to create lots of different quadratics to test them out and find which one’s best. So this is a somewhat advanced but very, very helpful feature of Python that’s worth learning if you’re not familiar with it.

[00:26:05]

It’s used in a lot of programming languages, it’s called a partial application of a function. Basically I want this exact function, but I want to fix the values of a, b and c to pick a particular quadratic. And the way you fix the values of the function is you call this thing in Python called partial and you pass in the function and then you pass in the values that you want to fix. So for example, if I now say make quadratic 3, 2, 1, that’s going to create a quadratic equation with coefficients 3, 2, and 1. And you can see if I then pass in, so that’s now f, if I pass in 1.5, I get the exact same value I did before. So we’ve now got an ability to create any quadratic equation we want by passing in the parameters of the coefficients of the quadratic.
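The partial application described above can be sketched like this. The function names `quad` and `mk_quad` are assumptions for illustration; `partial` is from Python's standard `functools` module.

```python
from functools import partial

def quad(a, b, c, x):
    """General quadratic: a*x^2 + b*x + c evaluated at x."""
    return a * x**2 + b * x + c

def mk_quad(a, b, c):
    """Fix a, b and c, returning a function that only needs x."""
    return partial(quad, a, b, c)

# Fix the coefficients 3, 2, 1 to get one particular quadratic.
f = mk_quad(3, 2, 1)

# Calling f with just x gives the same value as the general form.
print(f(1.5))
print(quad(3, 2, 1, 1.5))
```

The design benefit is that anything that expects a function of x alone, such as a plotting helper, can be handed `f` without knowing about the coefficients.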

[00:27:01]

That gives us a function that we can then just call just like any normal function. So that only needs one thing now, which is the value of x, because the other 3, a, b and c are now fixed. So if we put that function, we’ll get exactly the same shape because it’s the same coefficients. So now I’m going to show an example of some data, some data that matches the shape of this function. But in real life, data is never exactly going to match the shape of a function, it’s going to have some noise. So here’s a couple of functions to add some noise. So you can see I’ve still got the basic functional form here, but this data is a bit dotted around it. The level to which you look at how I implemented these is entirely up to you, it’s not like super necessary, but it’s all stuff, the kind of things we use quite a lot.

[00:28:05]

So this is to create normally distributed random numbers. This is how we set the seed so that each time I run this I get the same random numbers. This one is actually particularly helpful. This creates a tensor, in this case a vector, that goes from negative 2 to 2 in equal steps and there’s 20 of them. That’s why there’s 20 steps along here. So then my y values is just f of x with this amount of noise added. So as I say, the details of that don’t matter too much. The main thing to know is we’ve got some random data now.
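As a rough stand-in for that data generation, here is a sketch using only the standard library. It is a simplification: the noise here is purely additive and the helper names, seed, and noise scale are assumptions, whereas the lecture's notebook uses its own tensor-based noise functions.

```python
import random

def linspace(start, stop, steps):
    """Return `steps` equally spaced values from start to stop inclusive."""
    step = (stop - start) / (steps - 1)
    return [start + i * step for i in range(steps)]

def add_noise(values, scale, rng):
    """Add normally distributed noise with the given standard deviation."""
    return [v + rng.gauss(0.0, scale) for v in values]

def f(x):
    """The 'true' quadratic we pretend not to know."""
    return 3 * x**2 + 2 * x + 1

# Seeding the generator means every run produces the same random numbers.
rng = random.Random(42)

xs = linspace(-2.0, 2.0, 20)                        # 20 points from -2 to 2
ys = add_noise([f(x) for x in xs], scale=1.5, rng=rng)

print(xs[:3])
print(ys[:3])
```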

Adding Noise to Data

The idea is now we’re going to try to reconstruct the original quadratic equation, find one which matches this data. So how would we do that?

[00:29:01]

Reconstructing the Quadratic with Interactive Sliders

What we can do is we can create a function called plotQuadratic that first of all plots our data as a scatterplot, and then it plots a function which is a quadratic, a quadratic we pass in. Now there’s a very helpful thing for experimenting in Jupyter notebooks, which is the @interact decorator. If you add it on top of a function, then it gives you these nice little sliders. So here’s an example of a quadratic with coefficients 1.5, 1.5, 1.5, and it doesn’t fit particularly well. So how would we try to make this fit better? I think what I’d do is I’d take the first slider and I would try moving it to the left and see if it looks better or worse. That looks worse to me, I think it needs to be more curvy, so let’s try the other way.

[00:30:00]

Yeah, that doesn’t look bad. Let’s do the same thing for the next slider. How about this way? No, I think that’s worse. Let’s try the other way. Okay, final slider. Try this way. No, that’s worse. This way. So you can see what we can do, we can basically pick each of the coefficients one at a time, try increasing a little bit, see if that improves it, try decreasing it a little bit, see if that improves it, find the direction that improves it, and then slide it in that direction a little bit. And then when we’re done, we can go back to the first one and see if we can make it any better. Now we’ve done that. And actually you can see that’s not bad, because I know the answer is meant to be 3, 2, 1, so they’re pretty close. And I wasn’t cheating, I promise. That’s basically what we’re going to do, that’s basically how those parameters are created. But we obviously don’t have time because the big fancy models have often hundreds of millions of parameters, we don’t have time to try a hundred million sliders, so we need something better.

[00:31:11]

Loss Function for Model Evaluation

Well the first step is we need a better idea of when I move it, is it getting better or is it getting worse? So if you remember back to Arthur Samuel’s description of machine learning that we learned about in Chapter 1 of the book and in Lesson 1, we need something we can measure, which is a number that tells us how good is our model. And if we had that, then as we moved these sliders, we could check to see whether it’s getting better or worse.

Mean Squared Error (MSE) Loss Function

So this is called a loss function. So there’s lots of different loss functions you can pick, but perhaps the most simple and common is mean-squared error, which is going to be, so it’s going to get our predictions and it’s got the actuals, and we’re going to get predictions minus actuals squared and take the mean.
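Mean-squared error as just described can be sketched in a few lines of plain Python. The three x values and the helper names are assumptions chosen to keep the numbers easy to check by hand, so the loss printed here will not match the value shown in the lecture, which uses the noisy 20-point data.

```python
def mse(preds, acts):
    """Mean of the squared differences between predictions and actuals."""
    return sum((p - a) ** 2 for p, a in zip(preds, acts)) / len(preds)

def quad(a, b, c, x):
    return a * x**2 + b * x + c

xs = [-1.0, 0.0, 1.0]

# Actuals come from the true quadratic; predictions from a first guess.
acts = [quad(3, 2, 1, x) for x in xs]
preds = [quad(1.5, 1.5, 1.5, x) for x in xs]

loss = mse(preds, acts)
print(loss)

# A perfect fit has a loss of exactly zero.
print(mse(acts, acts))
```

Squaring makes every difference positive, so errors in opposite directions cannot cancel each other out.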

[00:32:01]

So that’s mean-squared error. So if I now rerun the exact same thing I had before, but this time I’m going to calculate the loss, the MSE, between the values that we predict, f of x, remember where f is the quadratic we created, and the actuals, y. And this time I’m going to add a title to our function, which is the loss. So now let’s do this more rigorously. We’re starting at a mean-squared error of 11.46. So let’s try moving this to the left and see if it gets better. No, worse. So move it to the right. Somewhere around there.

[(Unidentified)]

Now let’s try this one.

[Jeremy Howard]

Best when I go to the right. What about c? 3.91. It’s getting worse.

[00:33:01]

So I keep going. Somewhere about there. And so now we can repeat that process.

Manual Optimization with Loss Function

So we’ve had each of A, B and C move a little bit, let’s go back to A, can I get any better than 3.28? Let’s try moving left. Yeah, left was a bit better. And for B, let’s try moving left. Worse.

[(Unidentified)]

Right was better. And finally C, move to the right. Definitely better.

[Jeremy Howard]

There we go. Okay, so that’s a more rigorous approach. It’s still manual, but at least we don’t have to rely on us to recognise does it look better or worse. So finally, we’re going to automate this.

Automating Optimization with Derivatives

So the key thing we need to know is for each parameter, when we move it up, does the loss get better?

[00:34:00]

Or when we move it down, does the loss get better? One approach would be to try it. We could manually increase the parameter a bit and see if the loss improves and vice versa. But there’s a much faster way.

Gradient Descent for Parameter Optimization

And the much faster way is to calculate its derivative. So if you’ve forgotten what a derivative is, no problem, there’s lots of tutorials out there. You can go to Khan Academy or something like that. But in short, the derivative is what I just said. The derivative is a function that tells you if you increase the input, does the output increase or decrease, and by how much. That’s called the slope or the gradient. Now the good news is, PyTorch can automatically calculate that for you. So if you went through horrifying months of learning derivative rules in year 11 and are worried you’re going to have to remember them all again, don’t worry, you don’t. You don’t have to calculate any of this yourself, it’s all done for you, watch this.
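To see what a derivative tells you without any calculus rules, here is a sketch that estimates the slope numerically by nudging the input a tiny amount in each direction. This is a swapped-in illustration (a central finite difference), not how PyTorch computes gradients; the helper name `slope` and the step size are assumptions.

```python
def f(x):
    return 3 * x**2 + 2 * x + 1

def slope(fn, x, eps=1e-6):
    """Estimate how much fn changes per unit change in x, near x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

# Positive slope: increasing x increases the output.
print(slope(f, 1.5))

# Negative slope: increasing x decreases the output.
print(slope(f, -2.0))
```

For this quadratic the exact derivative is 6x + 2, so the two estimates should come out very close to 11 and -10.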

[00:35:04]

Calculating Gradients in PyTorch

So the first thing to do is we need a function that takes the coefficients of the quadratic a, b, and c as inputs. I’m going to put them all in a list, you’ll see why in a moment, I kind of call them parameters. We create a quadratic passing in those parameters a, b, and c. This star on the front is a very, very common thing in Python. Basically it takes these parameters and spreads them out to turn them into a, b, and c and pass each of them to the function. So we’ve now got a quadratic with those coefficients. And then we return the mean squared error of our predictions against our actuals. So this is a function that’s going to take the coefficients of a quadratic and return the loss. So let’s try it. So if we start with a, b, and c of 1.5, we get a mean squared error of 11.46. It looks a bit weird, it says it’s a tensor, so don’t worry about that too much.

[00:36:10]

In short, in PyTorch, everything is a tensor. A tensor just means it doesn’t just work with numbers, it also works with lists or vectors of numbers, that’s called a 1D tensor. Rectangles of numbers, so tables of numbers, that’s called a 2D tensor. Layers of tables of numbers, that’s called a 3D tensor, and so forth. So in this case, this is a single number, but it’s still a tensor. That means it’s just wrapped up in the PyTorch machinery that allows it to do things like calculate derivatives. But it’s still just the number 11.46. So what I’m going to do is I’m going to create my parameters, a, b, and c, and I’m going to put them all in a single 1D tensor.

Creating a Tensor with Gradient Calculation

A 1D tensor is also known as a rank-1 tensor.

[00:37:01]

So this is a rank-1 tensor. And it contains the list of numbers 1.5, 1.5, 1.5. And then I’m going to tell PyTorch that I want you to calculate the gradient for these numbers whenever we use them in a calculation. And the way we do that is we just say requires_grad_(). So here is our tensor, it contains 1.5 three times. And it also tells us, we flagged it to say please calculate gradients for this particular tensor when we use it in calculations. So let’s now use it in a calculation. We’re going to pass it to that quad_mse, that’s the function we just created that gets the MSE, the mean squared error, for a set of coefficients. And not surprisingly, it’s the same number we saw before, 11.46. Not very exciting, but there is one thing that’s very exciting.

[00:38:00]

Backpropagation and Gradient Calculation

It has added an extra thing to the end called grad_fn. And this is the thing that tells us that if we wanted to, PyTorch knows how to calculate the gradients for our inputs. And to tell PyTorch, yes please, go ahead and do that calculation, you call backward on the result of your loss function. Now when I run it, nothing happens, or it doesn’t look like anything happens. But what does happen is it’s just added an attribute called grad, which is the gradient, to our inputs, abc. So if I run the cell, this tells me that if I increase a, the loss will go down. If I increase b, the loss will go down a bit less. If I increase c, the loss will go down. Now we want the loss to go down. So that means we should increase a, b and c. How much by? Well given that a says if you increase a even a little bit, the loss improves a lot, that suggests we’re a long way away from the right answer.
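The sequence just described (flag the parameters, compute the loss, call backward, read the gradients) can be sketched as follows. To keep it reproducible the data here is noise-free, which is an assumption, so the loss and gradient values will differ from the ones shown in the lecture; the function names are also chosen for illustration.

```python
import torch

def quad(a, b, c, x):
    return a * x**2 + b * x + c

def mse(preds, acts):
    return ((preds - acts) ** 2).mean()

# Noise-free data from the true quadratic (an assumption for this sketch).
x = torch.linspace(-2, 2, steps=20)
y = quad(3, 2, 1, x)

def quad_mse(params):
    # The star spreads the three parameters out into a, b and c.
    return mse(quad(*params, x), y)

# A rank-1 tensor of parameters, flagged so PyTorch tracks gradients.
abc = torch.tensor([1.5, 1.5, 1.5])
abc.requires_grad_()

loss = quad_mse(abc)   # the result carries a grad_fn
loss.backward()        # fills in abc.grad

print(loss)
print(abc.grad)
```

A negative gradient entry means that increasing that parameter reduces the loss, which is the situation described above for all three parameters.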

[00:39:07]

So we should probably increase this one a lot, this one the second most, and this one the third most.
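As a minimal sketch of the steps just described, here is the same sequence in PyTorch: flag the parameters with requires_grad, compute the loss, call backward, and read the grad attribute. The data is made up for illustration (noise-free points on 3x² + 2x + 1), so the printed loss will not be the 11.46 from the lesson, and the body of quad_mse is an assumption about how such a function could be written.

```python
import torch

# Hypothetical data standing in for the lesson's points: y = 3x^2 + 2x + 1, no noise.
x = torch.linspace(-2, 2, 20)
y = 3 * x**2 + 2 * x + 1

def quad_mse(params):
    # Mean squared error of the quadratic a*x^2 + b*x + c against the data.
    a, b, c = params
    preds = a * x**2 + b * x + c
    return ((preds - y) ** 2).mean()

# A rank-1 tensor holding a, b and c, flagged so PyTorch tracks gradients for it.
abc = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)

loss = quad_mse(abc)
print(loss)       # a single-number tensor that carries a grad_fn

loss.backward()   # fills in abc.grad
print(abc.grad)   # one slope per parameter; negative means "increase me to reduce the loss"
```

With this made-up data all three gradients come out negative, which matches the situation in the lesson where all three parameters should be increased.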

Adjusting Parameters Based on Gradients

So this is saying when I increase this parameter, the loss decreases. So in other words, we want to adjust our parameters A, B and C by the negative of these. We want to increase, increase, increase. So we can do that by saying let’s take our abc and do minus equals, so that means abc equals abc minus the gradient. But we’re going to scale the gradient down first, we don’t want to jump too far, so we’re just going to go a small distance. So we’re just going to somewhat arbitrarily multiply it by .01. So that is now going to create a new set of parameters which are going to be a little bit bigger than before, because we subtracted negative numbers.

[00:40:01]

Automating Gradient Descent Optimization

We can now calculate the loss again. So remember before it was 11.46, so hopefully it’s going to get better, yes it did, 10.11. There’s one extra line of code which we didn’t mention, which is with torch.no_grad(). Remember earlier on we said that the parameter abc has requires_grad set, and that means PyTorch will automatically calculate its derivative when it’s used in a function. Here it’s being used in a function, but we don’t want the derivative of this, this is not our loss. This is us updating the parameters. So this is basically the standard inner part of a PyTorch loop, and every neural net deep learning, pretty much every machine learning model, at least of this style, that you build basically looks like this. If you look deep inside fast.ai source code, you’ll see something that basically looks like this.

[00:41:05]

Gradient Descent Loop in PyTorch

So we could automate that, right? So let’s just take those steps, which is we’re going to calculate the mean squared error for our quadratic, call backward, and then subtract the gradient times a small number from the parameters. So let’s do it 5 times. So far we’re up to a loss of 10.1, so we’re going to calculate our loss, call .backward to calculate the gradients, and then with no_grad, subtract the gradients times a small number and print how we’re going. And there we go, the loss keeps improving. So we now have some coefficients, and there they are, 3.2, 1.9, 2.0, so they’re definitely heading in the right direction.
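The loop described above can be sketched roughly like this. It again uses made-up, noise-free data, so the losses will differ from the lesson’s numbers. One line here is an addition that this passage does not mention: abc.grad.zero_() resets the gradient after each step, because PyTorch otherwise accumulates gradients across calls to backward.

```python
import torch

# Hypothetical data: y = 3x^2 + 2x + 1, no noise.
x = torch.linspace(-2, 2, 20)
y = 3 * x**2 + 2 * x + 1

def quad_mse(params):
    a, b, c = params
    return ((a * x**2 + b * x + c - y) ** 2).mean()

abc = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
lr = 0.01        # the "small number" from the lesson
losses = []

for i in range(5):
    loss = quad_mse(abc)
    loss.backward()                  # compute gradients of the loss w.r.t. abc
    with torch.no_grad():            # the update itself must not be tracked
        abc -= abc.grad * lr         # step against the gradient
        abc.grad.zero_()             # assumption: reset so gradients do not accumulate
    losses.append(loss.item())
    print(f"step={i}; loss={loss.item():.2f}")
```

Each printed loss should be smaller than the one before it, which is the "loss keeps improving" behaviour described in the lesson.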

[00:42:20]

Optimization and Gradient Descent

So that’s basically how we do, it’s called optimization. So you’ll hear a lot in deep learning about optimizers. This is the most basic kind of optimizer, but they’re all built on this principle, it’s called gradient descent. And you can see why it’s called gradient descent. We calculate the gradients and then do a descent, which is we’re trying to decrease the loss. So, believe it or not, that’s the entire foundations of how we create those parameters. So we need one more piece, which is what is the mathematical function that we’re finding parameters for.

[00:43:01]

Rectified Linear Unit (ReLU) Function

We can’t just use quadratics, because it’s pretty unlikely that the relationship between parameters and whether a pixel is part of a basset hound is a quadratic, it’s going to be something much more complicated. No problem. It turns out that we can create an infinitely flexible function from this one tiny thing. This is called a rectified linear unit.

Plotting the ReLU Function

The first piece I’m sure you’ll recognize, it’s a linear function. We’ve got our output y, our input x, and coefficients m and b. This is even simpler than our quadratic, and this is a line. Then torch.clip is a function that takes the output y and if it’s less than that number, it turns it into that number. So in other words, this is going to take anything that’s negative and make it zero.

[00:44:03]

So this function is going to do two things. Calculate the output of the line, and if it’s smaller than zero, it will make it zero. So that’s rectified linear. So let’s use partial to take that function and set m and b to 1 and 1, so this function here will be y equals x plus 1, followed by this torch.clip. And here’s the shape. As you would expect, it’s a line until it gets under zero, it becomes a horizontal line.
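Here is a small sketch of that function using plain Python numbers. The built-in max stands in for torch.clip(y, 0.), which does the same thing element by element on a tensor, and the name rectified_linear is an assumption for illustration.

```python
from functools import partial

def rectified_linear(m, b, x):
    # Step 1: the output of the line. Step 2: anything below zero becomes zero.
    y = m * x + b
    return max(y, 0.0)  # plain-Python stand-in for torch.clip(y, 0.)

# Fix m=1 and b=1, leaving a function of x only: y = x + 1, then clipped at zero.
relu_1_1 = partial(rectified_linear, 1.0, 1.0)

print([relu_1_1(x) for x in (-3.0, -1.0, 0.0, 2.0)])  # [0.0, 0.0, 1.0, 3.0]
```

The output is flat at zero for x below -1 and follows the line x + 1 after that.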

Interactive ReLU Function with Sliders

So we can now do the same thing. We can take this plot function and make it interactive using interact. We can see what happens when we change these two parameters, m and b. So we’re now plotting the rectified linear and fixing m and b.

[00:45:02]

So m is the slope and b is the shift up and down. So that’s how those work.

Double ReLU Function

Now why is this interesting? Well it’s not interesting of itself. But what we could do is we could take this rectified linear function and create a double ReLU, which adds up two rectified linear functions together. So there’s some slope m1 b1, some second slope m2 b2, we’re going to calculate it at some point x. So let’s take a look at what that function looks like if we plot it. You can see what happens is we get this downward slope and then a hook and then an upward slope. So if I change m1, it’s going to change the slope of that first bit, and b1 is going to change its position.

[00:46:09]

And I’m sure you won’t be surprised to hear that m2 changes the slope of the second bit and b2 changes that location. Now this is interesting, why?

Arbitrarily Squiggly Functions with Multiple ReLU’s

Because we don’t just have to do a double ReLU, we can add as many ReLUs together as we want. And if we add as many ReLUs together as we want, then we can have an arbitrarily squiggly function and with enough ReLUs we can match it as close as we want. So you could imagine an incredibly squiggly, I don’t know, like an audio waveform of me speaking, and if I gave you 100 million ReLUs to add together, you could almost exactly match that. Now we want functions that are not just that we plot in 2D, we want things that can have more than one input, but you can add these together across as many dimensions as you like.

[00:47:10]

Multi-Dimensional ReLU Functions

And so exactly the same thing will give you a ReLU over surfaces, or a ReLU over 3D, 4D, 5D, and so forth. And it’s the same idea.
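Going back to the double ReLU described above, a minimal sketch with plain Python numbers looks like this. The parameter values are hypothetical, picked only to reproduce the "downward slope, hook, upward slope" shape.

```python
def rectified_linear(m, b, x):
    # A line m*x + b with negative outputs replaced by zero.
    return max(m * x + b, 0.0)

def double_relu(m1, b1, m2, b2, x):
    # Two rectified linear functions added together.
    return rectified_linear(m1, b1, x) + rectified_linear(m2, b2, x)

# Hypothetical parameters: the first ReLU slopes down to the left of x = -1,
# the second slopes up to the right of x = -1.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0]
ys = [double_relu(-1.5, -1.5, 1.5, 1.5, x) for x in xs]
print(ys)  # [3.0, 1.5, 0.0, 1.5, 3.0]
```

Adding more rectified_linear terms in the same way adds more bends to the curve, which is the sense in which a sum of ReLUs can be made as squiggly as needed.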

Constructing Arbitrarily Accurate Models

With this incredibly simple foundation, you can construct an arbitrarily accurate, precise model. The problem is, you need some numbers for them, you need parameters.

Gradient Descent for Parameter Optimization

Oh, no problem, we know how to get parameters, we use gradient descent.

Deriving Deep Learning

So believe it or not, we have just derived deep learning. Everything, from now on, is tweaks to make it faster and make it need less data.

[00:48:09]

This is it.

Deep Learning as “Drawing the Owl”

I remember a few years ago when I said something like this in a class, somebody on the forum was like, this reminds me of that thing about how to draw an owl. Jeremy is basically saying, step 1, draw two circles, step 2, draw the rest of the owl. The thing I find I have a lot of trouble explaining to students is when it comes to deep learning, there’s nothing between these two steps. When you have ReLUs getting added together, and gradient descent to optimize the parameters, and samples of inputs and outputs that you want, the computer draws the owl. That’s it. So we’re going to learn about all these other tweaks, and they’re all very important.

Deep Learning as Function Fitting with Gradient Descent

But when you come down to trying to understand something in deep learning, just try to keep coming back to remind yourself of what it’s doing, which is using gradient descent to set some parameters to make a wiggly function, which is basically the addition of lots of rectified linear units, or something very similar to that, match your data.

[00:49:22]

Forum Questions and Answers

So we’ve got some questions on the forum. So a question from Zakia with 6 upvotes. So for those of you watching the video, what we do in the lesson is we want to make sure that the questions that you hear answered are the ones that people really care about. So we pick the ones that get the most upvotes. This question is, is there perhaps a way to try out all the different models and automatically find the best-performing one?

Automating Model Selection

Yes, absolutely you can do that.

[00:50:00]

So if we go back to our training script, remember there’s this thing called list_models and it’s a list of strings. So you can easily add a for loop around this that basically goes for architecture in timm.list_models() and you can do the whole lot, which would be like that. And then you could do that and away you go. It’s going to take a long time for 500 and something models. So generally speaking, I’ve never done anything like that myself. I would rather look at a picture like this and say, where am I? The vast majority of the time, this is something, this would be the biggest, I reckon number one mistake of beginners I see, is that they jump to these models from the start of a new project.

Importance of Starting with Simple Models

At the start of a new project, I pretty much only use ResNet-18 because I want to spend all of my time trying things out.

[00:51:07]

I’m going to try different data augmentation, I’m going to try different ways of cleaning the data, I’m going to try different external data that I can bring in. So I want to be trying lots of things and I want to be able to try it as fast as possible. So trying better architectures is the very last thing that I do. And what I do is once I’ve spent all this time and I’ve got to the point where I’ve got my ResNet-18, or maybe ResNet-34 because it’s nearly as fast, and I’m like how accurate is it, how fast is it, do I need it more accurate for what I’m doing, do I need it faster for what I’m doing, could I accept some trade-off to make it a bit slower, to make it more accurate, and so then I’ll have a look and say I kind of need to be somewhere around .001 seconds

[00:52:03]

and so I’ll try a few of these. So that would be how I would think about that.

Determining if You Have Enough Data

Next question from the forum is around how do I know if I have enough data? What are some signs that indicate my problem needs more data? I think it’s pretty similar to the architecture question. So you’ve got some amount of data. Presumably you’ve started using all the data that you have access to. You’ve built your model, you’ve done your best, is it good enough? Do you have the accuracy that you need for whatever it is you’re doing? You can’t know until you’ve trained the model, but as you’ve seen, it only takes a few minutes to train a quick model. So my very strong opinion is that the vast majority of projects I see in industry wait far too long before they train their first model.

[00:53:07]

Importance of Early Model Training

My opinion, you want to train your first model on day 1 with whatever CSV files or whatever that you can hack together. You might be surprised that none of the fancy stuff you’re thinking of doing is necessary because you already have a good enough accuracy for what you need. Or you might find quite the opposite. You might find that, oh my god, we’re basically getting no accuracy at all, maybe it’s impossible. These are things you want to know at the start, not at the end. We’ll learn lots of techniques both in this part of the course and in Part 2 about ways to really get the most out of your data.

Techniques for Data Augmentation and Semi-Supervised Learning

In particular, there’s a reasonably recent technique called semi-supervised learning which actually lets you get dramatically more out of your data. We’ve also started talking already about data augmentation, which is a classic technique you can use.

[00:54:00]

Importance of Labeled Data

So generally speaking, it depends how expensive it’s going to be to get more data, but also what do you mean when you say get more data, do you mean more labeled data? Often it’s easy to get lots of inputs and hard to get lots of outputs. For example, in medical imaging, where I’ve spent a lot of time, it’s generally super easy to jump into the radiology archive and grab more CT scans, but it might be very difficult and expensive to draw segmentation masks and pixel boundaries and so forth on them. So often you can get more, in this case images, or text, or whatever, and maybe it’s harder to get labels. Again, there’s a lot of stuff you can do, things like we’ll discuss semi-supervised learning to actually take advantage of unlabeled data as well.

Understanding Gradient Units and Learning Rate

Final question here, in the quadratic example where we calculated the initial derivatives for a, b, and c, we got values of minus 10.8, minus 2.4, etc.

[00:55:06]

What unit are these expressed in? Why don’t we adjust our parameters by these values themselves? I guess the question here is why are we multiplying it by a small number, which in this case is .01. Okay, let’s take those two parts of the question.

Gradient Units and Interpretation

What’s the unit here? The unit is the change in the loss for each increase in a of 1. So if we increase a from, in this case, 1.5 to 2.5, what would happen to the loss? The answer is it would go down by 10.9887. Now that’s not exactly right, because the gradient only describes an infinitely small region, and the function is actually going to be curved.

[00:56:05]

If it stayed at that slope, that’s what would happen. So if we increased b by 1, and the slope stayed the same, the loss would decrease by 2.122. So why would we not just change it directly by these numbers?
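To make the "only exactly right in an infinitely small space" point concrete, here is a small sketch with a hypothetical one-parameter loss, (a - 4)², chosen only because its slope is easy to check by hand. It compares the change the slope predicts with the change that actually happens, for a step of 1 and for a step of 0.01.

```python
def loss(a):
    # Hypothetical curved loss with its minimum at a = 4.
    return (a - 4.0) ** 2

def slope(a):
    # Derivative of (a - 4)^2 with respect to a.
    return 2.0 * (a - 4.0)

a = 1.5
g = slope(a)  # -5.0: "increase a by 1 and the loss drops by 5, if the slope stays the same"

for step in (1.0, 0.01):
    predicted = g * step                 # change implied by the slope alone
    actual = loss(a + step) - loss(a)    # change that really happens on the curve
    print(f"step={step}: predicted {predicted:.4f}, actual {actual:.4f}")
```

For the step of 1 the slope predicts a drop of 5 but the loss only drops by 4, because the curve flattens along the way. For the step of 0.01 the prediction and the real change are almost identical.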

Why Multiply Gradients by a Small Number

The reason is that if we have some function that we’re fitting, and there’s some kind of interesting theory that says that once you get close enough to the optimal value, all functions look like quadratics anyway.

[00:57:02]

So we can kind of safely draw it in this kind of shape, because this is what they end up looking like if you get close enough. And let’s say we’re way out over here. So we’re measuring, I used my daughter’s favorite pens, the nice sparkly ones, so we’re measuring the slope here. That’s a very steep slope. So that seems to suggest we should jump a really long way. So we jump a really long way, and what happened? Well we jumped way too far. The reason is that that slope decreased as we moved along. And so that’s generally what’s going to happen, particularly as you approach the optimal, is generally the slope is going to decrease.

Learning Rate as a Hyperparameter

So that’s why we multiply the gradient by a small number. And that small number is a very, very, very important number.

[00:58:00]

It has a special name, it’s called the learning rate. This is an example of a hyperparameter. It’s not a parameter, it’s not one of the actual coefficients of your function, but it’s a parameter you use to calculate the parameters, it’s a hyperparameter. And so it’s something you have to pick. Now we haven’t picked any yet in any of the stuff we’ve done that I remember, and that’s because Fast.ai generally picks reasonable defaults for most things. But later in the course, we will learn about how to try and find really good learning rates. And you will find sometimes you need to actually spend some time finding a good learning rate.

Impact of Learning Rate on Optimization

You could probably understand the intuition here. If you pick a learning rate that’s too big, you’ll jump too far, and so you’ll end up way over here.

[00:59:02]

And then you will try to then jump back again, and you’ll jump too far the other way, and you’ll actually diverge. So if you ever see when your model is training that it’s getting worse and worse, it probably means your learning rate is too big. What would happen on the other hand if you pick a learning rate that’s too small, then you’re going to take tiny steps. And of course the flatter it gets, the smaller the steps it’s going to get. And so you’re going to get very, very bored. So finding the right learning rate is a compromise between the speed at which you find the answer and the possibility that you’re actually going to shoot past it and get worse and worse. So one of the bits of feedback I got quite a lot in the survey is that people want a break halfway through, which I think is a good idea.
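The effect of the learning rate can be sketched on a hypothetical one-parameter loss, x², whose slope is 2x. The function name descend and the particular learning rates are assumptions for illustration.

```python
def descend(lr, steps=10, start=5.0):
    # Gradient descent on the hypothetical loss x**2, whose gradient is 2*x.
    x = start
    for _ in range(steps):
        grad = 2.0 * x
        x -= lr * grad
    return x

print(descend(lr=1.1))    # too big: overshoots back and forth and ends up further from 0
print(descend(lr=0.001))  # too small: after 10 steps it has barely moved from 5.0
print(descend(lr=0.3))    # reasonable: lands very close to the minimum at 0
```

The first case is the divergence described in the lesson, where each jump overshoots by more than the last.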

Break Time

So I think now’s a good time to have a break. So let’s come back in 10 minutes at 25 past 7.

[01:00:14]

Okay, I hope you had a good rest, had a good break I should say.

Matrix Multiplication for Efficient Computation

So I want to now show you a really, really important mathematical computational trick, which is we want to do a whole bunch of values. So we’re going to be wanting to do a whole lot of mx plus b’s, and we don’t just want to do mx plus b, we’re going to want to have lots of variables. So for example, every single pixel of an image would be a separate variable, so we’re going to multiply every single one of those times some coefficient, and then add them all together, and then do the crop, the ReLU, and then we’re going to do it a second time with a second bunch of parameters, and then a third time, and a fourth time, and a fifth time.

[01:01:08]

It’s going to be pretty inconvenient to write out 100 million ReLUs.

Matrix Multiplication as a Key Operation in Deep Learning

But as it happens, there’s a single mathematical operation that does all of those things for us, except for the final replace negatives with zeros, and it’s called matrix multiplication. I expect everybody at some point did matrix multiplication at high school, I suspect also a lot of you have forgotten how it works. When people talk about linear algebra in deep learning, they give the impression you need years of graduate school study to learn all this linear algebra. You don’t. Actually, all you need almost all the time is matrix multiplication, and it couldn’t be simpler.

Visualizing Matrix Multiplication

I’m going to show you a couple of different ways. The first is there’s a really cool site called matrixmultiplication.xyz, you can put in any matrix you want.

[01:02:05]

So this matrix is saying I’ve got 3 rows of data with 3 variables, so maybe they’re tiny images of 3 pixels, and the values of the first one are 1, 2, 1, the second 0, 1, 1, and the third 2, 3, 1. So those are our 3 rows of data. These are our sets of 3 coefficients. So we’ve got a, b, and c in our data, so I guess we’d call it x1, x2, and x3. And here’s our first set of coefficients a, b, and c, 2, 6, and 1. And then our second set is 5, 7, and 8. So here’s what happens when we do matrix multiplication. That second matrix here of coefficients gets flipped around and we do, this is the multiplications and additions that I mentioned, so multiply, add, multiply, add, multiply, add.

[01:03:05]

So that’s going to give you the first number, because that is the left-hand column of the second matrix times the first row, so that gives you the top-left result. So the next one is going to give us 2 results. So we’ve got now the right-hand one with the top row, and the left-hand one with the second row. Keep going down, keep going down, and that’s it. That’s what matrix multiplication is, it’s multiplying things together and adding them up. So there would be one more step to do to make this a layer of a neural network, which is if this had any negatives, replace them with zeros. That’s why matrix multiplication is the critical foundational mathematical operation in basically all of deep learning.
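The multiply-and-add pattern just described can be written out by hand in a few lines. This sketch uses the numbers from the example above, with each column of the second matrix holding one set of coefficients. The helper name matmul is an assumption for illustration.

```python
# Three rows of data, three variables each.
data = [[1, 2, 1],
        [0, 1, 1],
        [2, 3, 1]]

# Two sets of coefficients, stored as the two columns: (2, 6, 1) and (5, 7, 8).
coeffs = [[2, 5],
          [6, 7],
          [1, 8]]

def matmul(a, b):
    # Each output cell is one row of `a` times one column of `b`, multiplied and summed.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

result = matmul(data, coeffs)
print(result)  # [[15, 27], [7, 15], [23, 39]]

# The extra step that turns this into a neural network layer: replace negatives with zeros.
activated = [[max(v, 0) for v in row] for row in result]
```

The top-left value, 15, is 1×2 + 2×6 + 1×1: the first row of data with the first set of coefficients. None of these results happen to be negative, so the final step leaves them unchanged here.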

[01:04:00]

GPU Tensor Cores and Matrix Multiplication

So the GPUs that we use, the thing that they are good at is this, matrix multiplication. They have special cores called tensor cores, which can basically only do one thing, which is to multiply together two 4×4 matrices. And then they do that lots of times with bigger matrices. So I’m going to show you an example of this.

Building a Machine Learning Model in a Spreadsheet

We’re actually going to build a complete machine learning model on real data in a spreadsheet. Fast.ai has become kind of famous for a number of things, and one of them is using spreadsheets to create deep learning models.

Fast.ai and Spreadsheet Deep Learning

We haven’t done it for a couple of years, so I’m pretty pumped to show this to you. What I’ve done is I went over to Kaggle, where there’s a competition I actually helped create many years ago called Titanic, and it’s like an ongoing competition, so 14,000 people have entered it so far.

[01:05:13]

Titanic Kaggle Competition

It’s just a competition for a bit of fun, there’s no end date. And the data for it is the data about who survived and who didn’t from the real Titanic disaster. And so I clicked here on the download button to grab it on my computer, that gave me a CSV, which I opened up in Excel. The first thing I did then was I just removed a few columns that clearly were not going to be important. Things like the name of the passengers, the passenger ID, just to try to make it a bit simpler. So I’ve ended up with each row of this is one passenger, the first column is the dependent variable.

[01:06:02]

Titanic Data and Dependent Variable

The dependent variable is the thing we’re trying to predict, did they survive. And the remaining are some information such as what class of the boat, 1st, 2nd or 3rd class, their sex, their age, how many siblings were in the family, the number of parents and children (you should always look for a data dictionary to find out what each column means), what their fare was, and which of the 3 cities they embarked from, Cherbourg, Queenstown, Southampton. So there’s that data. Now when I first grabbed it, I noticed that there were some people with no age. Now there’s all kinds of things we could do for that, but for this purpose, I just decided to remove them. And I found the same thing for embarked, I removed the blanks as well.

[01:07:03]

But that left me with nearly all of the data. So then I’ve put that over here, here’s our data with those rows removed.

Data Preparation and Feature Engineering

So these are the columns that came directly from Kaggle. So basically what we now want to do is we want to multiply each of these by a coefficient. How do you multiply the word male by a coefficient? And how do you multiply s by a coefficient? You can’t. So I converted all of these to numbers.

Converting Categorical Variables to Numbers

Male and female are very easy. I created a column called IsMale. And as you can see, there’s just an if statement that says if sex is male, then it’s 1, otherwise it’s 0. And we can do something very similar for embarked. We can have one column called Did They Embark in Southampton, same deal, and another column for Did They Embark in Cherbourg.

[01:08:12]

And their Pclass is 1, 2 or 3, which is a number, but it’s not really a continuous measurement of something. There isn’t 1 or 2 or 3 things, they’re different levels. So I decided to turn those into similar things, these are called binary categorical variables. So are they first class and are they second class? So that’s all that. The other thing that I was thinking, well then I kind of tried it and checked out what happened. What happened was, I created some random numbers, so to create the random numbers I just went =RAND(), and I copied those to the right.
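The conversion of the text columns into 0/1 columns described above can be sketched like this. The three sample passengers are made up, and the output column names are assumptions loosely following the spreadsheet.

```python
# Hypothetical sample rows using the Kaggle column names.
rows = [
    {"Sex": "male",   "Embarked": "S", "Pclass": 3},
    {"Sex": "female", "Embarked": "C", "Pclass": 1},
    {"Sex": "female", "Embarked": "Q", "Pclass": 2},
]

def encode(row):
    # Each category becomes a column that is 1 if it applies and 0 otherwise.
    return {
        "IsMale":   1 if row["Sex"] == "male" else 0,
        "Embark_S": 1 if row["Embarked"] == "S" else 0,
        "Embark_C": 1 if row["Embarked"] == "C" else 0,
        "Pclass_1": 1 if row["Pclass"] == 1 else 0,
        "Pclass_2": 1 if row["Pclass"] == 2 else 0,
    }

encoded = [encode(r) for r in rows]
for e in encoded:
    print(e)
```

Note that there is no column for Queenstown or for third class: a passenger with zeros in both embarkation columns must have embarked at Queenstown, and zeros in both class columns means third class, so one level of each variable can be left out.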

[01:09:01]

Initializing Parameters with Random Numbers

And then I just went copy and I went paste values. So that gave me some random numbers. Before I said a, b and c, let’s just start them at 1.5, 1.5, 1.5, what we do in real life is we start our parameters at random numbers that are a bit more or a bit less than 0. So these are random numbers, actually I slightly lied, I didn’t use RAND(), I used RAND()-0.5. That way I got small numbers that were on either side of 0. So then when I took each of these and I multiplied them by our fares and ages and so forth, what happened was that these numbers here are way bigger than these numbers here.

Normalizing Data for Consistent Scale

And so in the end, all that mattered was what was their fare, because they were just bigger than everything else.

[01:10:00]

So I wanted everything to basically go from 0 to 1, these numbers were too big. So what I did up here is I just grabbed the maximum of this column, the maximum of all the fares is 512, actually I’ll do age first, I’ll do maximum of age, because it’s a similar thing, there’s 80-year-olds and there’s 2-year-olds. And so over here I just did, what’s their age divided by the maximum, and so that way all of these are between 0 and 1, just like all of these are between 0 and 1. So that’s how I fix, this is called normalizing the data. We haven’t done any of these things when we’ve done stuff with Fast.ai, that’s because Fast.ai does all of these things for you, and we’ll learn about how. But all these things are being done behind the scenes.

[01:11:01]

For fare, I did something a bit more, which is I noticed there’s lots of very small fares and there’s also a few very big fares, like $70 and then $7, $7.

Log Transformation for Skewed Data

Having a few really big numbers and lots of really small numbers is really common with money. Money kind of follows this relationship where a few people have lots of it and they spend huge amounts of it and most people don’t have heaps. If you take the log of something that has that kind of extreme distribution, you end up with something that’s much more evenly distributed. So I’ve added this here called logfare, as you can see. And these are all around 1, which isn’t bad. I could have normalized that as well, but I was too lazy, I didn’t bother. So at this point you can now see that if we start from here, all of these are all around the same kind of level, so none of these columns are going to saturate the others.
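Both scaling steps can be sketched in a few lines. The sample ages and fares are made up, and the exact log formula is an assumption: this sketch uses a base-10 log of the fare plus 1, where the plus 1 keeps a fare of zero from breaking the log.

```python
import math

# Hypothetical sample values.
ages = [22.0, 38.0, 2.0, 80.0]
fares = [7.25, 71.28, 7.92, 512.33]

# Normalizing: divide by the column maximum so every value lands between 0 and 1.
max_age = max(ages)
age_norm = [a / max_age for a in ages]

# Log transform for the skewed column (assumed form: log10 of fare + 1).
log_fare = [math.log10(f + 1) for f in fares]

print(age_norm)
print(log_fare)
```

After the transform the largest fare is only about three times the smallest, instead of about seventy times, so it no longer swamps the other columns.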

[01:12:12]

Ensuring Consistent Data Scale

So now I’ve got my coefficients, which as I said are just random, and so now I need to basically calculate ax1 plus bx2 plus cx3 plus blah blah blah blah blah blah.

Calculating Linear Model Predictions

And so to do that, you can use SUMPRODUCT in Excel, I could have typed it out by hand, it would be very boring, but SUMPRODUCT is just going to multiply each of these. This one will be multiplied by this one, this one will be multiplied by this one, and so forth, and then they get all added together. Now one thing, if you’re eagle-eyed, you might be wondering, is in a linear equation, we have y equals mx plus b.

[01:13:02]

Constant Term in Linear Equations

At the end, there’s this constant term. And I do not have any constant term, I’ve got something here called const, but I don’t have any plus at the end. How’s that working? Well there’s a nice trick that we pretty much always use in machine learning, which is to add a column of data just containing the number 1 every time.

Adding a Constant Column for Bias

If you have a column of data containing the number 1 every time, then that parameter becomes your constant term. So you don’t have to have a special constant term, and so it makes our code a little bit simpler when you do it that way. It’s just a trick, but everybody does it. So this is now the result of our linear model.
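Here is a tiny sketch of the constant-column trick. The helper sumproduct mirrors what the Excel function does, and the numbers are made up.

```python
def sumproduct(xs, ws):
    # Multiply matching entries and add the results, like Excel's SUMPRODUCT.
    return sum(x * w for x, w in zip(xs, ws))

m, b = 2.0, 0.5   # hypothetical slope and constant term
x = 3.0

explicit = m * x + b                       # the usual y = mx + b
with_const = sumproduct([x, 1.0], [m, b])  # same thing: the 1 column picks up b

print(explicit, with_const)  # 6.5 6.5
```

Because the extra input is always 1, its coefficient is added unchanged to every row, which is exactly what a constant term does.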

Regression Model Predictions and Loss Calculation

I’m not even going to do ReLU, I’m just going to do a plain regression. Now if you’ve done regression before, you might have learned about it as something you can solve with various matrix things.

[01:14:01]

But in fact, you can solve a regression using gradient descent. So I’ve just gone ahead and created a loss for each row. And so the loss is going to be equal to our prediction minus whether they survived squared. So this is going to be our squared error, and there they all are, our squared errors. And so here I’ve just summed them up. I could have taken the mean, I guess that would have been a bit easier to think about, but sum is going to give us the same result. So here’s our loss. And so now we need to optimize that using gradient descent.

Using Excel Solver for Gradient Descent Optimization

So Microsoft Excel has a gradient descent optimizer in it called Solver. So I’ll pick Solver. And it will say what are you trying to optimize, it’s this one here, and I’m going to do it by changing these cells here, and I’m trying to minimize it.

[01:15:02]

And so we’re starting at a loss of 55.78. Actually, let’s change it to mean as well. So start at 1.03, optimize that, and there we go, it’s gone from 1.03 to .1. We can check the predictions. So the first one, it predicted exactly correctly, it didn’t survive and would predict it wouldn’t survive. Ditto for this one, it’s very close. You can start to see a few issues here, like sometimes it’s predicting less than 0, and sometimes it’s predicting more than 1.

[01:16:02]

Wouldn’t it be cool if we had some way of constraining it to between 0 and 1, and that’s an example of some of the things we’re going to learn about that make this stuff work a little bit better. But you can see it’s doing an okay job.

Regression vs. Neural Network

This is not deep learning, this is not a neural net yet, this is just a regression. So to make it into a neural net, we need to do it multiple times.

Creating a Two-Layer Neural Network

So I’m just going to do it twice. So now rather than one set of coefficients, I’ve got two sets. Again, I just put in random numbers. Other than that, all the data is the same. So now I’m going to have my SUMPRODUCT again, so the first SUMPRODUCT is with my first set of coefficients, and my second SUMPRODUCT is with my second set of coefficients. So I’m just calling them linear1 and linear2. Now there’s no point adding those up together, because if you add up two linear functions together, you get another linear function.

[01:17:02]

Importance of Non-Linearity in Neural Networks

We want to get all those wiggles, right? So that’s why we have to do our ReLU.

Implementing ReLU in Excel

So in Microsoft Excel, ReLU looks like this. If the number is less than 0, use 0, otherwise use the number. So that’s how we’re going to replace the negatives with zeros. And then finally, if you remember from our spreadsheet, we have to add them together.

Adding Layers and Calculating Predictions

So we add the values together. So that’s going to be our prediction. And then our loss is the same as the other sheet, it’s just survived minus prediction squared. And let’s change that to average. Okay, so let’s try solving that.
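The whole calculation for this small network can be sketched in plain Python: two sum-products, a ReLU on each, add them, then average the squared errors. All the numbers here are made up (two rows with three inputs, the last being the constant 1), and the function names are assumptions for illustration.

```python
def sumproduct(xs, ws):
    # Multiply matching entries and add the results, like Excel's SUMPRODUCT.
    return sum(x * w for x, w in zip(xs, ws))

def relu(v):
    # If the number is less than 0, use 0, otherwise use the number.
    return max(v, 0.0)

def predict(row, coeffs1, coeffs2):
    linear1 = sumproduct(row, coeffs1)   # first set of coefficients
    linear2 = sumproduct(row, coeffs2)   # second set of coefficients
    return relu(linear1) + relu(linear2)

def mean_squared_error(rows, targets, coeffs1, coeffs2):
    errors = [(t - predict(r, coeffs1, coeffs2)) ** 2 for r, t in zip(rows, targets)]
    return sum(errors) / len(errors)

# Hypothetical data and starting coefficients.
rows = [[1.0, 0.5, 1.0], [0.0, 0.2, 1.0]]
targets = [0.0, 1.0]   # survived or not
coeffs1 = [0.5, -1.0, -0.25]
coeffs2 = [-0.5, 1.0, 0.5]

print(predict(rows[0], coeffs1, coeffs2))
print(mean_squared_error(rows, targets, coeffs1, coeffs2))
```

With these made-up coefficients the first sum-product is negative for both rows, so the ReLU zeroes it out and only the second one contributes to the prediction. An optimizer such as Solver would then adjust both sets of coefficients to bring the mean squared error down.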

Optimizing the Neural Network with Solver

Optimize, ah1, and this time we’re changing all of those.

[01:18:02]

Solve. So this is using gradient descent. Excel Solver is not the fastest at all, but it gets the job done. Okay, let’s see how we went. .08 for our deep learning model versus .1 for our regression. So it’s a bit better. So there you go. So we’ve now created our first deep learning neural network from scratch.

Deep Learning in Microsoft Excel

And we did it in Microsoft Excel, everybody’s favorite artificial intelligence tool. So that was a bit slow and painful. It would be a bit faster and easier if we used matrix multiplication, so let’s finally do that.

Matrix Multiplication for Efficient Neural Network Calculation

So this next one is going to be exactly the same as the last one, but with matrix multiplication. So all our data looks the same. You’ll notice the key difference now is our parameters have been transposed. So before I had the parameters matching the data in terms of being in columns.

[01:19:02]

For matrix multiplication, the expectation is the way matrix multiplication works is that you have to transpose this. So the rows and columns are the opposite way around.

Transposing Parameters for Matrix Multiplication

Other than that, it’s the same. I just copied and pasted the random numbers so we have exactly the same starting point. So now this entire thing here is a single function, which is matrix multiply all of this by all of this.

Matrix Multiplication for Neural Network Predictions

So when I run that, it fills in exactly the same numbers.

[(Unidentified)]

So now we can optimize that, make that a minimum by changing these.

[01:20:08]

It should get the same number.

[Jeremy Howard]

Matrix Multiplication as a Single Operation

So that’s just another way of doing the same thing. It takes a surprisingly long time, at least for me, to get an intuitive feel for matrix multiplication as a single mathematical operation. So I still find it helpful to remind myself it’s just doing these sum products and additions. That is a deep learning neural network in Microsoft Excel.
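To make the "it's just sum products" point concrete, here is a small plain-Python sketch with made-up numbers. The `matmul` helper is written out by hand for illustration. The parameters are transposed so that each set of coefficients becomes a column, and the matrix product reproduces the two sum products for every row.

```python
def sumproduct(coeffs, row):
    """Multiply matching entries and add them up, like Excel's SUMPRODUCT."""
    return sum(c * x for c, x in zip(coeffs, row))

def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix, both given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Made-up data: two rows of three features each.
rows = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
coeffs1 = [0.5, -1.0, 0.25]
coeffs2 = [-0.5, 2.0, 1.0]

# Transposed parameters: one row per feature, one column per set of coefficients.
params_t = [list(pair) for pair in zip(coeffs1, coeffs2)]

via_matmul = matmul(rows, params_t)
by_hand = [[sumproduct(coeffs1, r), sumproduct(coeffs2, r)] for r in rows]

print(via_matmul)
print(by_hand)
```

In NumPy or PyTorch the hand-written `matmul` would normally be replaced by the `@` operator on arrays or tensors.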

[01:21:03]

Titanic Kaggle Competition as a Learning Exercise

The Titanic Kaggle competition, by the way, is a pretty fun learning competition. If you haven’t done much machine learning before, then it’s certainly worth trying out just to get the feel for how these all get put together.

Chapter 4 of the Book and Course Differences

So the chapter of the book that this lesson goes with is Chapter 4. Chapter 4 of the book is the chapter where we lose the most people because, to be honest, it’s hard. But part of the reason it’s hard is I couldn’t put this into a book. So we’re teaching it a very different way in the course to what’s in the book. You can use the two together, but if you’ve tried to read the book and been a bit disheartened, try following through the spreadsheet instead.

[01:22:04]

Maybe if you use Numbers or Google Sheets or something, you can try to create your own version of it on whatever spreadsheet platform you prefer.

Creating Your Own Spreadsheet or Python Implementation

Or you can try to do it yourself from scratch in Python if you want to really test yourself. So there’s some suggestions.

Forum Question about Dummy Variables

Question from Victor Guerrero. In the Excel exercise, when Jeremy is doing some feature engineering, he comes up with two new columns, pclass1 and pclass2. That is true, pclass1 and pclass2. Why is there no pclass3 column?

Explanation of Dummy Variables for Categorical Variables

Is it because if pclass1 is 0 and pclass2 is 0, then pclass3 must be 1?

[01:23:00]

So in a way, two columns are enough to encode the input of the original column? Yes, that’s exactly the reason. So there’s no need to tell the computer about things it can figure out for itself. These are called dummy variables. So when you create dummy variables for a categorical variable with 3 levels, like this one, you need 2 dummy variables. So in general, a categorical variable with n levels needs n-1 columns. I think that’s a good question.
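A minimal sketch of why two columns are enough, using plain Python and a few made-up passenger classes. The column names `pclass1` and `pclass2` follow the spreadsheet; `recover_pclass` is a hypothetical helper added only to show that nothing is lost.

```python
# Made-up passenger classes for illustration.
pclass = [1, 3, 2, 3, 1]

# Two dummy columns for a categorical variable with three levels.
dummies = [{"pclass1": int(p == 1), "pclass2": int(p == 2)} for p in pclass]

def recover_pclass(row):
    """Rebuild the original class: if both dummies are 0, it must be class 3."""
    if row["pclass1"]:
        return 1
    if row["pclass2"]:
        return 2
    return 3

recovered = [recover_pclass(row) for row in dummies]
print(recovered == pclass)
```

Since the original column can be rebuilt exactly, a third `pclass3` column would add no new information.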

Preview of Next Lesson: Natural Language Processing

So what we’re going to be doing in our next lesson is looking at natural language processing. So far we’ve looked at some computer vision and just now we’ve looked at some tabular data, so kind of spreadsheet type data. Next up we’re going to be looking at natural language processing.

Getting Started with NLP for Absolute Beginners Notebook

So I’ll give you a taste of it, so you might want to open up the Getting Started with NLP for Absolute Beginners notebook.

[01:24:06]

So here’s the Getting Started with NLP for Absolute Beginners notebook. I will say as a notebook author, it may sound a bit lame, but I always see when people have upvoted it, it always makes me really happy and it also helps other people find it.

Upvoting Notebooks and Providing Feedback

So remember to upvote these notebooks or any other notebooks you like. I also always read all the comments, so if you want to ask any questions or make any comments, I enjoy those as well.

Introduction to Natural Language Processing

So natural language processing is about, rather than taking image data and making predictions, we take text data. That text data most of the time is in the form of prose, plain English text.

English as the Dominant Language in NLP

English is the most common language used for NLP. There’s NLP models in dozens of different languages nowadays.

[01:25:05]

If you’re a non-English speaker, you’ll find that for many languages there are fewer resources than there are for English.

Opportunity to Contribute NLP Resources in Other Languages

There’s a great opportunity to provide NLP resources in your language. Building NLP resources is actually one of the things that the global Fast.ai community has been fantastic at. For example, the first Farsi NLP resource was created by a student from the very first Fast.ai course. In the Indic languages, some of the best resources have come out of Fast.ai alumni and so forth. That’s a particularly valuable thing you could look at. So if your language is not well represented, that’s an opportunity, not a problem.

NLP Applications: Classification, Sentiment Analysis, Author Identification, Legal Discovery, Document Organization

So some examples of things you could use NLP for.

[01:26:01]

Perhaps the most common and practically useful in my opinion is classification. Classification means you take a document. Now when I say a document, that could just be one or two words. It could be a book. It could be a Wikipedia page. It could be any length. We use the word document, and it sounds like that’s a specific kind of length, but it can be a very short thing or a very long thing. We take a document and we try to figure out a category for it. Now that can cover many, many different kinds of applications. So one common one that we’ll look at a bit is sentiment analysis. So for example, is this movie review positive or negative? Sentiment analysis is very helpful in things like marketing and product development. In big companies there’s lots and lots of information coming in about your product. It’s very nice to be able to quickly sort it out and track metrics from week to week. Something like figuring out what author wrote the document would be an example of a classification exercise because you’re trying to assign a category; in this case, it’s which author.

[01:27:01]

I think there’s a lot of opportunity in legal discovery. There’s already some products in this area where in this case the category is, is this legal document in scope or out of scope in the court case. Just organizing documents, triaging inbound emails, which part of the organization should it be sent to, is it urgent or not, stuff like that. So these are examples of classification.

Similarities between NLP and Image Classification

What you’ll find is when we look at classification tasks in NLP, it’s going to look very, very similar to images.

Using Hugging Face Transformers Library

But what we’re going to do is we’re going to use a different library. The library we’re going to use is called Hugging Face Transformers rather than Fast.ai. And there’s two reasons for that.

Reasons for Using Hugging Face Transformers

The main reason is that I think it’s really helpful to see how things are done in more than one library. Fast.ai has a very layered architecture, so you can do things at a very high level with very little code, or you can dig deeper and deeper and deeper, getting more and more fine-grained.

[01:28:10]

Hugging Face Transformers doesn’t have the same high-level API at all that Fast.ai has, so you have to do more stuff manually. So at this point of the course, we’re going to actually intentionally use a library which is a little bit less user-friendly in order to see what extra steps you have to go through to use other libraries.

Quality of Hugging Face Transformers Library

Having said that, the reason I picked this particular library is that it is particularly good. It has really good models in it, it has a lot of really good techniques in it. Not at all surprising because they have hired lots and lots of Fast.ai alumni, so they have very high-quality people working on it. So, before the next lesson, if you’ve got time, take a look at this notebook and take a look at the data.

[01:29:03]

Data for Next Lesson: Concept Similarity

The data we’re going to be working with is quite interesting. It’s from a Kaggle competition which is trying to figure out, in patents, whether two concepts are referring to the same thing or not, where those concepts are represented as English text.

Concept Similarity as a Classification Task

When you think about it, that is a classification task because the document is basically text1, blah, text2, blah, and then the category is similar or not similar. In fact, in this case, they actually have scores. It’s either going to be 0, 0.25, 0.5, 0.75, or 1, like how similar is it. But it’s basically a classification task when you think of it that way. You can have a look at the data.
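One way to picture this framing is the sketch below. It is only an illustration under assumptions: the `TEXT1:`/`TEXT2:` layout, the helper names, and the example phrases are all made up here and are not taken from the competition data or the notebook. The two pieces of text are joined into a single document, and each of the five possible scores is treated as a category.

```python
# The five possible similarity scores mentioned above.
SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]

def make_document(text1, text2):
    """Join the two concepts into one document (illustrative layout only)."""
    return f"TEXT1: {text1}; TEXT2: {text2}"

def score_to_category(score):
    """Map a similarity score to a category index from 0 to 4."""
    return SCORES.index(score)

# Hypothetical example pair, not from the real dataset.
document = make_document("abatement", "abatement of pollution")
category = score_to_category(0.75)

print(document)
print(category)
```

Seen this way, each row becomes one document with one label, the same shape as any other classification problem.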

Preview of Next Lesson: Validation Sets and Metrics

Next week, we’re going to go step-by-step through this notebook.

[01:30:00]

Conclusion and Next Week’s Lesson

We’re going to take advantage of that as an opportunity also to talk about the really important topics of validation sets and metrics, which are two of the most important topics in not just deep learning, but machine learning more generally. Thanks everybody, I’ll see you next week.