Lesson 5: Practical Deep Learning for Coders 2022

[Jeremy Howard]

Introduction to Lesson 5

Okay, hi everybody and welcome to Practical Deep Learning for Coders Lesson 5. We’re at a stage now where we’re going to be getting deeper and deeper into the details of how these networks actually work. Last week we saw how to use a slightly lower level library than Fast.ai, being Hugging Face Transformers, to train a pretty nice NLP model, and today we’re going to be going back to tabular data and we’re going to be trying to build a tabular model actually from scratch.

Building Tabular Models from Scratch

We’re going to build a couple of different types of tabular models from scratch. So the problem that I’m going to be working through is the Titanic problem, which if you remember back a couple of weeks, is the data set that we looked at in Microsoft Excel, where each row is one passenger on the Titanic. So this is a real world, historic data set. It tells you whether that passenger survived, what class they were in on the ship, their sex, age, how many siblings, how many other family members, how much they spent on the fare, and whereabouts they embarked, one of three different cities.

[00:01:15]

Titanic Problem Revisited

And you might remember that we built a linear model, we then did the same thing using matrix multiplication, and we also created a very, very simple neural network. You know Excel can do nearly everything we need, as you saw, to build a neural network, but it starts to get unwieldy, and so that’s why people don’t use Excel for neural networks in practice, instead we use a programming language like Python. So what we’re going to do today is we’re going to do the same thing with Python.

Linear Model and Neural Net from Scratch Notebook

So we’re going to start working through the linear model and neural net from scratch notebook, which you can find on Kaggle or on the course repository.

[00:02:10]

And today what we’re going to do is we’re going to work through the one in the clean folder. So both for fastbook, the book, and course22, these lessons, the clean folder contains all of our notebooks, but without any prose or any outputs.

Jupyter Notebook Environment

So here’s what it looks like when I open up the linear model and neural net from scratch in Jupyter. What I’m using here is Paperspace Gradient, which, as I mentioned a couple of weeks ago, is what I’m going to be doing most things in. That looks a little bit different to the normal Paperspace Gradient, because the default view for Paperspace Gradient, at least as I do this course, is their rather awkward notebook editor, which at first glance has the same features as the real Jupyter Notebook and JupyterLab environments, but in practice is actually missing lots of things.

[00:03:25]

Paper Space Gradient

So this is the normal Paperspace. So remember, you have to click this button, right? And the only reason you might keep this window running is then you might go over here to the machine to remind yourself, when you close the other tab, to click stop machine. If you’re using the free one, it doesn’t matter too much. And also when I start it, I make sure I’ve got it set to shut down automatically in case I forget. So other than that, we can stay in this tab, because this is JupyterLab that it runs.

[00:04:02]

And you can always switch over to classic Jupyter notebook if you want to. So given that they’ve kind of got tabs inside tabs, I normally maximize it at this point. And it’s really helpful to know the keyboard shortcuts. So control shift, square bracket, right and left, switch between tabs.

Clean Version of Notebook

That’s one of the key things to know about. Okay. So I’ve opened up the clean version of the linear model and neural net from scratch notebook. And so remember when you go back through the video kind of the second time or through the notebook a second time, this is generally what you want to be doing is going through the clean notebook. And before you run each cell, try to think about like, oh, what did Jeremy say? Why are we doing this? What output would I expect? Make sure you get the output you would expect.

[00:05:00]

And if you’re not sure why something is the way it is, try changing it and see what happens. And then if you’re still not sure, well, why did that thing not work the way I expect, you know, search the forum, see if anybody’s asked that question before. And you can ask the question on the forum yourself if you’re still not sure. So as I think we’ve mentioned briefly before, I find it really nice to be able to use the same notebook both on Kaggle and off Kaggle.

Kaggle Environment Check

So most of my notebooks start with basically the same cell, which is something that just checks whether we’re on Kaggle. So Kaggle sets an environment variable. So we can just check for it. And that way we know if we’re on Kaggle. And so then if we are on Kaggle, you know, a notebook that’s part of a competition will already have the data downloaded and unzipped for you. Otherwise, if I haven’t downloaded the data before, then I need to download it and unzip it. So kaggle is a pip installable module. So you would type pip install kaggle.
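Roughly, that first cell looks something like this (a sketch: the variable names and paths here are illustrative, but the pattern is the environment-variable check just described):

```python
# Detect Kaggle via its environment variable; otherwise download and unzip the data.
import os
from pathlib import Path

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')  # set by Kaggle, empty elsewhere

if iskaggle:
    path = Path('../input/titanic')      # competition data is already mounted here
else:
    path = Path('titanic')
    if not path.exists():
        import zipfile, kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)
```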

[00:06:02]

If you’re not sure how to do that, you should check our deep dive lessons to see exactly the steps. But roughly speaking, you can use your console, pip install, and whatever you want to install.

Installing Kaggle Module

Or as we’ve seen before, you can do it directly in a notebook by putting an exclamation mark at the start. So that’s going to run not Python, but a shell command. Okay. So that’s enough to ensure that we have the data downloaded and a variable called path that’s pointing at it. Most of the time we’re going to be using at least PyTorch and NumPy. So we import those so that they’re available to Python. And when we’re working with tabular data, as we’ve talked about before, we’re generally also going to want to use Pandas. And it’s really important that you’re somewhat familiar with the kind of basic API of these three libraries.

[00:07:06]

And I’ve recommended Wes McKinney’s book before, particularly for these ones. One thing, just by the way, is that these things tend to assume you’ve got a very narrow screen, which is really annoying because it always wraps things. So if you put these three lines in as well, it just makes sure that everything is going to use up the screen properly.
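The “three lines” being referred to are display-width settings along these lines (the exact numbers are a matter of taste):

```python
import torch, numpy as np, pandas as pd

# Let each library use the full width of the screen instead of wrapping early.
np.set_printoptions(linewidth=140)
torch.set_printoptions(linewidth=140, sci_mode=False)
pd.set_option('display.width', 140)
```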

Reading Data with Pandas

Okay. So as we’ve seen before, you can read a comma-separated values file with Pandas. And you can take a look at the first few lines and the last few lines and how big it is. And so here’s the same thing as our spreadsheet. Okay. So there’s our data from the spreadsheet. And here it is as a data frame. So if we go DataFrame.isna, that returns a new data frame in which, for every value, it tells us whether or not that particular value is NaN.

[00:08:10]

Handling Missing Values

So NaN is “not a number”. And the most common reason you get that is because it was missing. Okay. So a missing value is obviously not a number. So we, in the Excel version, we did something you should never usually do. We deleted all the rows with missing data. Just because in Excel, it’s a little bit harder to work with. In Pandas, it’s very easy to work with. First of all, we can just sum up what I just showed you. Now, if you call sum on a data frame, it sums up each column. So you can see that there’s kind of some small foundational concepts in Pandas, which when you put them together, take you a long way. So one idea is this idea that you can call a method on a data frame, and it calls it on every value.

[00:09:02]

And then you can call a reduction on that, and it reduces each column. And so now we’ve got the total, and in Python, and Pandas, and NumPy, and PyTorch, you can treat a Boolean as a number, and true will be 1, false will be 0. So this is the number of missing values in each column. So we can see that Cabin, out of 891 rows, it’s nearly always empty. Age is empty a bit of the time.
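As a sketch (assuming pandas is imported as pd and path points at the unzipped Titanic data from the cell above):

```python
df = pd.read_csv(path/'train.csv')

# isna() marks each value as missing or not; sum() reduces each column,
# and True counts as 1, so this is the number of missing values per column.
df.isna().sum()
```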

Imputing Missing Values with Mode

Embarked is almost never empty. So if you remember from Excel, we need to multiply a coefficient by each column. That’s how we create a linear model. So how would you multiply a coefficient by a missing value? You can’t. There are lots of ways of doing what’s called imputing missing values, that is, replacing a missing value with a number. The easiest, which always works, is to replace missing values with the mode of a column. The mode is the most common value.

[00:10:01]

That works both for categorical variables, it’s the most common category, and continuous variables, it’s the most common number. So you can get the mode by calling df.mode. One thing that’s a bit awkward is that if there’s a tie for the mode, so there’s more than one thing that’s the most common, it’s going to return multiple rows, so I need to return the zeroth row. So here is the mode of every column. So we can replace the missing values for age with 24, and the missing values for cabin with b96, b98, and embarked with s.
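In code, that’s along these lines:

```python
modes = df.mode().iloc[0]   # mode() can return several rows on ties, so take the zeroth row
modes
```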

Pandas Methods and Google Search

I’ll just mention in passing, I am not going to describe every single method we call and every single function we use. And that is not because you’re an idiot if you don’t already know them. Nobody knows them all.

[00:11:00]

But I don’t know which particular subset of them you don’t know. So let’s assume, just to pick a number at random, that the average fast.ai student knows 80% of the functions we call. Then I could tell you what every function is, in which case 80% of the time I’m wasting your time because you already know. Or I could pick 20% of them at random, in which case I’m still not helping because most of the time it’s not the ones you don’t know. My approach is that for the ones that are pretty common, I’m just not going to mention it at all because I’m assuming that you’ll Google it, right? So it’s really important to know, so for example, if you don’t know what iloc is, that’s not a problem. It doesn’t mean you’re stupid, right? It just means you haven’t used it yet and you should Google it, right? So I mentioned in this particular case, you know, this is one of the most important pandas methods because it gives you the row located at this index, i for index and loc for location, so this is the zeroth row.

[00:12:04]

But yeah, I did kind of go through things a little bit quickly on the assumption that students, fast.ai students, are, you know, proactive, curious people. And if you’re not a proactive, curious person, then you could either decide to become one for the purpose of this course or maybe this course isn’t for you. All right, so a data frame has a very convenient method called fillna and that’s going to replace the not a numbers with whatever I put here.

Filling Missing Values with fillna

And the nice thing about pandas is it kind of has this understanding that columns match to columns. So it’s going to take the mode from each column and match it to the same column in the data frame and fill in those missing values. Normally, that would return a new data frame. Many things, including this one in pandas, have an inplace argument that says actually modify the original one.

[00:13:01]

And so if I run that, now if I call .isna().sum(), they’re all zero. So that’s like the world’s simplest way to get rid of missing values.
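Putting those two steps together, roughly:

```python
df.fillna(modes, inplace=True)   # fill each column's NaNs with that column's mode, in place
df.isna().sum()                  # now all zeros
```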

Simple Imputation Methods

Okay, so why did we do it the world’s simplest way? Because, honestly, this doesn’t make much difference most of the time. And so I’m not going to spend time the first time I go through and build a baseline model doing complicated things when I don’t necessarily know that I need complicated things. And so imputing missing values is an example of something that most of the time, this dumb way, which always works without even thinking about it, will be quite good enough, you know, for nearly all the time. So we keep things simple where we can. John, question.

[John]

Question on Imputation Assumptions

Jeremy, we’ve got a question on this topic.

[00:14:00]

Javier is commenting on the assumption involved in substituting with the mode. And he’s asking, in your experience, what are the pros and cons of doing this versus, for example, discarding cabin or age as fields that we even train the model?

[Jeremy Howard]

Importance of Data and Feature Engineering

Yeah, so I would certainly never throw them out, right? There’s just no reason to throw away data. And there’s lots of reasons to not throw away data. So, for example, when we use the fast.ai library, which we’ll use later, one of the things it does, which is actually a really good idea, is it creates a new column for everything that’s got missing values, which is Boolean, which is, did that column have a missing value for this row? And so maybe it turns out that cabin being empty is a great predictor. So, yeah, I don’t throw out rows and I don’t throw out columns. Okay. So it’s helpful to understand a bit more about our data set, and there’s a really helpful (I’ve already imported this), you know, quick method for that.

[00:15:09]

Data Exploration with describe

And again, it’s kind of nice to know a few quick things you can do to get a picture of what’s happening in your data, and one of those is describe. And so with describe, you can say, okay, describe all the numeric variables. And that gives me a quick sense of what’s going on here. So we can see Survived clearly is just 0s and 1s, because all of the quartiles are 0s and 1s. It looks like Pclass is 1, 2, 3. What else do we see? Fare is an interesting one, right? Lots of smallish numbers and one really big number, so probably long-tailed. So it’s, yeah, good to have a look at this to see what’s going on for your numeric variables. So as I said, fare looks kind of interesting. To find out what’s going on there, I would generally go with a histogram.

[00:16:00]

Histograms and Long Tail Distributions

So if you can’t quite remember what a histogram is, again, Google it. But in short, it shows you, for each amount of fare, how often does that fare appear? And it shows me here that the vast majority of fares are less than $50. But there’s a few right up here to 500. So this is what we call a long tail distribution, a small number of really big values and lots of small ones. There are some types of model which do not like long tail distributions. Linear models are certainly one of them. And neural nets are generally better behaved without them as well. Luckily, there’s an almost surefire way to turn a long tail distribution into a more reasonably centered distribution, and that is to take the log.

Log Transformation for Long Tail Distributions

We use logs a lot in machine learning. For those of you that haven’t touched them since year 10 math, it would be a very good time to like go to Khan Academy or something and remind yourself about what logs are and what they look like, because they’re actually really, really important.

[00:17:12]

But the basic shape of the log curve causes it to make, you know, really big numbers less really big and doesn’t change really small numbers very much at all. So if we take the log, well, log of 0 is negative infinity, which we don’t want. So a useful trick is to just take the log of the value plus 1. And in fact, there is a log1p function if you want to do that. It does the same thing. So if we look at the histogram of that, you can see it’s much more, you know, it’s more sensible now. It’s kind of centered and it doesn’t have this big long tail. So that’s pretty good. So we’ll be using that column in the future. As a rule of thumb, stuff like money or population, things that kind of can grow exponentially, you very often want to take the log of.

[00:18:05]

So if you have a column with a dollar sign on it, that’s a good sign. It might be something to take the log of.
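A rough sketch of that transformation (I’m using log1p here, which computes log(1 + x); the LogFare column name is just a choice):

```python
import numpy as np

df['Fare'].hist();                     # long tail: mostly under $50, a few fares near 500

df['LogFare'] = np.log1p(df['Fare'])   # log(1 + x), so a fare of 0 maps to 0 rather than -inf
df['LogFare'].hist();                  # much more reasonably distributed
```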

Categorical Variables and Dummy Variables

So there was another one here, which is we had a numeric column which actually doesn’t look numeric at all. It looks like it’s actually categories. So pandas gives us .unique(). And so we can see, yep, 1, 2, and 3 are all the levels of Pclass. That’s first class, second class, or third class. We can also describe all the non-numeric variables. And so we can see here that, not surprisingly, names are unique, because the count of names is the same as the count unique. There’s two sexes, 681 different tickets, 147 different cabins, and three levels of embarked. So we cannot multiply the letter S by a coefficient, or the word male by a coefficient.

[00:19:16]

So what do we do? What we do is we create something called dummy variables.

Creating Dummy Variables with get_dummies

Dummy variables are, and we can just go get dummies, a column that says, for example, is sex female? Is sex male? Is p class 1? Is p class 2? Is p class 3? So for every possible level of every possible categorical variable, it’s a Boolean column of did that row have that value of that column. So I think we’ve briefly talked about this before, that there’s a couple of different ways we can do this. One is that for an n level categorical variable, we could use n levels, in which case we also need a constant term in our model.

[00:20:05]

Pandas, by default, shows all n levels, although you can pass an argument to change that if you want. Here we are, drop first. I kind of like having all of them sometimes, because then you don’t have to put in a constant term, and it’s a bit less annoying, and it can be a bit easier to interpret, but I don’t feel strongly about it either way.

Drop First Argument in get_dummies

Okay, so here’s a list of all of the columns that Pandas added. I guess, strictly speaking, I probably should have automated that, but never mind. I just copied and pasted them. And so here are a few examples of the added columns. In Unix, Pandas, and lots of things like that, head means the first few rows, or the first few lines, five by default in Pandas. So here you can see they’re never both male and female, they’re never neither, they’re always one or the other.
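A sketch of those dummy-variable steps (the generated column names assume the usual Titanic values):

```python
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3',
              'Embarked_C', 'Embarked_Q', 'Embarked_S']
df[added_cols].head()   # each row has a 1 for exactly one level of each original variable

# pd.get_dummies(..., drop_first=True) would instead keep n-1 levels per variable
```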

[00:21:07]

All right, so with that now we’ve got numbers, which we can multiply by coefficients.

Ignoring Name Column for Now

It’s not going to work for name, obviously, because we’d have 891 columns, and all of them would be unique. So we’ll ignore that for now. That doesn’t mean we have to always ignore it. And in fact, something I did do on the forum topic is I made a list of some nice Titanic notebooks that I found, and quite a few of them really go hard on this name column.

Feature Engineering with Name Column

And in fact, one of them, yeah, this one, in what I believe is, yes, Chris Deotte’s first ever Kaggle notebook.

[00:22:00]

He’s now the number one ranked Kaggle notebook person in the world. So this is a very good start. He got a much better score than any model that we’re going to create in this course, using only that name column. And basically, yeah, he came up with this simple little decision tree by recognizing, you know, all of the information that’s in a name column. So yeah, we don’t have to treat, you know, a big string of letters like this as a random big string of letters. We can use our domain expertise to recognize that things like Mr. have meaning, and that people with the same surname might be in the same family, and actually figure out quite a lot from that. But that’s not something I’m going to do. I’ll let you look at those notebooks if you’re interested in the feature engineering. And I do think that they’re very interesting, so do check them out.

[00:23:01]

Our focus today is on building a linear model on a neural net from scratch, not on tabular feature engineering, even though that’s also a very important subject.

Matrix Multiplication and Element-wise Multiplication

Okay. So we talked about how matrix multiplication makes linear models much easier. And the other thing we did in Excel was element-wise multiplication. Both of those things are much easier if we use PyTorch instead of plain Python.

PyTorch for Linear Models

Or we could use NumPy. But I tend to just stick with PyTorch when I can, because it’s easier to learn one library than two. So I just do everything in PyTorch. I almost never touch NumPy nowadays. They’re both great. But they do everything each other does, except PyTorch also does differentiation and GPUs, so why not just learn PyTorch? So to turn a column into something that I can do PyTorch calculations on, I have to turn it into a tensor.

[00:24:06]

Tensors and Broadcasting

So a tensor is just what NumPy calls an array. It’s what mathematicians would call either a vector or a matrix, or once you go to higher ranks, mathematicians and physicists just call them tensors. In fact, this idea originally in computer science came from a notation developed in the 50s called APL, which was turned into a programming language in the 60s by a guy called Ken Iverson. And Ken Iverson actually came up with this idea from, he said, his time doing tensor analysis in physics. So these areas are very related. So we can turn the survived column into a tensor, and we’ll call that tensor our dependent variable.

Independent and Dependent Variables

That’s the thing we’re trying to predict. Okay. So now we need some independent variables. So our independent variables are age, siblings.

[00:25:06]

That one is, oh yeah, a number of other family members. The log of fare that we just created, plus all of those dummy columns we added. And so we can now grab those values and turn them into a tensor. And we have to make sure they’re floats. We want them all to be the same data type, and PyTorch wants things to be floats if you’re going to multiply things together. So there we are. And so one of the most important attributes of a tensor, probably the most important attribute is its shape, which is how many rows does it have, and how many columns does it have.
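A sketch of building those tensors (the column names assume the Titanic data plus the LogFare and dummy columns created above):

```python
import torch
from torch import tensor

t_dep = tensor(df.Survived)   # dependent variable: did the passenger survive?

indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols
t_indep = tensor(df[indep_cols].values, dtype=torch.float)   # all floats, so we can multiply
t_indep.shape   # e.g. torch.Size([891, 12])
```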

Tensor Shape and Rank

The length of the shape is called its rank.

[00:26:05]

That’s the rank of the tensor. It’s the number of dimensions or axes that it has. So a vector is rank one. A matrix is rank two. A scalar is rank zero. And so forth. I try not to use too much jargon, but there’s some pieces of jargon that are really important, because otherwise you’re going to have to say the length of the shape again and again. It’s much easier to say rank. So we’ll use that word a lot. So a table is a rank two tensor. Okay, so we’ve now got the data in good shape. Here’s our independent variables, and we’ve got our dependent variable.

Multiplying Coefficients by Data

So we can now go ahead and do exactly what we did in Excel, which is to multiply our rows of data by some coefficients.

[00:27:09]

And remember, to start with, we create random coefficients.

Random Coefficients Initialization

So we’re going to need one coefficient for each column. Now, in Excel, we also had a constant, but in our case now, we’ve got every column, every level in our dummy variable, so we don’t need a constant. So the number of coefficients we need is equal to the shape of the independent variables, and it’s the index one element. That’s the number of columns. That’s how many coefficients we want. So we can now ask PyTorch to give us some random numbers, n coef of them. They’re between zero and one. So if we subtract a half, then they’ll be centered. And there we go. Before I do that, I set the seed.
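As a sketch (the seed value is arbitrary; it just makes the “random” numbers repeatable, as discussed next):

```python
torch.manual_seed(442)

n_coeff = t_indep.shape[1]           # one coefficient per column
coeffs = torch.rand(n_coeff) - 0.5   # uniform in [0, 1), shifted to be centered on zero
```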

Setting Random Seed

What that means is, in computers, computers in general cannot create truly random numbers.

[00:28:09]

Instead, they can calculate a sequence of numbers that behave in a random-like way. That’s actually good for us, because often in my teaching, I like to be able to say, you know, in the prose, oh, look, that was two, now it’s three, or whatever. And if I was using really random numbers, then I couldn’t do that, because it would be different each time. So this makes my results reproducible. That means if you run it, you’ll get the same random numbers as I do, by saying start the pseudo-random sequence with this number. I’ll mention in passing, a lot of people are very, very into reproducible results. They think it’s really important to always do this. I strongly disagree with that.

Reproducibility and Understanding Data Variation

In my opinion, an important part of understanding your data is understanding how much it varies from run to run.

[00:29:04]

So if I’m not teaching, and wanting to be able to write things about these pseudo-random numbers, I almost never use a manual seed. Instead, I like to run things a few times and get an intuitive sense of, like, oh, this is, like, very, very stable. Or, oh, this is all over the place. Getting an intuitive understanding of how your data behaves is really valuable. Okay, this next line is one of the coolest lines of code you’ll ever see. I know it doesn’t look like much, but think about what it’s doing.

Matrix-Vector Product and Broadcasting

Yeah, that’ll do. Okay, so we’ve multiplied a matrix by a vector. Now, that’s pretty interesting. Now, mathematicians amongst you will know that you can certainly do a matrix-vector product, but that’s not what we’ve done here.

[00:30:04]

At all. We’ve used element-wise multiplication. So normally, if we did the element-wise multiplication of two vectors, it would multiply, you know, element one with element one, element two with element two, and so forth, and create a vector of the same size output. But here, we’ve done a matrix times a vector. How does that work? This is using the incredibly powerful technique of broadcasting. And broadcasting, again, comes from APL, a notation invented in the 50s and a programming language developed in the 60s. And it’s got a number of benefits. Basically, what it’s going to do is it’s going to take each coefficient and multiply them in turn by every row in our matrix. So if you look at the shape of our independent variable and the shape of our coefficients, you can see that each one of these coefficients can be multiplied by each of these 891 values in turn.

[00:31:15]

And so the reason we call it broadcasting is it’s as if this is 891 rows by 12 columns. It’s as if this was broadcast 891 times. It’s as if we had a loop looping 891 times and doing coefficients times row 0, coefficients times row 1, coefficients times row 2, and so forth, which is exactly what we want. Now, reasons to use broadcasting.

Benefits of Broadcasting

Obviously, the code is much more concise. It looks more like math rather than clunky programming with lots of boilerplate. So that’s good. Also, that broadcasting all happened in optimized C code.

[00:32:02]

And in fact, if it’s being done on a GPU, it’s being done in optimized GPU assembler. It’s going to run very, very fast indeed. And this is the trick of why we can use a so-called slow language like Python to do very fast big models is because a single line of code like this can run very quickly on optimized hardware on lots and lots of data. The rules of broadcasting are a little bit subtle and important to know.

NumPy Broadcasting Rules

And so I would strongly encourage you to Google NumPy broadcasting rules and see exactly how they work. But, you know, the kind of intuitive understanding of them hopefully you’ll get pretty quickly, which is generally speaking, you can kind of, as long as the last axes match, it’ll broadcast over those axes.

[00:33:04]

You can broadcast a rank 3 thing with a rank 1 thing or, you know, most simple version would be tensor 1, 2, 3 times 2. So broadcast a scalar over a vector. That’s exactly what you would expect. So it’s copying effectively that 2 into each of these spots, multiplying them together. But it doesn’t use up any memory to do that.
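For example (the shapes assume the 891-row, 12-column t_indep built earlier):

```python
tensor([1, 2, 3]) * 2    # scalar broadcast over a vector -> tensor([2, 4, 6])

t_indep * coeffs         # coeffs (shape [12]) is broadcast over all 891 rows of t_indep
```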

Virtual Copying in Broadcasting

It’s kind of a virtual copying, if you like. So this line of code, independents times coefficients, is very, very important. And it’s the key step that we wanted to take, which is now we know exactly what happens when we multiply the coefficients in.

Adding Coefficients Together

And if you remember back to Excel, we did that product, and then in Excel there’s a sum product.

[00:34:10]

We then added it all together, because that’s what a linear model is. It’s the coefficients times the values added together. So we’re now going to need to add those together. But before we do that, if we did add up this row, you can see that the very first value has a very large magnitude, and all the other ones are small.

Normalizing Data for Optimization

Same with row 2, same with row 3, same with row 4. What’s going on here? Well, what’s going on is that the very first column was age. And age is much bigger than any of the other columns. It’s not the end of the world, but it’s not ideal, right? Because it means that a coefficient of, say, 0.5 times age means something very different to a coefficient of, say, 0.5 times log fare, right?

[00:35:03]

And that means that that random coefficient we start with, it’s going to mean very different things for different columns, and that’s going to make it really hard to optimize. So we would like all the columns to have about the same range. So what we could do, as we did in Excel, is to divide them by the maximum. So the maximum, so we did it for age, and we also did it for fare in this case. I didn’t use log. So we can get the max of each column by calling .max, and you can pass in a dimension.

PyTorch Dimensions and Axes

Do you want to reduce over the rows or over the columns? We want to reduce over the rows, so we pass in dimension 0. So those different parts of the shape are called either axes or dimensions. PyTorch calls them dimensions. So that’s going to give us the maximum of each column.

[00:36:01]

And if you look at the docs for PyTorch’s max function, it’ll tell you it returns two things, the actual value of each maximum and the index of which row it was. We want the values. So now, thanks to broadcasting, we can just say take the independent variables and divide them by the vector of values. Again, we’ve got a matrix and a vector. And so this is going to do an element-wise division of each row of this divided by this vector. Again, in a very optimized way. So if we now look at our normalized independent variables by the coefficients, you can see they’re all pretty similar values.
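A sketch of that normalization step:

```python
vals, indices = t_indep.max(dim=0)   # maximum of each column (reducing over the rows)
t_indep = t_indep / vals             # broadcast: divide every row by that vector of maxima
```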

Normalizing Data with Maximum or Standard Deviation

So that’s good. There’s lots of different ways of normalizing, but the main ones you’ll come across is either dividing by the maximum or subtracting the mean and dividing by the standard deviation. It normally doesn’t matter too much.

[00:37:02]

Because I’m lazy, I just pick the easier one. And being lazy and picking the easier one is a very good plan, in my opinion. So now that we can see that multiplying them together is working pretty well, we can now add them up.

Adding Coefficients and Predictions

And now we want to add up over the columns. And that would give us predictions. Now, obviously, just like in Excel, when we started out, they’re not useful predictions because they’re random coefficients, but they are predictions nonetheless. And here’s the first 10 of them. So then remember, we want to use gradient descent to try to make these better.

Gradient Descent and Loss Function

So to do gradient descent, we need a loss, right? The loss is the measure of how good or bad are these coefficients. My favorite loss function as a kind of, like, don’t think about it, just chuck something out there, is the mean absolute value.

Mean Absolute Value Loss Function

And here it is.

[00:38:01]

torch dot abs of the difference between predictions and targets, take the mean. And often stuff like this, you’ll see people will use prewritten mean absolute error functions, which is also fine. But I quite like to write it out, because I can see exactly what’s going on. No confusion. No chance of misunderstanding. So those are all the steps I’m going to need to create coefficients, run a linear model, and get its loss.
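Those steps, and the two little functions they get packaged into next, look roughly like this (a sketch; the function names follow the ones mentioned in the lesson):

```python
preds = (t_indep * coeffs).sum(axis=1)    # linear model: multiply and add up over the columns
loss = torch.abs(preds - t_dep).mean()    # mean absolute error

def calc_preds(coeffs, indeps): return (indeps * coeffs).sum(axis=1)
def calc_loss(coeffs, indeps, deps): return torch.abs(calc_preds(coeffs, indeps) - deps).mean()
```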

Creating Functions for Calculations

So what I like to do in my notebooks, like, not just for teaching, but all the time, is to, like, do everything step by step manually, and then just copy and paste the steps into a function. So here’s my calc preds function, is exactly what I just did, right? Here’s my calc loss function, exactly what I just did. Um, and that way, you know, I, a lot of people, like, go back and delete all their explorations, or they, like, do them in a different notebook, or they’re, like, working in an IDE, they’ll go and do it in some, you know, line-oriented REPL, whatever.

[00:39:11]

But if you, you know, think about the benefits of keeping it here. When you come back to it in six months, you’ll see exactly why you did what you did and how we got there, or if you’re showing it to your boss or your colleague, you can see, you know, exactly what’s happening, what does each step look like. I think this is really very helpful indeed. I know not many people code that way, but I feel strongly that it’s a huge productivity win to individuals and teams. So remember from our gradient descent from scratch, that the one bit we don’t want to do from scratch is calculating derivatives, because it’s just menial and boring.

Calculating Derivatives with requires_grad_

So to get PyTorch to do it for us, you have to say, well, what things do you want derivatives for? And, of course, we want it for the coefficients. So then we have to say requires_grad_.

[00:40:00]

And remember, very important, in PyTorch, if there’s an underscore at the end, that’s an in-place operation. So this is actually going to change coefs. It also returns them, right, but it also changes them in place. So now we’ve got exactly the same numbers as before, but with requires_grad turned on. So now when we calculate our loss, that doesn’t do any other calculations, but what it does store is a gradient function. It’s the function that PyTorch has remembered it would have to run back through those steps to calculate the gradients. And to say, oh, please actually call that backward gradient function, you call backward.

Backward Gradient Function and .grad Attribute

And at that point, it sticks the coefficients’ gradients into a .grad attribute. So this tells us that if we increased the age coefficient, the loss would go down. So therefore, we should do that, right?

[00:41:04]

So since the gradient is negative, increasing that coefficient would decrease the loss.

Updating Coefficients with Gradient Descent

That means we need to, if you remember back to the gradient descent from scratch notebook, we need to subtract the gradients times the learning rate from the coefficients. So we haven’t got any particular ideas yet of how to set the learning rate. So for now, I just try a few and see what works best. In this case, I found .1 worked pretty well. So I now subtract. So again, this is sub underscore. So subtract in place from the coefficients, the gradient times the learning rate.

Learning Rate and Loss Reduction

And so the loss has gone down. That’s great. From .54 to .52. So there is one step. So we’ve now got everything we need to train a linear model.
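Putting that single gradient-descent step together, a rough sketch (the 0.1 learning rate is the one mentioned above):

```python
coeffs.requires_grad_()                    # in-place: track gradients for these coefficients

loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()                            # fills in coeffs.grad

with torch.no_grad():                      # don't track the update step itself
    coeffs.sub_(coeffs.grad * 0.1)         # subtract gradient times learning rate, in place
    coeffs.grad.zero_()

print(calc_loss(coeffs, t_indep, t_dep))   # the loss should have gone down a little
```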

[00:42:05]

Training a Linear Model

So let’s do it. Now, as we discussed last week, to see whether your model’s any good, it’s important that you split your data into training and validation.

Splitting Data into Training and Validation Sets

For the Titanic data set, it’s actually pretty much fine to use a random split. Because back when my friend Mark Eaton and I actually created this competition for Kaggle many years ago, that’s basically what we did, if I remember correctly. So we can split them randomly into a training set and a validation set. So we’re just going to use Fast.ai for that. There’s, you know, it’s very easy to do it manually with NumPy or PyTorch. You can use scikit-learn’s train_test_split. I’m using Fast.ai’s here, partly because it’s easy just to remember one way to do things, and this works everywhere. And partly because in the next notebook, we’re going to be seeing how to do more stuff in Fast.ai. So I want to make sure we have exactly the same split.
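A sketch of that split using fast.ai’s RandomSplitter (the seed value is illustrative; what matters is that the same seed gives the same split next time):

```python
from fastai.data.transforms import RandomSplitter

trn_split, val_split = RandomSplitter(seed=42)(df)    # two lists of row indexes

trn_indep, val_indep = t_indep[trn_split], t_indep[val_split]
trn_dep,   val_dep   = t_dep[trn_split],   t_dep[val_split]
```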

[00:43:04]

Fast.ai for Data Splitting

So those are a list of the indexes of the rows that will be, for example, in the validation set. That’s why I call it validation split. So to create the validation independent variables, you have to use those to index into the independent variables. And ditto for the dependent variables. And so now we’ve got our independent variable training set and our validation set, and we’ve also got the same for the dependent variables. So like I said before, I normally take stuff that I’ve already done in a notebook, seems to be working, and put them into functions.

Creating Functions for Training Steps

So here’s the step which actually updates coefficients. So let’s chuck that into a function. And then the steps that calculate the loss, call backward, update the coefficients, and then print the loss, we’ll chuck those in one function.

[00:44:01]

So just copying and pasting stuff into cells here. And then the bit on the very top of the previous section that got the random numbers, minus 0.5 requires grad, chuck that in the function. So here we’ve got something that initializes coefficients, something that does one epoch by updating coefficients. So we can put that together into something that trains the model for n epochs with some learning rate by setting the manual seed, initializing the coefficients, doing one epoch in a loop, and then return the coefficients.
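Assembled into functions, it looks roughly like this (the epoch count, learning rate, and seed are illustrative values; calc_loss is the helper sketched earlier):

```python
def init_coeffs(): return (torch.rand(n_coeff) - 0.5).requires_grad_()

def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad(): update_coeffs(coeffs, lr)
    print(f'{loss:.3f}', end='; ')

def train_model(epochs=30, lr=0.01):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for _ in range(epochs): one_epoch(coeffs, lr)
    return coeffs

coeffs = train_model(18, lr=0.1)
```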

Training the Model with train_model Function

So let’s go ahead and run that function. So it’s printing at the end of each one the loss. And you can see the loss going down from 0.53, down, down, down, down, down, to a bit under 0.3. So that’s good. We have successfully built and trained a linear model on a real data set. I mean, it’s a Kaggle data set. But it’s important to, like, not underestimate how real Kaggle data sets are.

[00:45:04]

Real World Data Sets and Kaggle

They’re real data. And this one’s a playground data set. So it’s not like anybody actually cares about predicting who survived the Titanic, because we already know. But it has all the same features of, you know, different data types and missing values and normalization and so forth. So, you know, it’s a good, it’s a good playground. So it’d be nice to see what the coefficients are attached to each variable.

Examining Coefficients

So if we just zip together the independent variables and the coefficients, and we don’t need the grad anymore, and create a dict of that, there we go. So it looks like older people had less chance of surviving. That makes sense. Males had less chance of surviving. Also makes sense. So it’s good to kind of eyeball these and check that they seem reasonable.

[00:46:02]

Now the metric for this Kaggle competition is not mean absolute error.

Accuracy Metric for Kaggle Competition

It’s accuracy. Now, of course, we can’t use accuracy as a loss function, because it doesn’t have a sensible gradient, really. But we should measure accuracy to see how we’re doing, because that’s going to tell us how we’re going against the thing that the Kaggle competition cares about. So we can calculate our predictions. And we’ll just say, okay, well, any time the prediction’s over 0.5, we’ll say that’s predicting survival. So those are our predictions of survival. These are the actual values in the validation set. So if they’re the same, then we predicted it correctly.

Calculating Accuracy

So here’s, are we right or wrong for the first 16 rows? We’re right more often than not. So if we take the mean of those, remember true equals one, then that’s our accuracy.
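As a rough sketch, the accuracy check and the function it becomes look like this (val_indep and val_dep are the validation tensors from the split above):

```python
def acc(coeffs):
    preds = calc_preds(coeffs, val_indep)
    # a prediction over 0.5 counts as "survived"; compare with the actual values
    return (val_dep.bool() == (preds > 0.5)).float().mean()

acc(coeffs)   # around 0.79 for the linear model at this point in the lesson
```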

[00:47:00]

So we were right about 79% of the time. So that’s not bad. Okay, so we’ve successfully created something that’s actually predicting who survived the Titanic.

Creating an Accuracy Function

That’s cool, from scratch. So let’s create a function for that, an accuracy function that just does what I showed. And there it is. Now, I’ll say another thing, like, you know, my, a weird coding thing for me, you know, weird as in not that common is I use less comments than most people because all of my code lives in notebooks.

Code Comments and Notebooks

And of course, the real version of this notebook is full of prose, right? So when I’ve taken people through a whole journey about what I’ve built here and why I’ve built it and what intermediate results are and check them along the way, the function itself, you know, for me doesn’t need extensive comments. You know, I’d rather explain the thinking of how I got there and show examples of how to use it and so forth.

[00:48:07]

Okay. Now, here’s the first few predictions we made.

Sigmoid Function for Binary Dependent Variables

And some of the time we’re predicting negative values for survival and values greater than one for survival. Which doesn’t really make much sense, right? People either survived (one) or they didn’t (zero). It would be nice if we had a way to automatically squish everything between zero and one. That’s going to make it much easier to optimize. The optimizer doesn’t have to try hard to hit exactly one or hit exactly zero, but it can just, like, try to create a really big number to mean survived or a really small number to mean perished. Here’s a great function.

[00:49:03]

Sigmoid Function for Squishing Values

Here’s a function that, as I increase, let’s make it an even bigger range, as my numbers get beyond four or five, it’s asymptoting to one. And on the negative side, as they get beyond negative four or five, they asymptote to zero. Or, to zoom in a bit. But then around about zero, it’s pretty much a straight line. This is actually perfect. This is exactly what we want. So here is the equation, one over one plus e to the minus x. And this is called the sigmoid function. By the way, if you haven’t checked out SymPy before, definitely do so. This is the symbolic Python package, which can do, it’s kind of like Mathematica or Wolfram style symbolic calculations, including the ability to plot symbolic expressions, which is pretty nice.
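For example, a quick SymPy sketch of that curve:

```python
import sympy

x = sympy.symbols('x')
sympy.plot(1 / (1 + sympy.exp(-x)), (x, -5, 5));   # the sigmoid function
```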

[00:50:13]

SymPy for Symbolic Calculations

PyTorch already has a sigmoid function. I mean, it just calculates this, but it does it in an optimized way. So what if we replaced calc preds? Remember, before calc preds was just this. What if we took that and then put it through a sigmoid? So calc preds will now basically, the bigger this number is, the closer it’s going to get to one, and the smaller it is, the closer it’s going to get to zero. This should be a much easier thing to optimize, all of our values are in a sensible range. Now, here’s another cool thing about using Jupyter plus Python.

Dynamic Language and Redefining Functions

Python is a dynamic language.

[00:51:03]

Even though I’ve already defined calc preds, and train model calls one epoch, which calls calc loss, which calls calc preds, I can redefine calc preds now, and I don’t have to do anything else. That’s now inserted into Python’s symbol table, and that’s the calc preds that train model will eventually call. So if I now call train model, that’s actually going to call my new version of calc preds. So that’s a really neat way of doing exploratory programming in Python. I wouldn’t, you know, release a library that redefines calc preds multiple times. You know, when I’m done, I would just keep the final version, of course. But it’s a great way to try things, as you’ll see.
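The redefinition is just the old calculation wrapped in a sigmoid, roughly:

```python
def calc_preds(coeffs, indeps):
    # same linear combination as before, now squashed into (0, 1)
    return torch.sigmoid((indeps * coeffs).sum(axis=1))
```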

Improved Optimization with Sigmoid

And so look what’s happened. I found I was able to increase the learning rate from .1 to 2.

[00:52:02]

It was much easier to optimize, as I guessed. And the loss has improved from .295 to .197. The accuracy has improved from .79 to .82, nearly .83. So as a rule, this is something that we’re pretty much always going to do when we have a binary dependent variable.

Importance of Sigmoid for Binary Dependent Variables

So with a dependent variable that’s one or zero, the very last step is to chuck it through a sigmoid. Generally speaking, if you’re wondering why is my model with a binary dependent variable not training very well, this is the thing you want to check. Are you chucking it through a sigmoid, or is the thing you’re calling chucking it through a sigmoid or not? It can be surprisingly hard to find out if that’s happening. So, for example, with Hugging Face Transformers, I actually found I had to look in their source code to find out.

[00:53:03]

And I discovered that something I was using wasn’t, and it didn’t seem to be documented anywhere. But it is important to find these things out.

Neural Net Architecture Details

As we’ll discuss in the next lesson, we’ll talk a lot about neural net architecture details. But the details we’ll focus on are what happens to the inputs at the very first stage, and what happens to the outputs at the very last stage. We’ll talk a bit about what happens in the middle, but a lot less. And the reason why is that it’s the things you put into the inputs that are going to change for every single data set you do, and what you want to happen to the outputs is going to change for every different target that you’re trying to hit. So, those are the things that you actually need to know about.

Sigmoid Function and Fast.ai

So, for example, this thing of like, well, you need to know about the sigmoid function. And you need to know that you need to use it. Fast.ai is very good at handling this for you.

[00:54:01]

That’s why we haven’t had to talk about it much until now. If you say, oh, it’s a category block dependent variable, you know, it’s going to use the right kind of thing for you. But most things are not so convenient. John, is there a question?

[John]

Question on get_dummies and Test Data

Yes, there is. It’s back in the sort of the feature engineering topic. But a couple of people have liked it. So, I thought we’d put it out there. So, Shivam says, one concern I have while using get dummies. So, it’s in that get dummies phase. What happens when using test data? Say I have a new category, let’s say male, female, and other. And this will have an extra column missing from the training data. How do you take care of that?

[Jeremy Howard]

Handling New Categories in Test Data

That’s a great question. Yeah. So, normally, you’ve got to think about this pretty carefully and check pretty carefully. Unless you use fast.ai. So, fast.ai always creates an extra category called other.

[00:55:05]

And at test time, inference time, if you have some level that didn’t exist before, we put it into the other category for you. Otherwise, you basically have to do that yourself. Or at least check, you know. Generally speaking, it’s pretty likely that otherwise your extra level will be silently ignored, you know, because it’s going to be in the data set, but it’s not going to be matched to a column. So, yeah, it’s a good point and definitely worth checking.

Categorical Variables with Many Levels

For categorical variables with lots of levels, I actually normally like to put the less common ones into an other category. And, again, that’s something that fast.ai will do for you automatically. But, yeah, definitely something to keep an eye out for. Good question.

[00:56:02]

Submitting to Kaggle

Okay. So, before we take our break, we’ll just do one last thing, which is we will submit this to Kaggle, because I think it’s quite cool that we have successfully built a model from scratch. So, Kaggle provides us with a test.csv, which is exactly the same structure as the training CSV, except that it doesn’t have a survived column.

Test.csv and Data Consistency

Now, interestingly, when I tried to submit to Kaggle, I got an error in my code saying that, oh, one of my fares is empty. So, that was interesting, because the training set doesn’t have any empty fares. So, sometimes this will happen, that the training set and the test set have different things to deal with. So, in this case, I just said, oh, there’s only one row. I don’t care. So, I just replaced the empty one with a zero for fare. So, then I just copied and pasted the preprocessing steps from my training data frame and stuck them here for the test data frame and the normalization as well.

[00:57:06]

Preprocessing Test Data

And so, now I just call calc preds, is it greater than .5, turn it into a zero or one, because that’s what Kaggle expects, and put that into the survived column, which previously, remember, didn’t exist. So, then finally, I create a data frame with just the two columns, ID and survived, stick it in a CSV file, and then I can call the Unix command head, just to look at the first few rows. And if you look at the Kaggle competition’s data page, you’ll see this is what the submission file is expected to look like.

Kaggle Submission and Results

So, that made me feel good, so I went ahead and submitted it. And I remember I got, I think, basically right in the middle, about 50%, you know, better than half the people who have entered the competition, worse than half the people. So, you know, a solid middle-of-the-pack result for a linear model from scratch, I think, is a pretty good result.

[00:58:05]

So, that’s a great place to start.

Break

So, let’s take a 10-minute break. We’ll come back at 7.17 and continue on our journey. All right.

Welcome Back

Welcome back. You might remember from Excel that after we did the sum product version, we then replaced it with a matrix multiply.

Matrix Multiplication in PyTorch

Wait, not there, must be here. Here we are. With a matrix multiply. So, let’s do that step now. So, matrix times vector, dot sum over axis 1, is the same thing as a matrix multiply.

[00:59:05]

So, here is the times dot sum version.

Matrix Multiply Operator in Python

Now, we can’t use the * character for a matrix multiply, because it means element-wise operation. All of the times, plus, minus, divide in PyTorch and NumPy mean element-wise, so corresponding elements. So, in Python, instead, we use the @ character. As far as I know, it’s pretty arbitrary. It’s one of the ones that wasn’t used. So, that is an official Python, it’s a bit unusual, it’s an official Python operator. It means matrix multiply. But Python doesn’t come with an implementation of it. So, because these are PyTorch tensors, it’ll use PyTorch’s implementation. And as you can see, they’re exactly the same. So, we can now just simplify a little bit what we had before. calc preds is now torch dot sigmoid of the matrix multiply.
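So the simplified version is roughly:

```python
def calc_preds(coeffs, indeps):
    return torch.sigmoid(indeps @ coeffs)   # @ is Python's matrix-multiply operator
```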

[01:00:01]

Matrix Multiplication for Neural Networks

Now, there is one thing I’d like to move towards now is that we’re going to try to create a neural net in a moment. And so, that means rather than treat this as a matrix times a vector, I want to treat this as a matrix times a matrix, because we’re about to add some more columns of coefficients.

Changing init_coefs to Create a Matrix

So, we’re going to change init coefs, so that rather than creating an n coef vector, we’re going to create an n coef by one matrix. So, in math, we would probably call that a column vector, but I think that’s kind of a dumb name in some ways, because it’s a matrix, right? It’s a rank two tensor. So, the matrix multiply will work fine either way, but the key difference is that if we do it this way, then the result of the matrix multiply will also be a matrix.

[01:01:05]

It’ll be, again, an n rows by one matrix.

Dependent Variable as a Matrix

That means when we compare it to the dependent variable, we need the dependent variable to be an n rows by one matrix as well. So, effectively, we need to take the n rows long vector and turn it into an n rows by one matrix.

Adding a Trailing Dimension with None

So, there’s some useful, very useful, and at first maybe a bit weird, notation in PyTorch NumPy for this, which is if I take my training dependent variables vector, I index into it, and colon means every row, right? So, in other words, that just means the whole vector, right? It’s the same, basically, as that.

[01:02:01]

And then I index into a second dimension. Now, this doesn’t have a second dimension. So, there’s a special thing you can do, which is if you index into a second dimension with a special value none, it creates that dimension. So, this has the effect of adding an extra trailing dimension to train dependents.
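A sketch of those two changes: the dependent variables become n-by-1 matrices, and init coefs now returns an n-coefficient-by-1 matrix:

```python
trn_dep = trn_dep[:, None]   # shape (n,) -> (n, 1): add a trailing unit axis
val_dep = val_dep[:, None]

def init_coeffs(): return (torch.rand(n_coeff, 1) - 0.5).requires_grad_()
```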

Unit Axis and Matrix Shape

So, it turns it from a vector to a matrix with one column. So, if we look at the shape after that, as you see, it’s now got, we call this a unit axis. It’s now, if we train our model, we’ll get coefficients just like before, except that it’s now a column vector, also known as a rank-two tensor with a trailing unit axis.

[01:03:15]

Expanding to Neural Networks

Okay, so that hasn’t changed anything. It’s just repeated what we did in the previous section, but it’s kind of set us up to expand. Because now that we’ve done this using matrix multiply, we can go crazy, and we can go ahead and create a neural network.

Neural Network with Multiple Coefficients

So, with our neural network, remember back to the Excel days. Notice here, it’s the same idea, right? Except we didn’t create a column vector; we actually created a matrix with kind of two sets of coefficients. So, when we did our matrix multiply, every row gave us two sets of outputs. Which we then chuck through ReLU, right?

[01:04:02]

Which, remember, we just used an if statement. And we added them together. So, our coefs now, to make a proper neural net, we need one set of coefs here.

Initializing Coefficients for Hidden Layers

And so, here they are, torch.rand, n coef by what? Well, in Excel, we just did two, because I kind of got bored of getting everything working properly. But you don’t have to worry about filling things in and creating columns and blah, blah, blah. In PyTorch, you can create as many as you like. So, I made something you can change. I called it nhidden, number of hidden activations. And I just set it to 20. And as before, we centralize them by making them go from minus 0.5 to 0.5. Now, when you do stuff by hand, everything does get more fiddly.

Fiddling with Constants and Learning Rate

If our coefficients are too big or too small, it’s not going to train at all.

[01:05:06]

Basically, the gradients still kind of vaguely point in the right direction, but you’ll jump too far or not far enough or whatever. So, I want my gradients to be about the same as they were before. So, I divide by nhidden, because otherwise at the next step when I add up the next matrix multiply, it’s going to be much bigger than it was before. So, it’s all very fiddly. So then, I want to take, so that’s going to give me, for every row, it’s going to give me 20 activations, 20 values, right?

Activations and Matrix Multiplication

Just like in Excel, we had two values, because we had two sets of coefficients. And so, to create a neural net, I now need to multiply each of those 20 things by a coefficient, and this time it’s going to be a column vector, because I want to create one output, predictor of survival.

[01:06:01]

Coefficients for Hidden to Output Layer

So, again, torch.rand, and this time it’ll be nhidden coefficients by one. And again, trying to find something that actually trains properly required some fiddling around to figure out how much to subtract, and I found if I subtract 0.3, I could get it to train. And then finally, I didn’t need a constant term for the first layer, as we discussed, because our dummy variables have, you know, n columns rather than n-1 columns, but layer 2 absolutely needs a constant term, okay?

Constant Term for Second Layer

And we could do that, as we discussed last time, by having a column of ones, although in practice I actually find it’s just easier just to create a constant term, okay? So here is a single scalar random number. So those are the coefficients we need. One set of coefficients to go from input to hidden, one that goes from hidden to a single output, and a constant.

[01:07:03]

Initializing Coefficients for Neural Network

So they’re all going to need grad. And so now we can change how we calculate predictions.

Calculating Predictions with calc_preds

So we’re going to pass in all of our coefficients. So a nice thing in Python is if you’ve got a list or a tuple of values, on the left-hand side, you can expand them out into variables. So this is going to be a list of three things. So we’ll call them l1 and l2, layer 1 and layer 2, and the constant term, because those are the list of three things we returned. So in Python, if you just chuck things with commas between them like this, it creates a tuple. A tuple is basically an immutable list. So now we’re going to grab those three things. So step one is to do our matrix multiply. And as we discussed, we then have to replace the negatives with zeros. And then we put that through our second matrix multiply, so our second layer, and add the constant term.

[01:08:01]

Neural Network Implementation

And remember, of course, at the end, chuck it through a sigmoid. So here is a neural network. Now, update_coeffs previously subtracted the gradients times the learning rate from the coefficients.

Updating Coefficients for Multiple Layers

But now we’ve got three sets of those. So we have to just chuck that in a for loop. So change that as well. And now we can go ahead and train our model.

Training the Neural Network

Ta-da! We just trained a model. And how does that compare? So the loss is a little better than before. Accuracy, exactly the same as before. And, you know, I will say it was very annoying to get to this point, trying to get these constants right and find a learning rate that worked. It was super fiddly.

[01:09:00]

But, you know, we got there.

Comparing Linear Model and Neural Network

We got there. And it’s a very small test set. I don’t know if this is necessarily better or worse than the linear model, but it’s certainly fine. And I think that’s pretty cool that we were able to build a neural net from scratch. That’s doing pretty well. But I hear that all the cool kids nowadays are doing deep learning, not just neural nets.

Deep Learning with Multiple Hidden Layers

So we better make this deep learning. So this one only has one hidden layer. So let’s create one with n hidden layers.

Initializing Coefficients for Multiple Layers

So, for example, let's say we want two hidden layers, 10 activations in each. You can put as many as you like here, right? So init_coeffs now is going to have to create a torch.rand for every one of those hidden layers, and then another torch.rand for each of your constant terms. Stick requires_grad_ on all of them.

[01:10:01]

And then we can return that. So that's how we can initialize as many layers of coefficients as we want.
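Here's a rough sketch of that more general init_coeffs. As before, n_coeff is assumed to hold the number of input columns, and the scaling constants are illustrative, since the exact values took some fiddling.

    import torch

    def init_coeffs():
        hiddens = [10, 10]                    # size of each hidden layer; add as many as you like
        sizes = [n_coeff] + hiddens + [1]     # input size, each hidden layer, then one output
        n = len(sizes)
        # one weight matrix per pair of adjacent sizes
        layers = [(torch.rand(sizes[i], sizes[i+1]) - 0.3) / sizes[i+1] * 4 for i in range(n-1)]
        # one scalar constant term per layer
        consts = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(n-1)]
        for l in layers + consts: l.requires_grad_()
        return layers, consts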

Matrix Multiplication for Multiple Layers

So the first one, the first layer, so the sizes of each one: the first matrix will go from n_coeff to 10, the second matrix will go from 10 to 10, and the third matrix will go from 10 to 1. So it's worth, like, working through these matrix multiplies on a spreadsheet or a piece of paper or something to kind of convince yourself that there's the right number of activations at each point. And so then we need to update calc_preds so that rather than doing each of these steps manually, we now need to loop through all the layers, do the matrix multiply, add the constant, and, as long as it's not the last layer, do the relu.

Activation Functions and Final Layer

Why not the last layer? Because, remember, the last layer has sigmoid. So this business of remembering what happens on the last layer is an important thing you need to know about.

[01:11:07]

It's something you need to check if things aren't working. This thing here is called the activation function: torch.sigmoid and F.relu are the activation functions for these layers. One of the most common mistakes amongst people trying to create their own architectures, or variants of architectures, is to mess up their final activation function, and that makes things very hard to train. So make sure we've got a torch.sigmoid at the end and no relu at the end.

Importance of Final Activation Function

So there’s our deep learning calc preds.

Deep Learning calc_preds Function

And then just one last change is now when we update our coefficients, we go through all the layers and all the constants.
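A sketch of that last change: the update now just walks over every layer and every constant.

    def update_coeffs(coeffs, lr):
        layers, consts = coeffs
        for layer in layers + consts:      # step every weight matrix and every constant term
            layer.sub_(layer.grad * lr)
            layer.grad.zero_()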

Updating Coefficients for Multiple Layers

And again, there was so much messing around here with trying to find, like, exact ranges of random numbers that end up training okay.

[01:12:03]

But eventually I found some, and as you can see, it gets to about the same accuracy.

Experimenting with Code in Notebooks

This code is worth spending time with. And when the code’s inside a function, it can be a little difficult to experiment with. So, you know, what I would be inclined to do to understand this code is to kind of copy and paste this cell, make it so it’s not in a function anymore, and then use control shift dash to separate these out into separate cells, right? And then try to kind of set it up so you can run a single layer at a time or a single coefficient, like, make sure you can see what’s going on, okay? And that’s why we use notebooks, is so that we can experiment. And it’s only through experimenting like that, that at least for me, I find that I can really understand what’s going on.

[01:13:00]

Understanding Code through Experimentation

Nobody can look at this code and immediately say, "I get it, that all makes perfect sense"; at least, I don't think anybody can. But once you try running through it yourself, you'll be like, oh, I see why that's as it is. So, you know, one thing to point out here is that our neural nets and deep learning models didn't particularly seem to help.

Deep Learning and Small Data Sets

So does that mean that deep learning is a waste of time and you just did five lessons that you shouldn’t have done? No, not necessarily. This is a playground competition. We’re doing it because it’s easy to get your head around. But for very small data sets like this with very, very few columns, and the columns are really simple, you know, deep learning is not necessarily going to give you the best result. In fact, as I mentioned, nothing we do is going to be as good as a carefully designed model that uses just the name column.

[01:14:15]

So, you know, I think that’s an interesting insight, right?

Deep Learning for Images and Text

It's that for the kinds of data which have a very consistent structure, like, for example, images or natural language text documents, quite often you can somewhat brainlessly chuck a deep learning neural net at them and get a great result. Generally, for tabular data, I find that's not the case.

Feature Engineering for Tabular Data

I find I normally have to think pretty long and hard about the feature engineering in order to get good results. But once you’ve got good features, you then want a good model. And so you, you know, and generally, like, the more features you have and the more levels in your categorical features and stuff like that, you know, the more value you’ll get from more sophisticated models.

[01:15:08]

But, yeah, I definitely would say an insight here is that, you know, you want to include simple baselines as well.

Importance of Simple Baselines

And we’re going to be seeing even more of that in a couple of notebooks time.

Why You Shouldn’t Build from Scratch

So we’ve just seen how you can build stuff from scratch. We’ll now see why you shouldn’t. I mean, I say you shouldn’t. You should to learn, but why you probably won’t want to in real life. When you’re doing stuff in real life, you don’t want to be fiddling around with all this annoying initialization stuff and learning rate stuff and dummy variable stuff and normalization stuff and so forth. Because we can do it for you. And it’s not like everything’s so automated that you don’t get to make choices.

[01:16:03]

But you want, like, you want to make the choice not to do things the obvious way and have everything else done the obvious way for you. So that’s why we’re going to look at this, why you should use a framework notebook.

Framework Notebook for Tabular Data

And again, I’m going to look at the clean version of it. And again, in the clean version of it, step one is to download the data as appropriate for the Kaggle or non-Kaggle environment and set the display options and set the random seed. And read the data frame. All right. Now, there was so much fussing around with the doing it from scratch version that I did not want to do any feature engineering.

Feature Engineering with Fast.ai

Because every column I added was another thing I had to think about, dummy variables and normalization and random coefficient initialization and blah, blah, blah. But with a framework, everything’s so easy, you can do all the feature engineering you want.

[01:17:02]

Because this isn’t a lesson about feature engineering.

Advanced Feature Engineering Tutorial

Instead, I plagiarized entirely from this fantastic advanced feature engineering tutorial on Kaggle. And what this tutorial found was that in addition to the log fare we've already done, you can do cool stuff with the deck, with adding up the number of family members, whether people are traveling alone, and how many people are on each ticket. And finally, we're going to do stuff with the name column, which is we're going to grab the Mr., Miss, Mrs., Master, whatever. So we're going to create a function to do some feature engineering.

Feature Engineering Function

And if you want to learn a bit of pandas, here’s some great lines of code to step through one by one. And again, take this out of a function, put them into individual cells, run each one, look up the tutorials.

Pandas Functions and Tutorials

What does str do?

[01:18:01]

What does map do? What does groupby and transform do? What does value_counts do? Part of the reason I put this here was for folks that haven't done much of any pandas to have some, you know, examples of functions that I think are useful. And I actually refactored this code quite a bit to try to show off some features of pandas I think are really nice. So we'll do the same random split as before, passing in the same seed.
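To make those concrete, here's a rough sketch of the kind of add_features function being described, using exactly those pandas methods. The column names follow the Titanic CSV; the deck groupings and title list come from the feature engineering tutorial mentioned above, so treat the details as illustrative.

    import numpy as np
    import pandas as pd

    def add_features(df):
        df['LogFare'] = np.log1p(df['Fare'])
        # first letter of the cabin gives the deck; .str indexes into each string
        df['Deck'] = df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG"))
        df['Family'] = df.SibSp + df.Parch       # total number of other family members aboard
        df['Alone'] = df.Family == 0             # travelling alone?
        # how many passengers share each ticket: groupby + transform keeps one count per row
        df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')
        # pull the title (Mr, Miss, Mrs, Master, ...) out of the name
        df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
        df['Title'] = df.Title.map(dict(Mr="Mr", Miss="Miss", Mrs="Mrs", Master="Master"))

    add_features(df)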

Fast.ai Tabular Model Data Set

And so now we’re going to do the same set of steps that we did manually with fast.ai. So we want to create a tabular model data set based on a pandas data frame. And here is the data frame. These are the train versus validation splits I want to use. Here’s a list of all the stuff I want done, please. Deal with dummy variables for me, deal with missing values for me, normalize continuous variables for me.

[01:19:04]

Preprocessing with Fast.ai

I’m going to tell you which ones are the categorical variables. So here’s, for example, preclass was a number, but I’m telling fast.ai to treat it as categorical. Here’s all the continuous variables. Here’s my dependent variable. And the dependent variable is a category. So create data loaders from that place. And save models right here in this directory. That’s it. That’s all the preprocessing I need to do, even with all those extra engineered features.

Creating a Learner

Create a learner. Okay, so this, remember, is something that contains a model and data. And I want it to put in two hidden layers with 10 units and 10 units, just like we did in our final from-scratch example.
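Which, with the data loaders above, is a one-liner; a sketch, assuming the dls from the previous step.

    learn = tabular_learner(dls, metrics=accuracy, layers=[10, 10])  # two hidden layers of 10 units each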

Learning Rate Finder

What learning rate should I use?

[01:20:00]

Make a suggestion for me, please. So call lr find. You can use this for any fast.ai model. Now, what this does is it starts at a learning rate that’s very, very small, 10 to the negative 7. And it puts in one batch of data, and it calculates the loss. And then it increases the learning rate slightly and puts through another batch of data. And it keeps doing that for higher and higher learning rates. And it keeps track of the loss as it increases the learning rate. Just one batch of data at a time. And what happens is, for the very small learning rates, nothing happens. But then once you get high enough, the loss starts improving. And then as it gets higher, it improves faster. Until you make the learning rate so big that it overshoots, and then it kills it. And so generally, somewhere around here is the learning rate you want.

Choosing a Learning Rate

Fast.ai has a few different ways of recommending a learning rate. You can look up the docs to see what they mean. I generally find if you choose slide and valley and pick one between the two, you get a pretty good learning rate.

[01:21:05]

So here we’ve got about 0.01 and about 0.08. So I picked 0.03. So just run a bunch of epochs.

Training the Model with Fast.ai

Away it goes. Ta-da! This is a bit crazy. After all that, we've ended up with exactly the same accuracy as the last two models. That's just a coincidence, right? I mean, there's nothing special about that accuracy. And so at this point, we can now submit that to Kaggle.

Submitting to Kaggle with Fast.ai

Now, remember with the linear model, we had to repeat all of the pre-processing steps on the test set in exactly the same way? You don't have to worry about that with fast.ai. I mean, we still have to deal with the fill-missing for fare ourselves.

Testdl Function for Inference Time Preprocessing

We have to add our feature engineering features. But for all the pre-processing, we just have to use this one function, called test_dl.

[01:22:00]

That says: create a data loader that contains exactly the same pre-processing steps that our learner used. And that's it. That's all you need. It's just because you want to make sure that your inference-time transformations and pre-processing are exactly the same as at training time. So this is the magic method which does that. Just one line of code.
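A sketch of what that looks like, assuming the path, add_features, and learn objects from the earlier steps; the value used to fill the missing fare is an assumption.

    tst_df = pd.read_csv(path/'test.csv')
    tst_df['Fare'] = tst_df.Fare.fillna(0)     # the one manual step: fill the missing fare
    add_features(tst_df)                       # same engineered features as the training set
    tst_dl = learn.dls.test_dl(tst_df)         # apply exactly the training-time preprocessing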

Getting Predictions with Fast.ai

And then to get your predictions, you just say get_preds and pass in that data loader we just built. And so then these three lines of code are the same as the previous notebook. And we can take a look at the top, and as you can see, there it is. So how did that go? I don't remember exactly; I think it was, again, basically middle of the pack, if I remember correctly.
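Those three lines are roughly the following; a sketch, with the submission file name as an assumption.

    preds, _ = learn.get_preds(dl=tst_dl)              # probabilities for each class
    tst_df['Survived'] = (preds[:, 1] > 0.5).int()     # threshold to get 0/1 predictions
    sub_df = tst_df[['PassengerId', 'Survived']]
    sub_df.to_csv('sub.csv', index=False)              # ready to submit to Kaggle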

[01:23:04]

So one of the nice things about now that it’s so easy to, like, add features and build models, is we can experiment with things much more quickly.

Experimenting with Ensembling

So I’m going to show you how easy it is to experiment with, you know, what’s often considered a fairly advanced idea, which is called ensembling.

Ensembling for Improved Predictions

There’s lots of ways of doing ensembling, but basically ensembling is about creating multiple models and combining their predictions. And the easiest kind of ensemble to do is just to literally just build multiple models. And so each one is going to have a different set of randomly initialized coefficients, and therefore each one’s going to end up with a different set of predictions. So I just create a function called ensemble, which creates a learner, exactly the same as before, fits, exactly the same as before, and returns the predictions.

[01:24:04]

Ensemble Function for Multiple Models

And so we’ll just use a list comprehension to do that five times. So that’s going to create a set of five predictions. Done. So now we can take all those predictions and stack them together and take the mean over the rows.

Combining Predictions with Mean

So that’s going to give us the, what’s actually, sorry, the mean over the, over the first dimension. So the mean over the sets of predictions. And so that will give us the average prediction of our five models. And again, we can turn that into a CSV and submit it to Kaggle.

Kaggle Submission with Ensemble

And that one, I think, went a bit better. Let's check. Yeah, okay. So that one actually finally gets into the top 20 to 25% in the competition.

[01:25:01]

So, I mean, not amazing by any means, but you can see that, you know, this simple step of creating five independently trained models, just starting from different starting points in terms of random coefficients, actually improved us from top 50% to top 25%. John.

[John]

Question on Ensemble Mode vs. Mean

Is there an argument, because you’ve got a categorical result, you’re 0, 1 effectively, is there an argument that you might use the mode of the ensemble rather than the numerical mean?

[Jeremy Howard]

Different Averaging Methods for Ensembles

I mean, yes, there’s an argument that’s been made. And, yeah, something I would just try. I generally find it’s less good, but not always. And I don’t feel like I’ve got a great intuition as to why. And I don’t feel like I’ve seen any studies as to why. You could predict, like there’s a few, there’s at least three things you could do, right?

[01:26:00]

You could threshold each model's prediction at 0.5 to get ones and zeros, and average those. Or you could take the mode of them. Or you could take the actual probability predictions, take the average of those, and then threshold that. And I've seen examples where each of the two different averaging versions has been better. I don't think I've seen one where the mode's better, but that was very popular back in the 90s. So, yeah, it's so easy to try, you might as well give it a go.

Random Forests Notebook

Okay, we don’t have time to finish the next notebook, but let’s make a start on it. So the next notebook is random forests, how random forests really work.

Introduction to Random Forests

Who here has heard of random forests before?

[01:27:03]

Nearly everybody. Okay. So, very popular. Developed, I think, initially in 1999, but, you know, gradually growing in popularity during the 2000s. Everybody kind of knew me as Mr. Random Forests for years; I implemented them a couple of days after the original technical report came out. I was such a fan. All of my early Kaggle results were random forests. I love them.

Elegance and Resilience of Random Forests

And I think hopefully you’ll see why I’m such a fan of them, because they’re so elegant and they’re almost impossible to mess up. A lot of people will say like, oh, why are you using machine learning?

Logistic Regression vs. Random Forests

Why don’t you use something simple like logistic regression? And I think like, oh gosh, in industry, I’ve seen far more examples of people screwing up logistic regression than successfully using logistic regression, because it’s very, very, very, very difficult to do correctly.

[01:28:07]

You know, you’ve got to make sure you’ve got the correct transformations and the correct interactions and the correct outlier handling and blah, blah, blah, and anything you get wrong, the entire thing falls apart. Random forests, it’s very rare that I’ve seen somebody screw up a random forest in industry.

Difficulty of Implementing Logistic Regression

They’re very hard to screw up, because they’re so resilient. And you’ll see why. So in this notebook, just by the way, rather than importing numpy and pandas and matplotlib and blah, blah, blah, there’s a little handy shortcut, which is if you just import everything from fastai.imports, that imports all the things that you normally want.

Importing from fastai.imports

So, I mean, it doesn’t do anything special, but it just saves some messing around. So again, we’ve got our cell here to grab the data.

[01:29:03]

Preprocessing Data for Random Forests

And I’m just going to do some basic preprocessing here with my fill in a for the fare. I only need it for the test set, of course. Grab the modes and do the fill in a on the modes. Take the log fare. And then I’ve got a couple of new steps here, which is converting embarked insects into categorical variables.

Converting Categorical Variables to Codes

What does that mean? Well, let's just run this on both the data frame and the test data frame. Split the columns into categorical and continuous. And sex is a categorical variable, so let's look at it. Well, that's interesting. It looks exactly the same as before, male and female. But now it's got a dtype of category, and it's got a list of categories. What's happened here?

[01:30:00]

Well, what’s happened is pandas has made a list of all of the unique values of this field. And behind the scenes, if you look at the cat codes, you can see behind the scenes, it’s actually turned them into numbers. It looks up this one into this list to get male. Looks up this zero into this list to get female. So when you print it out, it prints out the version, but it stores it as numbers.

Why Categorical Codes are Helpful

Now, you’ll see in a moment why this is helpful. But a key thing to point out is we’re not going to have to create any dummy variables. And even that first, second, or third class, we’re not going to consider that categorical at all. And you’ll see why in a moment. A random forest is an ensemble of trees.

[01:31:00]

Random Forests as Ensembles of Trees

A tree is an ensemble of binary splits. And so we're going to work from the bottom up: we're going to first learn about what a binary split is.

Binary Splits in Decision Trees

And we’re going to do it by looking at an example. Let’s consider what would happen if we took all the passengers on the Titanic and grouped them into males and females.

Example of Binary Split with Sex

And let’s look at two things. The first is let’s look at their survival rate. So about 20% survival rate for males and about 75% for females. And let’s look at the histogram. How many of them are there? About twice as many males as females. Consider what would happen if you created the world’s simplest model, which was what sex are they? Well, it wouldn’t be bad, would it? Because there’s a big difference between the males and the females, a huge difference in survival rate. So if we said, oh, if you’re a man, you probably died.

[01:32:01]

If you’re a woman, you probably survived. Or not just a man or a boy, so a male or a female. That would be a pretty good model because it’s done a good job of splitting it into two groups that have very different survival rates. This is called a binary split. A binary split is something that splits the rows into two groups, hence binary. Let’s talk about another example of a binary split. I’m getting ahead of myself.

Evaluating a Binary Split Model

Before we do that, let’s look at what would happen if we used this model. So if we created a model which just looked at sex, how good would it be? So to figure that out, we first have to split into training and test sets.

Splitting into Training and Test Sets

So let’s go ahead and do that. And then let’s convert all of our categorical variables into their codes. So we’ve now got 0, 1, 2, whatever. We don’t have male or female there anymore.

[01:33:04]

And let’s also create something that returns the independent variables, which we’ll call the x’s, and the dependent variable, which we’ll call y.

Creating Independent and Dependent Variables

And so we can now get the x’s and the y’s for each of the training set and the validation set. And so now let’s create some predictions.

Making Predictions with Binary Split

We’ll predict that they survived if their sex is 0, so if they’re female. So how good is that model? Remember I told you that to calculate mean absolute error, we can get scikit-learn or PyTorch or whatever to do it for us instead of doing it ourselves.

Using scikit-learn for Mean Absolute Error

So just showing you, here’s how you do it just by importing it directly. This is exactly the same as the one we did manually in the last notebook. So that’s a 21.5% error. So that’s a pretty good model.
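Putting that together, a sketch of the split, the code conversion, and the one-rule model; the seed and the 25% validation fraction are assumptions.

    from numpy import random
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    random.seed(42)
    trn_df, val_df = train_test_split(df, test_size=0.25)
    trn_df[cats] = trn_df[cats].apply(lambda x: x.cat.codes)   # replace categories with their codes
    val_df[cats] = val_df[cats].apply(lambda x: x.cat.codes)

    def xs_y(df):
        xs = df[cats + conts].copy()                   # the independent variables
        return xs, df[dep] if dep in df else None      # and the dependent variable, if present

    trn_xs, trn_y = xs_y(trn_df)
    val_xs, val_y = xs_y(val_df)

    preds = val_xs.Sex == 0                    # predict "survived" if female (code 0)
    mean_absolute_error(val_y, preds)          # about 0.215 on this split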

[01:34:03]

Could we do better?

Example of Binary Split with Fair

Well, here’s another example. What about fair? So fair is different to sex because fair is continuous, or log fair I’ll take. But we could still split it into two groups. So here’s, for all the people that didn’t survive, this is their median fair here, and then this is their quartiles. For bigger fairs and quartiles for smaller fairs. And here’s the median fair for those that survived and their quartiles. So you can see the median fair for those that survived is higher than the median fair for those that didn’t. We can’t create a histogram exactly for fair because it’s continuous. We could bucket it into groups to create a histogram. So I guess we can create a histogram. That wasn’t true.

Kernel Density Plot for Continuous Variables

What I should say is we could create something better, which is a kernel density plot, which is just like a histogram, but it’s like with infinitely small bins.

[01:35:05]

So we can see most people have a log fare of about two. So what if we split at a bit under three? You know, that seems to be a point at which there's a difference in survival between people that are greater than or less than that amount. So here's another model: log fare greater than 2.7. Oh, much worse, 0.336 versus 0.215. Well, I don't know. Maybe something else would be better.

Interactive Tool for Binary Split Scoring

We could create a little interactive tool. So what I want is something that can give us a quick score of how good a binary split is. And I want it to be able to work regardless of whether we’re dealing with categorical or continuous or whatever data.

[01:36:05]

Scoring Binary Splits

So I just came up with a simple little way of scoring, which is I said, OK, if you split your data into two groups, a good split would be one in which all of the values of the dependent variable on one side are all pretty much the same, and all of the dependent variables on the other side are all pretty much the same. For example, if pretty much all the males had the same survival outcome, which is didn’t survive, and all the females had about the same survival outcome, which is they did survive, that would be a good split, right? It doesn’t just work for categorical variables. It would work if your dependent variable was continuous as well. You basically want each of your groups within group to be as similar as possible on the dependent variable. And then the other group, you want them to be as similar as possible on the dependent variable. So how similar is all the things in a group?

[01:37:02]

Standard Deviation and Group Size

That’s a standard deviation. So what I want to do is basically add the standard deviations of the two groups of the dependent variable. And then if there’s a really small standard deviation, but it’s a really small group, that’s not very interesting. So I multiply it by the size, right? So this is something which says, what’s the score for one of my groups, one of my sides? It’s the standard deviation multiplied by how many things are in that group. So the total score is the score for the left-hand side, so all the things in one group, plus the score for the right-hand side, which is, tilde means not, so not left-hand side is right-hand side.

Total Score for Binary Split

And then we’ll just take the average of that. So for example, if we split by sex is greater than or less than 0.5, that’ll create two groups, males and females, and that gives us this score.

[01:38:01]

Scoring Binary Splits with Interact

And if we do log fare, greater than or less than 2.7, it gives us this score, and lower scores are better. So sex is better than log fare. So now that we've got that, we can use our favorite interact tool to create a little GUI. And so we can say, you know, let's try, oh, what about this one? Can we find something that's better? No, not very good. What about Pclass? 0.468, 0.460. So we can fiddle around with these. We can do the same thing for the categorical variables. So we already know that with sex, we can get to 0.407. What about embarked? Hmm. All right. So it looks like sex might be our best.
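The little GUI is just ipywidgets' interact wrapped around that score function; a sketch, with the default slider values as assumptions.

    from ipywidgets import interact

    def iscore(nm, split):
        return score(trn_xs[nm], trn_y, split)

    interact(nm=conts, split=15.5)(iscore)     # a dropdown of continuous columns plus a split slider
    interact(nm=cats, split=2)(iscore)         # and the same for the categorical columns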

[01:39:02]

Finding the Best Binary Split Automatically

Well, that was pretty inefficient, right? It would be nice if we could find some automatic way to do all that. Well, of course we can. For example, let's say we want to find the best split point for age; let's do this again.

Finding the Best Split Point for Age

If we want to find the best split point for age, we can just create a list of all of the unique values of age and try each one in turn and see what score we get if we made a binary split on that level of age. So here’s a list of all of the possible binary split thresholds for age. Let’s go through all of them for each of them, calculate the score.

Argmin Function for Finding Minimum Score

And then NumPy and PyTorch have an argmin function which tells you what index into that list is the smallest. So just to show you, here’s the scores.

[01:40:04]

And 0, 1, 2, 3, 4, 5, 6: so apparently that value has the smallest score. So that tells us that for age, the threshold of 6 would be best. So here's something that just calculates that for a column.

Function for Calculating Best Split Point

It calculates the best split point. So here’s 6, right? And it also tells us what the score is at that point, which is 0.478. So now we can just go through and calculate the score for the best split point for each column.
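A sketch of that helper, plus the loop over every column that comes next; np, score, and the trn_df from above are assumed.

    def min_col(df, nm):
        col, y = df[nm], df[dep]
        unq = col.dropna().unique()                     # every candidate threshold for this column
        scores = np.array([score(col, y, o) for o in unq])
        idx = scores.argmin()                           # index of the smallest score
        return unq[idx], scores[idx]

    min_col(trn_df, "Age")                              # best threshold for age, and its score

    {o: min_col(trn_df, o) for o in cats + conts}       # best split point and score for every column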

Finding the Best Split Point for All Columns

And if we do that, we find that the lowest score is the one for sex.

[01:41:08]

So that is how we calculate the best binary split.

Best Binary Split for Titanic Data

So we now know that the model that we created earlier, this one, is the best single binary split model we can find.

Decision Trees and Random Forests

So next week, we’re going to learn how we can recursively do this to create a decision tree and then do that multiple times to create a random forest. But before we do, I want to point something out, which is this ridiculously simple thing, which is find a single binary split in stock, is a type of model.

1R Model and its Effectiveness

It has a name. It's called 1R. And in a review of machine learning methods in the 90s, the 1R model turned out to be one of the best, if not the best, machine learning classifiers for a wide range of real-world datasets.

[01:42:07]

So that is to say, don’t assume that you have to go complicated.

Importance of Simple Baselines

It’s not a bad idea to always start creating a baseline of 1R, a decision tree with a single binary split. And in fact, for the Titanic competition, that’s exactly what we do.

Kaggle Sample Submission with 1R

If you look at the Titanic competition on Kaggle, you'll find that our sample submission is one that just splits on male versus female.

Conclusion and Next Lesson

All right. Thanks, everybody. I hope you found that interesting, and I will see you next lesson. Bye.