Lesson 6: Practical Deep Learning for Coders 2022

[Jeremy Howard]

Lesson 6: Practical Deep Learning for Coders 2022

Okay, so welcome back to, well not welcome back to, welcome to Lesson 6, first time we’ve been in Lesson 6. Welcome back to Practical Deep Learning for Coders. We just started looking at tabular data last time, and for those of you who’ve forgotten, what we did was we were looking at the Titanic dataset, and we were looking at creating binary splits by looking at categorical variables or binary variables like sex and continuous variables like the log of the fare that they paid.

Tabular Data: Titanic Dataset

And using those, you know, we also kind of came up with a score, which was basically how good a job did that split do of grouping the survival characteristics into two groups, you know, one where nearly everybody survived and one where nearly nobody survived, so they had a small standard deviation in each group.

[00:01:20]

And so then we created the world’s simplest little UI to allow us to fiddle around and try to find a good binary split. And we did come up with a very good binary split, which was on sex, and then we also created an automated version that found that split for us.

Creating a Machine Learning Algorithm from Scratch

And so this is, I think, the first time we can, well, not quite the first time, no, this is yet another time, I should say, that we have successfully created an actual machine learning algorithm from scratch. This one is about the world’s simplest one. It’s 1R, creating the single rule, which does a good job of splitting your dataset into two parts which differ as much as possible on the dependent variable.

[00:02:10]

1R Rule

1R is probably not going to cut it for a lot of things, though. It’s surprisingly effective, but maybe we could go a step further. And the obvious step further we could go is we could create something like a 2R.

2R Rule

What if we took each of those groups, males and females in the Titanic dataset, and split each of those into two other groups? So split the males into two groups and split the females into two groups. So to do that, we can repeat the exact same piece of code we just did, but let’s remove sex from it, and then split the dataset into males and females, and run the same piece of code that we just did before, but just for the males. And so this is going to be like a 1R rule for how do we predict which males survive the Titanic.
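Here’s a rough, minimal sketch of that idea in pandas, just to make it concrete. The scoring function is a reconstruction of last lesson’s weighted-standard-deviation score, and the names (`df`, `Survived`, `Sex` coded as 0/1, the candidate column list) are assumptions about the notebook rather than its exact code.

```python
# Minimal sketch of the "2R" step, assuming `df` is the prepared Titanic
# DataFrame with numeric columns and a `Survived` target.

def side_score(side, tot):
    # weighted standard deviation of Survived on one side of a split
    if len(side) <= 1: return float('inf')   # ignore degenerate splits
    return side.std() * (len(side) / tot)

def score(subset, col, split_val):
    lhs = subset[col] <= split_val
    return (side_score(subset.loc[lhs, 'Survived'], len(subset)) +
            side_score(subset.loc[~lhs, 'Survived'], len(subset)))

def best_split(subset, col):
    # try every value of the column as a threshold; keep the best (lowest) score
    return min((score(subset, col, v), v) for v in subset[col].dropna().unique())

males, females = df[df.Sex == 1], df[df.Sex == 0]      # assumes Sex is coded 0/1
cols = ['Age', 'LogFare', 'Pclass', 'Embarked']        # assumed candidate columns
for name, subset in (('males', males), ('females', females)):
    print(name, min((best_split(subset, c), c) for c in cols))
```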

[00:03:08]

And let’s have a look, 38, 37, 38, 38, 38. Okay, so it’s age: were they greater than or less than six turns out to be, for the males, the biggest predictor of whether they were going to survive that shipwreck. And we can do the same thing for females. So females, there we go, no great surprise: Pclass. So whether they were in first class or not was the biggest predictor for females of whether they would survive the shipwreck. So that has now given us a decision tree.

Decision Tree

It is a series of binary splits which will gradually split up our data more and more such that in the end, in the leaf nodes as we call them, we will hopefully get as strong a prediction as possible about survival.

[00:04:17]

So we could just repeat this step for each of the four groups we’ve now created: males younger than six and six or older, and females in first class and everybody else. And we could do it again. And then we’d have eight groups. We could do that manually with another couple of lines of code. Or we can just use DecisionTreeClassifier, which is a class which does exactly that for us.

Decision Tree Classifier

So there’s no magic in here. It’s just doing what we’ve just described. And DecisionTreeClassifier comes from a library called scikit-learn. Scikit-learn is a fantastic library that focuses on kind of classical, non-deep-learning-ish machine learning methods like decision trees.

[00:05:09]

So to create the exact same decision tree, we can say please create a decision tree classifier with at most four leaf nodes. And one very nice thing it has is it can draw the tree for us. So here’s a tiny little draw tree function. And you can see here it’s going to first of all split on sex. Now it looks a bit weird to say sex is less than or equal to 0.5. But remember our binary characteristics are coded as 0 or 1. So that’s just how we, you know, easy way to say males versus females. And then here we’ve got for the females, what class are they in? And for the males, what age are they?
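In code, that’s roughly the following; `trn_xs` and `trn_y` are assumed names for the training features and the Survived target, and scikit-learn’s own `plot_tree` stands in for the lesson’s little draw-tree helper.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# the same four-leaf tree: split on sex, then class for females, age for males
m = DecisionTreeClassifier(max_leaf_nodes=4).fit(trn_xs, trn_y)

plt.figure(figsize=(10, 6))
plot_tree(m, feature_names=list(trn_xs.columns), filled=True)  # shows splits, Gini, samples
plt.show()
```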

[00:06:00]

And here’s our four leaf nodes. So for the females in first class, 116 of them survived and four of them didn’t. So very good idea to be a well-to-do woman on the Titanic. On the other hand, males, adults, 68 survived, 350 died. So very bad idea to be a male adult on the Titanic. So you can see you can kind of get a quick summary of what’s going on. And one of the reasons people tend to like decision trees, particularly for exploratory data analysis, is it allows us to get a quick picture of what are the key driving variables in this data set and how much do they kind of predict what was happening in the data.

[00:07:00]

Okay, so it’s found the same splits as us. And it’s got one additional piece of information we haven’t seen before, which is this thing called Gini.

Gini

Gini is just another way of measuring how good a split is. And I’ve put the code to calculate Gini here. Here’s how you can think of Gini. How likely is it that if you go into that sample and grab one item and then go in again and grab another item, how likely is it that you’re going to grab the same item each time? And so if the entire leaf node is just people who survived or just people who didn’t survive, the probability would be one. You get the same every time. If it was an exactly equal mix, the probability would be 0.5. So that’s why we just, yeah, that’s where this formula comes from in the binary case.
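As a sketch, that calculation for the binary survived/didn’t-survive case looks like the following (as he notes just below, the number actually shown in the tree is one minus the same-item probability); `df` and the `Survived` column are assumed from above.

```python
def gini(cond):
    act = df.loc[cond, 'Survived']
    # 1 minus the probability of grabbing two items of the same class
    return 1 - act.mean()**2 - (1 - act.mean())**2

gini(df.Sex == 0), gini(df.Sex == 1)   # e.g. the female and male groups
```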

[00:08:00]

And you can see it here, right? This group here is pretty much 50-50, so Gini is 0.5. Whereas this group here is nearly 100% in one class, so Gini is nearly zero. I had it backwards. It’s one minus. And I think I’ve written it backwards here as well, so I better fix that. So this decision tree, you know, we would expect it to be more accurate, so let’s calculate its mean absolute error.

Accuracy Score

And for the 1R, so just doing males versus females, what was our score? Here we go. 0.407. Actually, do we have an accuracy score somewhere? Here we are. 0.336. That was for log fare. And for sex, it was 0.215. Okay, so 0.215. So that was for the 1R version.

[00:09:06]

For the decision tree with four leaf nodes, 0.224. So it’s actually a little worse, right? And I think this just reflects the fact that this is such a small data set, and the one r version was so good, we haven’t really improved it that much. But not enough to really see it amongst the randomness of such a small validation set.

Minimum Samples per Leaf Node

We could go further, to a minimum of 50 samples per leaf node. So that means that in each of these, see how it says samples, which in this case is passengers on the Titanic, there are at least 50; here, for example, there are 67 people that were female, first class, and younger than 28. That’s how you define that. So this decision tree keeps building, keeps splitting, until it gets to a point where there’s going to be less than 50, at which point it stops splitting that leaf.

[00:10:05]

So you can see they’ve all got at least 50 samples. And so here’s the decision tree it builds. As you can see, it doesn’t have to be like constant depth, right? So this group here, which is males who had cheaper fares and who were older than 20, but younger than 32, actually younger than 24, and actually super cheap fares and so forth, right? So it keeps going down until we get to that group. So let’s try that decision tree. So that decision tree has a mean absolute error of 0.183. So not surprisingly, you know, once we get there, it’s starting to look like it’s a little bit better.
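Roughly, that tree and its error look like this; the train/validation variable names are assumptions.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_absolute_error

# keep splitting until a leaf would fall below 50 passengers
m = DecisionTreeClassifier(min_samples_leaf=50).fit(trn_xs, trn_y)
mean_absolute_error(val_y, m.predict(val_xs))   # about 0.183 in the lesson
```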

Kaggle Competition

So there’s a model and this is a Kaggle competition.

[00:11:00]

So therefore, we should submit it to the leaderboard. And, you know, one of the biggest mistakes I see, not just beginners, but every level of practitioner make on Kaggle, is not to submit to the leaderboard. They spend months making some perfect thing, right? But you actually want to see how you’re going, and you should try and submit something to the leaderboard every day, regardless of how rubbish it is, because you want to improve every day. So you want to keep iterating. So to submit something to the leaderboard, you generally have to provide a CSV file.

CSV File

And so we’re going to create a CSV file. And we’re going to apply the category codes to get the code for each one in our test set. We’re going to set the Survived column to our predictions.

[00:12:00]

And then we’re going to send that off to a CSV. And so, yeah, so I submitted that. And I got a score a little bit worse than most of our linear models and neural nets. But not terrible. It’s doing an okay job.
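A rough sketch of that submission step, with `tst_df` (the preprocessed test DataFrame, with its string columns already stored as pandas categoricals) and the categorical column list as assumptions:

```python
cats = ['Sex', 'Embarked']                                 # assumed categorical columns
tst_df[cats] = tst_df[cats].apply(lambda x: x.cat.codes)   # same integer coding as training
tst_df['Survived'] = m.predict(tst_df[trn_xs.columns]).astype(int)
tst_df[['PassengerId', 'Survived']].to_csv('sub.csv', index=False)
```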

Preprocessing

Now, one interesting thing for the decision tree is there was a lot less preprocessing to do. Did you notice that? We didn’t have to create any dummy variables for our categories. And, like, you certainly can create dummy variables, but you often don’t have to. So, for example, you know, for class, you know, it’s one, two, or three. You can just split on one, two, or three, you know? Even for, like, what was that thing? Like, the embarkation city code. We just convert them kind of arbitrarily to numbers, one, two, and three, and you can split on those numbers.

[00:13:03]

So, with random forest, not random forest, decision trees, yeah, you can generally get away with not doing stuff like dummy variables.

Dummy Variables

In fact, even taking the log of fare, we only did that to make our graph look better. But if you think about it, splitting on log fare less than 2.7 is exactly the same as splitting on fare less than e to the 2.7, you know, or whatever log base we used, I can’t remember. So, all that a decision tree cares about is the ordering of the data. And this is another reason that decision tree based approaches are fantastic, because they don’t care at all about outliers, you know, long tail distributions, categorical variables, whatever. You can throw it all in, and it’ll do a perfectly fine job. So, for tabular data, I would always start by using a decision tree based approach, and kind of create some baselines and so forth, because it’s really hard to mess it up.

[00:14:17]

And that’s important. So, yeah, so here, for example, is embarked, right? It was coded originally as the first letter of the city they embarked in, but we turned it into a categorical variable, and so pandas creates this vocab for us, this list of possible values. And if you look at the codes attribute, you can see the codes 0, 1, 2. So, S has become 2, C has become 0, and so forth, right? So, that’s how we’re converting the categories, the strings, into numbers that we can sort and group by.
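That coding step in pandas looks something like this:

```python
df['Embarked'] = df['Embarked'].astype('category')
df['Embarked'].cat.categories   # the "vocab": Index(['C', 'Q', 'S'], dtype='object')
df['Embarked'].cat.codes        # integer code per row: C -> 0, Q -> 1, S -> 2
```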

[00:15:05]

So, yeah, so if we wanted to split C into one group and Q and S into the other, we can just do, okay, less than or equal to 0.5. Now, of course, if we wanted to split C and S into one group and Q into the other, we would need two binary splits. First, C on one side and Q and S on the other, and then splitting Q and S into Q versus S, and then the C and S leaf nodes could get similar predictions. So, like, you do have, like, sometimes it can take a little bit more messing around, but most of the time, I find categorical variables work fine as numeric in decision tree based approaches, and as I say here, I tend to use dummy variables only if there’s, like, less than four levels. Now, what if we wanted to make this more accurate?

Growing the Tree Further

Could we grow the tree further?

[00:16:00]

I mean, we could, but you know, there’s only 50 samples in these leaves, right? It’s not really, you know, if I keep splitting it, the leaf nodes are going to have so little data that it’s not really going to make very useful predictions. Now, there are limitations to how accurate a decision tree can be.

Limitations of Decision Trees

So, what can we do? We can do something that’s actually very, I mean, I find it amazing and fascinating.

Bagging

It comes from a guy called Leo Breiman, and Leo Breiman came up with this idea called bagging, and here’s the basic idea of bagging. Let’s say we’ve got a model that’s not very good, because let’s say it’s a decision tree, it’s really small, we’ve hardly used any data for it, right?

[00:17:07]

It’s not very good. So, it’s got error. It’s got errors on predictions. It’s not a systematically biased error. It’s not always predicting too high or always predicting too low. I mean, decision trees, you know, on average will predict the average, right? But it has errors. So, what I could do is I could build another decision tree in some slightly different way that would have different splits, and it would also be not a great model, but predicts the correct thing on average, it’s not completely hopeless, and again, you know, some of the errors are a bit too high and some are a bit too low. And I could keep doing this. So, if I could keep building lots and lots of slightly different decision trees, I’m going to end up with, say, 100 different models, all of which are unbiased, all of which are better than nothing, and all of which have some errors a bit high, some a bit low, whatever.

[00:18:05]

So, what would happen if I averaged their predictions? Assuming that the models are not correlated with each other, then you’re going to end up with errors on either side of the correct prediction. Some are a bit high, some are a bit low. There’ll be this kind of distribution of errors, right? And the average of those errors will be zero. And so that means the average of the predictions of these multiple uncorrelated models, each of which is unbiased, will be the correct prediction, because they have an error of zero. And this is a mind-blowing insight. It says that if we can generate a whole bunch of uncorrelated, unbiased models, we can average them and get something better than any of the individual models, because the average of the error will be zero.
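A tiny numerical illustration of that claim, as a sketch only: 100 unbiased models with uncorrelated errors, averaged.

```python
import numpy as np

truth = np.zeros(1000)
# 100 models, each "predicting" the truth plus its own uncorrelated noise
model_preds = truth + np.random.normal(scale=0.5, size=(100, 1000))
print(np.abs(model_preds[0] - truth).mean())        # one model: roughly 0.4 mean abs error
print(np.abs(model_preds.mean(0) - truth).mean())   # average of 100: roughly 0.04
```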

[00:19:07]

So, all we need is a way to generate lots of models. Well, we already have a great way to build models, which is to create a decision tree. How do we create lots of them? How do we create lots of unbiased but different models?

Random Forest

Well, let’s just grab a different subset of the data each time. Let’s just grab at random half the rows and build a decision tree. And then grab another half the rows and build a decision tree. Grab another half the rows and build a decision tree. Each of those decision trees is going to be not great. It’s only using half the data. But it will be unbiased. It will be predicting the average on average. It will certainly be better than nothing, because it’s using some real data to try and create a real decision tree. They won’t be correlated with each other, because they’re each random subsets.

[00:20:01]

So, that meets all of our criteria for bagging. When you do this, you create something called a random forest. So, let’s create one in four lines of code. So, here is a function to create a decision tree. So, let’s say this is just the proportion of data. So, let’s say we put 75 percent of the data in each time. Or we could change it to 50 percent, whatever. So, this is the number of samples in this subset. I’ll call it n. And so, let’s at random choose n times the proportion we requested from the sample and build a decision tree from that. And so, now, let’s 100 times get a tree and stick them all in a list using a list comprehension. And now, let’s grab the predictions for each one of those trees.

[00:21:04]

And then let’s stack all those predictions up together and take their mean. And that is a random forest. And what do we get? One, two, three, four, five, six, seven, eight, that’s seven lines of code. So, random forests are very simple.
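Those few lines look roughly like this (variable names are assumptions about the notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_absolute_error

def get_tree(prop=0.75):
    n = len(trn_y)
    idxs = np.random.choice(n, int(n * prop), replace=False)   # a random subset of rows
    return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_xs.iloc[idxs], trn_y.iloc[idxs])

trees = [get_tree() for _ in range(100)]            # 100 slightly different trees
all_probs = [t.predict(val_xs) for t in trees]      # each tree's predictions
avg_probs = np.stack(all_probs).mean(0)             # ...averaged: a random forest
mean_absolute_error(val_y, avg_probs)
```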

Random Forest Classifier

This is a slight simplification. There’s one other difference that random forests do, which is when they build the decision tree, they also randomly select a subset of columns. And they select a different random subset of columns each time they do a split. And so, the idea is you kind of want it to be as random as possible, but also somewhat useful. So, we can do that by creating a random forest classifier.
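The scikit-learn version, which also does the random column sub-sampling at each split (controlled by `max_features`), is roughly:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

rf = RandomForestClassifier(100, min_samples_leaf=5)   # 100 trees, at least 5 samples per leaf
rf.fit(trn_xs, trn_y)
mean_absolute_error(val_y, rf.predict(val_xs))
```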

[00:22:03]

Say how many trees do we want? How many samples per leaf? And then fit does what we just did. And here’s our mean absolute error, which, again, it’s like not as good as our decision tree, but it’s still pretty good. And again, it’s such a small data set, it’s hard to tell if that means anything. And so, we can submit that to Kaggle. So, earlier on, I created a little function to submit to Kaggle. So, now I just create some predictions and I submit to Kaggle. And, yeah, looks like it gave nearly identical results to a single tree. Now to one of my favorite things about random forests.

Random Forest vs. Decision Tree

And I should say, in most real-world data sets of reasonable size, random forests basically always give you much better results than decision trees. This is just a small data set to show you what to do. One of my favorite things about random forests is we can do something quite cool with it.

[00:23:02]

Feature Importance Plot

What we can do is we can look at the underlying decision trees they create. So, we’ve now got a hundred decision trees. And we can see what columns did it find to split on. And so, it’d say here, okay, well, the first thing it split on was sex. And it improved the GINI from 0.47 to, now just take the weighted average of 0.38 and 0.31 weighted by the samples. So, that’s probably going to be, I don’t know, about 0.33. So, it’d say, okay, it’s like 0.14 improvement in GINI thanks to sex. And we can do that again, okay, well, then P class, you know, how much did that improve GINI? Again, we keep weighting it by the number of samples as well. Log fair, how much did that improve GINI? And we can keep track for each column of how much in total did they improve the GINI in this decision tree.

[00:24:02]

And then do that for every decision tree. And then add them up per column. And that gives you something called a feature importance plot. And here it is. And a feature importance plot tells you how important each feature is: how often did the trees pick it and how much did it improve the Gini when it did. And so, we can see from the feature importance plot that sex was the most important. And class was the second most important and everything else was a long way back. And this is another reason, by the way, our random forest isn’t really particularly helpful, because it’s just such an easy split to do, right? Basically, all that matters is, you know, what class you’re in and whether you’re male or female. And these feature importance plots, remember, because they’re built on random forests and random forests don’t really care about the distribution of your data and they can handle categorical variables and stuff like that, that means that, for basically any tabular data set you have, you can just plot this right away.
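A sketch of that plot from the fitted forest `rf`:

```python
import pandas as pd

fi = pd.DataFrame(dict(cols=trn_xs.columns, imp=rf.feature_importances_))
fi.sort_values('imp').plot('cols', 'imp', 'barh')   # Sex and Pclass dominate here
```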

[00:25:20]

Feature Importance Plots for Tabular Data

And random forests, you know, for most data sets only take a few seconds to train. You know, really at most a minute or two. And so, if you’ve got a big data set and, you know, hundreds of columns, do this first. And find the 30 columns that might matter. It’s such a helpful thing to do. So, I’ve done that, for example, I did some work in credit scoring. So, we’re trying to find out which things would predict who’s going to default on a loan. And I was given something like 7,000 columns from the database.

[00:26:01]

And I put it straight into a random forest and found, I think, there was about 30 columns that seemed kind of interesting. I did that like two hours after I started the job. And I went to the head of marketing and the head of risk and I told them, here’s the columns I think that we should focus on. And they were like, oh, my God, we just finished a two-year consulting project with one of the big consultants. Paid them millions of dollars, and they came up with a subset of these. There are other things that you can do with random forests along this path.

Chapter 9: Auction Prices of Heavy Industrial Equipment

I’ll touch on them briefly. And specifically, I’m going to look at chapter 8 of the book. Which goes into this in a lot more detail.

[00:27:00]

And particularly interestingly, chapter 8 of the book uses a much bigger and more interesting data set, which is auction prices of heavy industrial equipment. I mean, it’s less interesting historically, but more interesting numerically. And so, some of the things I did there on this data set, so this isn’t from the data set, this is from the scikit-learn documentation.

Number of Estimators

They looked at how, as you increase the number of estimators, so the number of trees, how much does the accuracy improve? So, I then did the same thing on our data set. So, I actually just added more and more trees, up to 40. And you can see that, basically as predicted by that kind of initial bit of hand-wavy theory I gave you, you’d expect the more trees, the lower the error, because the more things you’re averaging.
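Re-using the hand-rolled `trees` list from before, the experiment is roughly:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

all_probs = np.stack([t.predict(val_xs) for t in trees])
errs = [mean_absolute_error(val_y, all_probs[:i + 1].mean(0)) for i in range(len(trees))]
plt.plot(errs)   # error drops, then flattens out, as more trees are averaged
```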

[00:28:03]

And that’s exactly what we find, the accuracy improves as we have more trees. John, what’s up?

[John]

Increasing the Number of Trees

Victor, you might have just answered his question actually as he typed it. But he’s asking on the same theme, the number of trees in a random forest. Does increasing the number of trees always translate to better error?

[Jeremy Howard]

Yes, it does, always. I mean, tiny bumps, right? But yeah, once you smooth it out. But decreasing returns, and if you end up productionizing a random forest, of course, every one of these trees, you have to, you know, go through at inference time. So, it’s not that there’s no cost. I mean, having said that, zipping through a binary tree is the kind of thing you can really do fast.

[00:29:02]

In fact, it’s quite easy to literally spit out C++ code with a bunch of if statements and compile it and get extremely fast performance. I don’t often use more than 100 trees. This is a rule of thumb. Is that the only one, John? Okay. So, then there’s another interesting feature of random forests, which is remember how in our example, we trained with 75% of the data on each tree. So, that means for each tree, there was 25% of the data we didn’t train on.

Out of Bag Error (OOB Error)

Now, this actually means if you don’t have much data, in some situations, you can get away with not having a validation set. And the reason why is because for each tree, we can pick the 25% of rows that weren’t in that tree and see how accurate that tree was on those rows.

[00:30:11]

And we can average for each row their accuracy on all of the trees in which they were not part of the training. And that is called the out of bag error or OOB error. And this is built in also to sklearn. You can ask for an OOB prediction. John?
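In scikit-learn that’s just a flag, something like:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(100, min_samples_leaf=5, oob_score=True)
rf.fit(trn_xs, trn_y)
rf.oob_score_                # accuracy measured entirely on out-of-bag rows
rf.oob_decision_function_    # per-row class probabilities from trees that never saw that row
```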

[John]

Just before we move on, Zakia has a question about bagging.

Bagging vs. Deep Learning

So, we know that bagging is powerful as an ensemble approach to machine learning. Would it be advisable to try out bagging then first when approaching a particular, say, tabular task before deep learning? So, that’s the first part of the question.

[00:31:01]

And the second part is, could we create a bagging model which includes fast AI deep learning models?

[Jeremy Howard]

Yes. Absolutely. So, to be clear, you know, bagging is kind of like a meta method. It’s not a prediction method or a method of modeling itself. It’s just a method of combining other models. So, random forests in particular, as a particular approach to bagging — you know, I would personally probably always start a tabular project with a random forest, because they’re nearly impossible to mess up and they give good insight and they give a good base case. But, yeah, your question then about can you bag other models is a very interesting one.

Bagging Other Models

And the answer is you absolutely can. And people very rarely do. But we will. We will quite soon.

[00:32:02]

Maybe even today. So, you know, you might be getting the impression I’m a bit of a fan of random forests. And before, you know, people thought of me as the deep learning guy, people thought of me as the random forests guy. I used to go on about random forests all the time. And one of the reasons I’m so enthused about them isn’t just that they’re very accurate, or that they’re very hard to mess up and require very little preprocessing, but that they give you a lot of quick and easy insight.

Random Forest Insights

And specifically, these are the five things which I think that we’re interested in and all of which are things that random forests are good at. They will tell us how confident are we in our predictions on some particular row. So when somebody, you know, when we’re giving a loan to somebody, we don’t necessarily just want to know how likely are they to repay, but we’d also like to know how confident are we that we know.

[00:33:04]

Because if we’re like, well, we think they’ll repay, but we’re not confident of that, we would probably want to give them less of a loan. And another thing that’s very important is when we’re then making a prediction. So again, for example, for credit, let’s say you rejected that person’s loan. Why? A random forest will tell us what the reason was that we made a prediction, and you’ll see how we do all these things. Which columns are the strongest predictors? You’ve already seen that one, right? That’s the feature importance plot. Which columns are effectively redundant with each other, i.e. basically highly correlated with each other. And then one of the most important ones is, as you vary a column, how does it vary the predictions? So for example, in your credit model, how does your prediction of risk vary as you vary, well, something that probably the regulator would want to know might be some protected variable, like, you know, race or some sociodemographic characteristics that you’re not allowed to use in your model.

[00:34:18]

So they might check things like that. For the first thing, how confident are we in our predictions using a particular row of data?

Prediction Confidence

There’s a really simple thing we can do, which is, remember how, when we calculated our predictions manually, we stacked up the predictions together and took their mean? Well, what if you took their standard deviation instead? What does that mean? So if you stack up your predictions and take their standard deviation, if that standard deviation is high, that means all of the trees are predicting something different. And that suggests that we don’t really know what we’re doing. And so that would happen if different subsets of the data end up giving completely different trees for this particular row.
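As a sketch, using the forest’s individual trees:

```python
import numpy as np

tree_preds = np.stack([t.predict(val_xs.values) for t in rf.estimators_])
tree_preds.std(0)   # high std for a row means the trees disagree, i.e. low confidence
```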

[00:35:07]

So there’s, like, a really simple thing you can do to get a sense of your prediction confidence. Okay. Feature importance we’ve already discussed.

Feature Importance

After I do feature importance, you know, like I said, when I had the, what, 7,000 or so columns, I got rid of, like, all but 30. That doesn’t tend to improve the predictions of your random forest very much, if at all. But it certainly helps, like, you know, kind of logistically thinking about cleaning up the data. You can focus on cleaning those 30 columns, stuff like that. So I tend to remove the low importance variables. I’m going to skip over this bit about removing redundant features, because it’s a little bit outside what we’re talking about, but definitely check it out in the book.

Redundant Features

Something called a dendrogram. Okay. But what I do want to mention is partial dependence.

[00:36:01]

Partial Dependence Plot

This is the thing which says, what is the relationship between a column and the dependent variable? And so this is something called a partial dependence plot. Now, this one’s actually not specific to random forests. A partial dependence plot is something you can do to basically any machine learning model. Um, let’s first of all look at one, and then talk about how we make it. So in this data set, we’re looking at the relationship, we’re looking at, uh, the sale price at auction of heavy industrial equipment, like bulldozers. This is specifically the Blue Book for Bulldozers Kaggle competition. And a partial dependence plot between the year that the bulldozer or whatever was made, and the price it was sold for, this is actually the log price, is that it goes up. More recent bulldozers, more recently made bulldozers are more expensive.

[00:37:00]

And as you go back, back to older and older bulldozers, they’re less and less expensive to a point. And maybe these ones are some old classic bulldozers you pay a bit extra for. Now, you might think that you could easily create this plot by simply looking at your data at each year and taking the average sale price. But that doesn’t really work very well. I mean, it kind of does, but it kind of doesn’t. Let me give an example. It turns out that one of the biggest predictors of sale price for industrial equipment is whether it has air conditioning. Um, and so air conditioning is, you know, it’s an expensive thing to add, and it makes the equipment more expensive to buy. And most things didn’t have air conditioning back in the 60s and 70s, and most of them do now. So if you plot the relationship between year made and price, you’re actually going to be seeing a whole bunch of when, you know, how popular was air conditioning, right?

[00:38:02]

So you get this cross correlation going on. But we just want to know, what’s just the impact of the year it was made, all else being equal? So there’s actually a really easy way to do that, which is we take our data set and leave it exactly as it is, we just use the training data set, but we take every single row and, for the year made column, we set it to 1950. And so then we predict, for every row, what would the sale price of that have been if it was made in 1950. And then we repeat it for 1951, and repeat it for 1952, and so forth. And then we plot the averages. And that does exactly what I just said. Remember I said the special words, all else being equal: this is setting everything else equal. Everything else is the data as it actually occurred, and we’re only varying year made. And that’s what a partial dependence plot is. That works just as well for deep learning or gradient boosting trees or logistic regressions or whatever.
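A hand-rolled sketch of that, assuming a fitted model `m`, a feature table `xs`, and a `YearMade` column as in the bulldozers data:

```python
import matplotlib.pyplot as plt

years = range(1950, 2012)
avgs = []
for yr in years:
    xs_pd = xs.copy()
    xs_pd['YearMade'] = yr                # force every row to the same year
    avgs.append(m.predict(xs_pd).mean())  # average predicted (log) price
plt.plot(list(years), avgs)               # the partial dependence curve
```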

[00:39:06]

It’s a really cool thing you can do. Um, and you can do more than one column at a time. You know, you can do two-way partial dependence plots, for example. Another one. Um, okay. So then another one I mentioned was, can you describe why a particular prediction was made?

Describing Why a Prediction Was Made

So how did you decide for this particular row to predict this particular value? And, um, this is actually pretty easy to do. There’s a thing called tree interpreter, but you could easily create this in about half a dozen lines of code. Um, all we do is, um, we’re saying, okay, uh, this customer’s come in, they’ve asked for a loan. We put all of their data through the random forest. It spat out a prediction.

[00:40:00]

We can actually have a look and say, okay, well, in tree number one, what’s the path that went down through the tree to get to the leaf node? And we can say, oh, well, first of all, it looked at sex and then it looked at postcode and then it looked at income. And so we can see exactly in tree number one, which variables were used and what was the change in Gini for each one. And then we can do the same in tree two, same in tree three, same in tree four. Does this sound familiar? It’s basically the same as our feature importance plot, right? But it’s just for this one row of data. And so that will tell you basically the feature importances for that one particular prediction. And so then we can plot them like this. So for example, this is an example of an auction price prediction. And according to this plot — oh, these are just the changes, so I don’t actually know what the price is; this is how much each one impacted the price.
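If I have the treeinterpreter API right, the whole thing is about this much code; `m` is a fitted forest and `val_xs` its feature table, both assumed names:

```python
from treeinterpreter import treeinterpreter

row = val_xs.iloc[:1]                                            # one row, as a DataFrame
prediction, bias, contributions = treeinterpreter.predict(m, row.values)
# prediction ~= bias + contributions.sum(): the bias is the baseline, and each
# contribution is one column pushing the prediction up or down, which is what
# the red/green plot shows
```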

[00:41:06]

So year made, I guess this must’ve been an older tractor. It caused a prediction of the price to go down, but then it must’ve been a larger machine. The product size caused it to go up. Coupler system made it go up. Model ID made it go up and so forth. Right? So you can see the red says this made, this made our prediction go down. Green made our prediction go up. And so overall you can see which things had the biggest impact on the prediction and what was the direction for each one. So it’s basically a feature importance plot, but just for a single row, for a single row. Uh, any questions, John?

[John]

Excluding a Tree from a Forest

Yeah, there are a couple that have, that have sort of queued up. This is a, this is a good spot to, um, to jump to them. Um, so first of all, Andrew is asking, uh, jumping back to the, um, the OOB error, would you ever exclude a tree from a forest if it had a, if it had a bad out of bag error?

[00:42:11]

Like if you, if you had a, I guess if you had a particularly bad tree in your ensemble, might you just drop it?

[Jeremy Howard]

Would you delete a tree that was not doing its thing? It’s not playing its part. No, you wouldn’t. Um, if you start deleting trees, then you are no longer having an unbiased prediction of the dependent variable. You are biasing it by making a choice. So even the bad ones will be improving the quality of the overall average.

[John]

All right. Thank you. Um, Zakia followed up with the question about bagging and we’re just sort of going, you know, layers and layers here.

Ensembles of Bagged Models

Uh, you know, we could go on and create ensembles of bagged models. Um, and you know, is it reasonable to assume that they would continue to improve?

[00:43:03]

[Jeremy Howard]

So that’s not going to make much difference, right? If they’re all like, you could take your a hundred trees, split them into groups of 10, create 10 bagged ensembles, and then average those. But the average of an average is the same as the average. Um, you could like have a wider range of other kinds of models. You could have like neural nets trained on different subsets as well. But again, it’s just the average of an average will still give you the average.

[John]

Right. So there’s not a lot of value in, in kind of structuring the ensemble. You just.

[Jeremy Howard]

I mean, some, some ensembles you can structure, but, but not bagging. Bagging’s the simplest one. It’s the one I mainly use. Um, there are more sophisticated approaches, but this one is nice and easy.

[John]

All right. And there’s, there’s one that, um, is a bit specific and it’s referencing content you haven’t covered, but we’re here now. So, um, and it’s on explainability.

Explainability Techniques

Uh, so feature importance of a random forest model sometimes has different results when you compare it to other explainability techniques, um, like SHAP, S-H-A-P, or LIME.

[00:44:07]

Um, and we haven’t covered these in the course, but, um, Amir is just curious if you’ve got any thoughts on which is more accurate or reliable, uh, random forest feature importance or other techniques.

[Jeremy Howard]

Um, I, I would lean towards more immediately trusting random forest feature importances over other techniques on the whole on the basis that it’s very hard to mess up a random forest. Um, so yeah, I feel like pretty confident that a random forest feature importance is going to be pretty reasonable. Um, as long as this is the kind of data which a random forest is likely to be pretty good at, you know, doing, you know, if it’s like a computer vision model, random forests aren’t particularly good at that.

[00:45:02]

And so one of the things that Breiman talked about a lot was explainability. And he’s got a great essay about the two cultures of statistics, in which he talks about, I guess what we’d nowadays call kind of like data scientists and machine learning folks versus classic statisticians. And, um, he was definitely a data scientist well before that label existed. And he pointed out, yeah, you know, first and foremost, you need a model that’s accurate. It needs to make good predictions. A model that makes bad predictions will also be bad for making explanations because it doesn’t actually know what’s going on. Um, so if, you know, if you’ve got a deep learning model that’s far more accurate than your random forest, then, you know, explainability methods from the deep learning model will probably be more useful because it’s explaining a model that’s actually correct. All right. Let’s take a, um, 10 minute break and we’ll come back at five past seven.

[00:46:09]

Chapter 9: Auction Prices of Heavy Industrial Equipment

Uh, welcome back. Um, one person pointed out, I noticed, uh, I got the chapter wrong. It’s chapter nine, not chapter eight in the book. I guess I can’t read. Um, somebody asked, uh, during the break, um, about overfitting, um, can you overfit a random forest?

Overfitting a Random Forest

Basically? No, not really. Um, adding more trees will make it more accurate. Um, it kind of asymptotes, so you can’t make it infinitely accurate by using infinite trees. But it’s certainly, you know, adding more trees won’t make it worse. Um, if you don’t have enough trees and you let the trees grow very deep, that could overfit. Um, so you just have to make sure you have enough trees.

[00:47:04]

Um, Radek told me during the break about an experiment he did, which is similar to something I’ve done, which is adding lots and lots of randomly generated columns to a dataset to try to break the random forest.

Adding Randomly Generated Columns

And, uh, if you try it, it basically doesn’t work. It’s like, it’s really hard to confuse a random forest by giving it lots of meaningless data. It does an amazingly good job of picking out, uh, the useful stuff. As I said, you know, I had 30 useful columns out of 7,000 and it found them perfectly well. And often, you know, when you find those 30 columns, you know, you could go to, you know, I was doing consulting at the time, go back to the client and say like, tell me more about these columns. And they’d say like, oh, well, that one there, we’ve actually got a better version of that now.

[00:48:02]

There’s a new system, you know, we should grab that. And, oh, this column, actually that was because of this thing that happened last year, but we don’t do it anymore. Or, you know, like you can really have this kind of discussion about the stuff you’ve zoomed into.

Interactions

Um, you know, there are other things that you have to think about with lots of kinds of models, like particularly regression models, things like interactions. Um, you don’t have to worry about that with random forests, like because you split on one column and then split on another column, you get interactions for free as well. Um, normalization you don’t have to worry about, you know, you don’t have to have normally distributed columns. Um, so yeah, definitely worth a try.

Gradient Boosting

Um, now something I haven’t gone into, um, is gradient boosting.

[00:49:08]

Um, but if you go to explained.ai, you’ll see that my friend Terence and I have a three-part series about gradient boosting, uh, including pictures of golf made by Terence. Um, but to explain: gradient boosting is a lot like random forests, but rather than fitting a tree again and again and again on different random subsets of the data, what we do instead is fit very, very, very small trees, with hardly any splits. And we then say, okay, well, what’s the error? So, you know, imagine the simplest tree would be our 1R rule tree of male versus female, say. And then you take what’s called the residual.

[00:50:00]

That’s the difference between the prediction and the actual — it’s the error. And then you create another very small tree, which attempts to predict that residual. And then you create another very small tree, which tries to predict the error from that, and so forth. Each one is predicting the residual from all of the previous ones. And so then, to calculate a prediction, rather than taking the average of all the trees, you take the sum of all the trees, because each one has predicted the difference between the actual and all of the previous trees.
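A toy sketch of that loop for a regression target (real gradient boosting libraries also shrink each tree by a learning rate, which this leaves out); the variable names are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

preds_so_far = np.zeros(len(trn_y), dtype=float)
little_trees = []
for _ in range(50):
    residual = trn_y - preds_so_far                                # what is still unexplained
    t = DecisionTreeRegressor(max_depth=1).fit(trn_xs, residual)   # a tiny "stump"
    little_trees.append(t)
    preds_so_far += t.predict(trn_xs)

val_pred = np.sum([t.predict(val_xs) for t in little_trees], axis=0)   # sum, not average
```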

Boosting vs. Bagging

And that’s called boosting versus bagging. So boosting and bagging are two kind of meta ensembling techniques. And when bagging is applied to trees, it’s called a random forest. And when boosting is applied to trees, it’s called a gradient boosting machine or gradient boosted decision tree. Gradient boosting is generally speaking more accurate than random forests, but you can absolutely overfit.

[00:51:09]

Gradient Boosting Machine (GBM)

And so therefore, it’s not necessarily my first go-to thing. Having said that, there are ways to avoid overfitting. But yeah, it’s just, because it’s breakable, it’s not my first choice. But yeah, check out our stuff here if you’re interested. And there is stuff which largely automates the process. There’s lots of hyperparameters you have to select. People generally just try every combination of hyperparameters. And in the end, you generally should be able to get a more accurate gradient boosting model than random forest. But not necessarily by much.

Kaggle Notebook on Random Forests

Okay. So that was the Kaggle notebook on random forests, how random forests really work.

[00:52:14]

So what we’ve been doing is having this daily walkthrough where me and I don’t know how many, 20 or 30 folks get together on a Zoom call and chat about, you know, getting through the course and setting up machines and stuff like that. And, you know, we’ve been trying to kind of practice things along the way.

Kaggle Competition: Paddy Disease Classification

And so a couple of weeks ago, I wanted to show, like, what does it look like to pick a Kaggle competition and just, like, do the normal, sensible kind of mechanical steps that you would do for any computer vision model.

[00:53:05]

And so the competition I picked was Paddy Disease Classification, which is about recognizing rice diseases in rice paddies. And yeah, I spent, I don’t know, a couple of hours or three, I can’t remember, a few hours throwing together something. And I found that I was number one on the leaderboard. And I thought, oh, that’s interesting. Like, because you never quite have a sense of how well these things work. And then I thought, well, there’s all these other things we should be doing as well. And I tried three more things. And each time I tried another thing, I got further ahead at the top of the leaderboard. So I thought it’d be cool to take you through the process.

[00:54:00]

I’m going to do it reasonably quickly because the walkthroughs are all available for you to see the entire thing in, you know, seven hours of detail or however long, probably six or seven hours of conversations. But I want to kind of take you through the basic process that I went through. So since I’ve been starting to do more stuff on Kaggle, you know, I realize there’s some kind of menial steps I have to do each time, particularly because I like to run stuff on my own machine and then kind of upload it to Kaggle.

Fast Kaggle Module

So to make my life easier, I created a little module called fastkaggle, which you’ll see in my notebooks from now on, which you can download from pip or conda. And as you’ll see, it makes some things a bit easier.

[00:55:01]

For example, downloading the data for the paddy disease classification competition: if you just run setup_comp and pass in the name of the competition, if you are on Kaggle, it will return a path to that competition data that’s already on Kaggle. If you are not on Kaggle and you haven’t downloaded it, it will download and unzip the data for you. If you’re not on Kaggle and you have downloaded and unzipped the data, it will return a path to the one that you’ve already downloaded. Also, if you are on Kaggle, you can ask it to make sure that pip packages are installed that might not be up to date otherwise. So basically, this one line of code now gets us all set up and ready to go. So this path — I ran this particular one on my own machine, so it’s downloaded and unzipped the data. I’ve also got links to the six walkthroughs so far. These are the videos.
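If I have the fastkaggle API right, that one line is roughly:

```python
from fastkaggle import setup_comp

comp = 'paddy-disease-classification'
# downloads and unzips locally, or just returns the path when running on Kaggle;
# `install` lists pip packages to make sure are up to date there
path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
```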

[00:56:01]

And here’s my result after these four attempts, plus a few fiddling around at the start. So the overall approach is, and this is not just to a Kaggle competition, right?

Kaggle Competitions: Testing Models

The reason I like looking at Kaggle competitions is you can’t hide from the truth in a Kaggle competition. When you’re working on some work project or something, you might be able to convince yourself and everybody around you that you’ve done a fantastic job of not overfitting and that your model’s better than what anybody else could have made or whatever else. But the brutal assessment of the private leaderboard will tell you the truth. Is your model actually predicting things correctly? And is it overfit?

[00:57:02]

Until you’ve been through that process, you know, you’re never going to know. And a lot of people don’t go through that process because at some level they don’t want to know. But it’s okay. You don’t have to put your own name there. I always did, right from the very first one. If I was going to screw up royally, I wanted to have the pressure on myself of people seeing me in last place. But it’s fine. You can do it honestly. And you’ll actually find as you improve, you’ll have so much self-confidence. And the stuff we do in a Kaggle competition is indeed a subset of the things we need to in real life. But it’s an important subset.

Structuring Code and Analysis

You know, building a model that actually predicts things correctly and doesn’t overfit is important. And furthermore, structuring your code and analysis in such a way that you can keep improving over a three-month period without gradually getting into more and more of a tangled mess of impossible-to-understand code and having no idea what untitled copy 13 was and why it was better than 25.

[00:58:16]

This is all stuff you want to be practicing. Ideally, well away from customers or whatever, you know, before you’ve kind of figured things out. So the things I talk about here about doing things well in this Kaggle competition should work, you know, in other settings as well. And so these are the two focuses that I recommend.

Validation Set

Get a really good validation set together. We’ve talked about that before, right? And in a Kaggle competition, that’s like, it’s very rare to see people do well in a Kaggle competition who don’t have a good validation set. Sometimes that’s easy. In this competition, actually, it is easy because the test set seems to be a random sample, but most of the time it’s not actually, I would say.

[00:59:09]

And then how quickly can you iterate?

Iterating Quickly

How quickly can you try things and find out what worked? So obviously you need a good validation set, otherwise it’s impossible to iterate. And so quickly iterating means not saying, what is the biggest, you know, open AI takes four months on a hundred TPUs model that I can train. It’s what can I do that’s going to train in a minute or so? And will quickly give me a sense of like, well, I could try this. I could try that. What thing’s going to work? And then try, you know, 80 things. It also doesn’t mean that saying like, oh, I heard this, this amazing new Bayesian hyperparameter tuning approach. I’m going to spend three months implementing that because that’s going to like give you one thing.

[01:00:02]

But actually do well in these competitions or in machine learning in general, you actually have to do everything reasonably well.

Doing Everything Reasonably Well

And doing just one thing really well will still put you somewhere about last place. Um, so I actually saw that a couple of years ago, Aussie guy who’s very, very distinguished machine learning practitioner, uh, actually put together a team, entered the Kaggle competition and literally came in last place because they spent the entire three months trying to build this amazing new fancy thing and never actually, never actually iterated. Um, if you iterate, I guarantee you won’t be in last place. Okay.

Setting a Random Seed

So here’s how we can grab our data with fast Kaggle and it gives us, tells us what path it’s in.

[01:01:01]

Um, and then I set my random seed. Um, and I only do this because I’m creating a notebook to share. You know, when I share a notebook, I like to be able to say, as you can see, this is 0.83, blah, blah, blah. Right. And know that when you see it, it’ll be 0.83 as well. But otherwise, when I’m doing stuff, I would never set a random seed. I want to be able to run things multiple times and see how much it changes each time. Cause that’ll give me a sense of whether the modifications I’m making are changing things because they’re improving them or making them worse, or whether it’s just random variation. So if you always set a random seed, that’s a bad idea, because you won’t be able to see the random variation. So this is just here for presenting a notebook.

Data Exploration

Okay. So the data, um, they’ve given us as usual, they’ve got a sample submission. They’ve got some test set images. They’ve got some training set images, a CSV file about the training set.

[01:02:00]

Um, and then these other two you can ignore cause I created them. So let’s grab a path to train images. And so do you remember get_image_files? So that gets us a list of the file names of all the images here recursively. Uh, so we could just grab the first one and take a look. So it’s 480 by 640.

Pillow Image

Now we’ve got to be careful. Um, this is a Pillow image, a Python Imaging Library image. Um, in the imaging world, they generally say columns by rows. In the array slash tensor world, we always say rows by columns. So if you ask PyTorch what the size of this is, it’ll say 640 by 480. And I guarantee at some point this is going to bite you. So try to recognize it now. Okay. So they’re kind of taller than they are wide. At least this one is taller than it is wide. Um, so I’d actually like to know, are they all this size?

[01:03:00]

Cause it’s really helpful if they all are all the same size or at least similar.

Decoding a JPEG

Um, believe it or not, the amount of time it takes to decode a JPEG is actually quite significant. Um, and so figuring out what size these things are is actually going to be pretty slow. Um, but my fast core library has a parallel sub module, which can basically do anything that you can do in Python.

Parallel Submodule

It can do it in parallel. So in this case, we wanted to create a Pillow image and get its size. So if we create a function that does that and pass it to parallel, passing in the function and the list of files, it does it in parallel. And that actually runs pretty fast. And so here is the answer. I don’t know how this happened. 10,403 images are indeed 480 by 640 and four of them aren’t. So basically what this says to me is that we should pre-process them or, you know, at some point process them so that they’re probably all 480 by 640 or all basically the same size.
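Something like this, with the folder name taken from the paddy competition’s layout:

```python
import pandas as pd
from fastai.vision.all import PILImage, get_image_files
from fastcore.parallel import parallel

files = get_image_files(path/'train_images')

def f(o): return PILImage.create(o).size    # (width, height) of one image

sizes = parallel(f, files, n_workers=8)     # decode the JPEGs across several processes
pd.Series(sizes).value_counts()             # nearly everything is (480, 640)
```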

[01:04:02]

Preprocessing Images

We’ll pretend they’re all this size. But we can’t not do some initial resizing, otherwise this is going to screw things up.

Resizing Images

So like probably the easiest way to do things, the most common way to do things, is to either squish or crop every image to be square. So squishing is when you just, in this case, squish the aspect ratio down as opposed to cropping a random section out. So if we call Resize with squish, it will squish it down. And so this is 480 by 480 — square. So this is what it’s going to do to all of the images first on the CPU. That allows them to be all batched together into a single mini batch. Everything in a mini batch has to be the same shape, otherwise the GPU won’t like it.

[01:05:03]

Then that mini batch is put through data augmentation, and it will grab a random subset of the image and make it a 128 by 128 pixel image.

Data Augmentation

And here’s what that looks like. Here’s our data. So show_batch works for pretty much everything, not just in the fastai library, but even for things like fastaudio, which are kind of community-based things. You should be able to use show_batch and see or hear or whatever what your data looks like. I don’t know anything about rice disease, but apparently these are various rice diseases and this is what they look like.
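Roughly, the DataLoaders being shown are:

```python
from fastai.vision.all import ImageDataLoaders, Resize, aug_transforms

trn_path = path/'train_images'
dls = ImageDataLoaders.from_folder(
    trn_path, valid_pct=0.2, seed=42,
    item_tfms=Resize(480, method='squish'),                # per-image squish to 480x480, on the CPU
    batch_tfms=aug_transforms(size=128, min_scale=0.75))   # per-batch augmentation down to 128px
dls.show_batch(max_n=6)
```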

Model Building

So I jump into creating models much more quickly than most people, because I find models are a great way to understand my data, as we’ve seen before.

[01:06:04]

So I basically build a model as soon as I can.

Iterating Quickly

And I want to create a model that’s going to let me iterate quickly. So that means that I’m going to need a model that can train quickly.

Best Vision Models for Fine-Tuning

So Thomas Capelle and I recently did this big project, the best vision models for fine-tuning, where we looked at nearly 100 different architectures from Ross Wightman’s timm library, the PyTorch image models library, and looked at which ones could we fine-tune, which ones had the best transfer learning results. And we tried two different datasets, very different datasets. One is the pets dataset that we’ve seen before.

[01:07:00]

So trying to predict what breed of pet is from 37 different breeds. And the other was a satellite imagery dataset called planet. So very, very different datasets in terms of what they contain and also very different sizes. The planet one’s a lot smaller, the pets one’s a lot bigger. And so the main things we measured were how much memory did it use, how accurate was it, and how long did it take to fit. And then I created this score, which combines the fit time and error rate together. And so this is a really useful table for picking a model.

Picking a Model

And now in this case, I want to pick something that’s really fast. And there’s one clear winner on speed, which is ResNet26D. And so its error rate was 6% versus the best, which was like 4.1%. So okay, it’s not amazingly accurate, but it’s still pretty good.

[01:08:03]

And it’s going to be really fast. So that’s why I picked ResNet26D.

ResNet26D

A lot of people think that when they do deep learning, they’re going to spend all of their time learning about exactly how a ResNet26D is made and convolutions and ResNet blocks and transformers and blah, blah, blah. We will cover all that stuff in part two and a little bit of it next week. But it almost never matters, right? It’s just a function, right? And what matters is the inputs to it and the outputs to it and how fast it is and how accurate it is. So let’s create a learner with a ResNet26D from our data loaders. And let’s run lr_find.

LRFind

So lr_find will put through one mini batch at a time, starting at a very, very low learning rate, and gradually increase the learning rate and track the loss.

[01:09:05]

And initially, the loss won’t improve because the learning rate is so small, it doesn’t really do anything. And at some point, the learning rate is high enough that the loss will start coming down. Then at some other point, the learning rate is so high that it’s going to start jumping past the answer and it’s going to get worse. And so somewhere around here is a learning rate we’d want to pick.

Learning Rate

We’ve got a couple of different ways of making suggestions. I generally ignore them because these suggestions are specifically designed to be conservative. They’re a bit lower than perhaps optimal in order to make sure we don’t recommend something that totally screws up. But I kind of like to say like, well, how far right can I go and still see it clearly really improving quickly? And so I’d pick somewhere around 0.01 for this.

[01:10:00]

So I can now fine tune our model with a learning rate of 0.01, three epochs.
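In code, that’s a single call, continuing from the learner sketched above:

```python
# Fine-tune for three epochs at the learning rate chosen from the plot above.
learn.fine_tune(3, 0.01)
```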

Fine-Tuning

So look, the whole thing took a minute. That’s what we want, right? We want to be able to iterate rapidly, just a minute or so. So that’s enough time for me to go and, you know, grab a glass of water or do some reading. I’m not going to get too distracted. And what do we do before we submit?

Submitting to Kaggle

Nothing. We submit as soon as we can. OK, let’s get our submission in.

Creating a Submission

So we’ve got a model. Let’s get it in. So we read in our CSV file of the sample submission. And the CSV file basically looks like we’re going to have to have a list of the image file names in order and then a column of labels. So we can get all the image files in the test images folder like so, and we can sort them. And so what we want now is a data loader which is exactly like the data loader we used to train the model, except pointing at the test set.
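Roughly, those first steps might look like this; `path` and the file names are assumptions based on the usual Kaggle competition layout:

```python
import pandas as pd
from fastai.vision.all import get_image_files

# The sample submission gives us the expected row order and the label column.
ss = pd.read_csv(path/'sample_submission.csv')

# Collect the test images and sort them so they line up with the submission rows.
tst_files = get_image_files(path/'test_images').sorted()
```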

[01:11:09]

Test Data Loader

We want to use exactly the same transformations. So there’s actually a dls.test_dl method which does that. You just pass in the new set of items, i.e. the test set files. So this is a data loader which we can use for our test set. A test data loader has a key difference from a normal data loader, which is that it does not have any labels. So that’s the key distinction. So we can get the predictions from our learner, passing in that data loader.
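A sketch of that call, reusing `tst_files` from above:

```python
# Build a test DataLoader that applies the same transforms as training,
# but has no labels.
tst_dl = learn.dls.test_dl(tst_files)
```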

Getting Predictions

And in the case of a classification problem, you can also ask for them to be decoded. Decoded means rather than just get returned the probability of every rice disease, every class, it will tell you what is the index of the most probable rice disease.

[01:12:07]

Decoded Predictions

That’s what decoded means. So that will return probabilities, targets which obviously will be empty because it’s a test set so throw them away and those decoded indexes which look like this. Numbers from 0 to 9 because there’s 10 possible rice diseases. The Kaggle submission does not expect numbers from 0 to 9. It expects to see strings like these.
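In code, something like:

```python
# probs: probability of each of the 10 diseases for every test image
# _    : targets, which are empty for a test set, so we throw them away
# idxs : decoded predictions, i.e. the index of the most probable disease
probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)
```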

Mapping Numbers to Strings

So what do those numbers from 0 to 9 represent? We can look up our vocab to get a list. So that’s 0, that’s 1, et cetera, that’s 9. So I realized later this is a slightly inefficient way to do it but it does the job. I need to be able to map these to strings. So if I enumerate the vocab, that gives me pairs of numbers, 0, bacterial leaf blight, 1, bacterial leaf streak, et cetera.

[01:13:03]

I can then create a dictionary out of that, and then I can use pandas to look up each thing in the dictionary. They call that map. If you’re a pandas user, you’ve probably seen map used before being passed a function, which is really, really slow. But if you pass map a dict, it’s actually really, really fast.
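A sketch of that mapping, using a dict with pandas map:

```python
# Pair each class index with its disease name, e.g. {0: 'bacterial_leaf_blight', ...}
mapping = dict(enumerate(learn.dls.vocab))

# Mapping with a dict is fast; mapping with a Python function would be slow.
results = pd.Series(idxs.numpy(), name='idxs').map(mapping)
```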

Pandas Map

Do it this way if you can. So here’s our predictions. So we’ve got our submission, sample submission file, SS. So if we replace this column label with our predictions, like so, then we can turn that into a CSV.

Creating a CSV

And remember, the exclamation mark means run a bash command, a shell command. head shows the first few rows. Let’s just take a look. That looks reasonable.
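Put together, that step might look like this (a sketch run in a notebook cell, which is why the `!` works; file names are assumptions):

```python
# Drop the predictions into the sample submission and write it out.
ss['label'] = results
ss.to_csv('subm.csv', index=False)

# `!` runs a shell command from the notebook; head shows the first few rows.
!head subm.csv
```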

[01:14:00]

So we can now submit that to Kaggle.

Iterating Rapidly

Now, iterating rapidly means everything needs to be fast and easy. Things that are slow and hard don’t just take up your time, but they take up your mental energy. So even submitting to Kaggle needs to be fast. So I put it into a cell. So I can just run this cell.

Submitting to Kaggle

api.competition_submit: this CSV file, give it a description. So I just run the cell and it submits to Kaggle. And as you can see, it says here we go, successfully submitted. So that submission was terrible. Top 80%, also known as bottom 20%, which is not too surprising, right? I mean, it’s one minute of training time. But it’s something that we can start with. And however long it takes to get to this point, where you’ve put in your submission, that’s when you’ve really started, right?
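A hedged sketch of that cell, assuming the official kaggle Python package is installed and authenticated, and that the competition slug below is the one for this series:

```python
from kaggle import api

# Submit the CSV straight from the notebook with a short description.
# The competition slug is an assumption based on the competition used here.
api.competition_submit('subm.csv', 'initial resnet26d 128px, 3 epochs',
                       'paddy-disease-classification')
```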

[01:15:06]

Because then tomorrow, you can try to make a slightly better one.

Sharing Notebooks

So I like to share my notebooks. And even sharing the notebook, I’ve automated. So part of fastkaggle is you can use this thing called push_notebook.
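A rough sketch of what that call might look like; the exact arguments depend on your fastkaggle version, so treat all of the names and values below as placeholders:

```python
from fastkaggle import push_notebook

# Push this notebook to Kaggle as a public notebook attached to the competition.
# Every argument value below is a placeholder for illustration only.
push_notebook('your-kaggle-username', 'rice-disease-first-steps',
              title='Rice disease: first steps',
              file='rice-disease-first-steps.ipynb',
              competition='paddy-disease-classification',
              private=False, gpu=True)
```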

Push Notebook

And that sends it off to Kaggle to create a notebook on Kaggle. There it is. And there’s my score. As you can see, it’s exactly the same thing. Why would you create public notebooks on Kaggle?

Public Notebooks on Kaggle

Well, it’s the same brutality of feedback that you get for entering a competition.

[01:16:06]

But this time, rather than finding out in no uncertain terms whether you can predict things accurately, this time you can find out in no uncertain terms whether you can communicate things in a way that people find interesting and useful. And if you get zero votes, you know, so be it, right? That’s something to know. And then, you know, ideally, go and ask some friends, like, what do you think I could do to improve? And if they say, oh, nothing, it’s fantastic, you can tell them, no, that’s not true. I didn’t get any votes. Try again. This isn’t good. How do I make it better? You know? And you can try and improve. Because if you can create models that predict things well, and you can communicate your results in a way that is clear and compelling, you’re a pretty good data scientist. You know? Like, they’re two pretty important things.

[01:17:00]

And so, here’s a great way to test yourself out on those things and improve.

[John]

Iterative Approach

Yes, John? Yes, Jeremy, we have a sort of, I think, a timely question here from Zakia about your iterative approach. And they’re asking, do you create different Kaggle notebooks for each model that you try? So, one Kaggle notebook for the first one, then separate notebooks subsequently, or do you append to the bottom of a single notebook?

[Jeremy Howard]

Notebook Strategy

What’s your strategy? That’s a great question. And I know Zakia’s going through the daily walkthroughs, but isn’t quite caught up yet. So, I will say keep it up, because in the six hours of going through this, you’ll see me create all the notebooks. But if I go to the actual directory I used, you can see them. So, basically, yeah, I started with, you know, what you just saw.

[01:18:01]

A bit messier without the prose, but that same basic thing. I then duplicated it to create the next one, which is here. And because I duplicated it, you know, this stuff, which I still need, it’s still there, right? And so, I run it. And I don’t always know what I’m doing, you know? And so, at first, if I don’t really know what I’m doing next, when I duplicate it, it will be called, you know, “first steps in the road to the top, part one, copy one”, you know? And that’s okay. And as soon as I can, I’ll try to rename that once I know what I’m doing, you know? Or if it doesn’t seem to go anywhere, I’ll rename it into something like, you know, “experiment blah, blah, blah”. And I’ll put some notes at the bottom, and I might put it into a folder or something. But, yeah, it’s a very low-tech approach that I find works really well, which is just duplicating notebooks and editing them and naming them carefully and putting them in order.

[01:19:11]

And, you know, put the file name in when you submit as well. And then, of course, also, if you’ve got things in Git, you know, you can have a link to the Git commit, so you’ll know exactly what it is. Generally speaking for me, my notebooks will only have one submission in, and then I’ll move on and create a new notebook, so I don’t really worry about versioning so much. But you can do that as well, if that helps you. Yeah, so that’s basically what I do. And I’ve worked with a lot of people who use much more sophisticated and complex processes and tools and stuff, but none of them seem to be able to stay as well organized as I am. I think they kind of get a bit lost in their tools sometimes. And file systems and file names, I think, are good.

[01:20:03]

[John]

AutoML Frameworks

Great, thanks. So away from that kind of dev process, more towards the specifics of, you know, finding the best model and all that sort of stuff, we’ve got a couple of questions that are in the same space, which is, you know, we’ve got some people here talking about AutoML frameworks, which you might want to, you know, touch on for people who haven’t heard of those. If you’ve got any particular AutoML frameworks you think are worth recommending, or just more generally, how do you go trying different models, random forest, gradient boosting, neural network, so in that space, if you could comment a bit.

[Jeremy Howard]

AutoML and Hyperparameter Optimization

Sure. I use AutoML less than anybody I know. I would guess. Which is to say, never. Hyperparameter optimization, never.

[01:21:01]

And the reason why is I like being highly intentional. You know, I like to think more like a scientist and have hypotheses and test them carefully and come up with conclusions. Which then I implement, you know, so for example, in this best vision models of fine tuning, I didn’t try a huge grid search of every possible model, every possible learning rate, every possible preprocessing approach, blah, blah, blah, right? Instead, step one was to find out, well, which things matter, right?

Intentional Approach

So, for example, does whether we squish or crop make a difference? You know, are some models better with squish and some models better with crop? And so we just tested that for, again, not for every possible architecture, but for one or two versions of each of the main families.

[01:22:01]

That took 20 minutes. And the answer was no, in every single case, the same thing was better. So we don’t need to do a grid search over that anymore, you know? Or another classic one is like learning rates.

Learning Rate Finder

Most people do a kind of grid search over learning rates or they’ll train a thousand models, you know, with different learning rates. But this fantastic researcher named Leslie Smith invented the learning rate finder a few years ago. We implemented it, I think, within days of it first coming out as a technical report. And that’s what I’ve used ever since because it works well and runs in a minute or so.

Choosing Models

Yeah. I mean, then like neural nets versus GBMs versus random forests. That shouldn’t be too much of a question on the whole. They have pretty clear places that they go.

[01:23:03]

Like if I’m doing computer vision, I’m obviously going to use a computer vision deep learning model. And which one would I use? Well, if I’m transfer learning, which hopefully is always, I would look up the two tables here.

Best Vision Models for Fine-Tuning

This is my table for pets, which is, which are the best at fine-tuning to very similar things to what they were pre-trained on. And then the same thing for planet, which is which ones are best for fine-tuning to a dataset that is very different to what they were trained on. And as it happens, in both cases they’re very similar; in particular, ConvNeXt is right up towards the top in both cases. So I just like to have these rules of thumb. And yeah, my rule of thumb for tabular is: random forest is going to be the fastest, easiest way to get a pretty good result.

Tabular Data: Random Forest, GBM, Neural Networks

GBM’s probably going to give me a slightly better result if I need it and can be bothered fussing around.

[01:24:00]

For GBMs, I probably would actually run a hyperparameter sweep, because it is fiddly and it’s fast, so you may as well. So yeah, so now, you know, we were able to make a slightly better submission, a slightly better model.

Kaggle Iteration Speed

And so I had a couple of thoughts about this. The first thing was that thing trained in a minute on my home computer. And then when I uploaded it to Kaggle, it took about four minutes per epoch, which was horrifying. And Kaggle’s GPUs are not amazing, but they’re not that bad.

Virtual CPUs

So I knew something was up. And what was up is I realized that they only have two virtual CPUs, which nowadays is tiny.

[01:25:02]

Like, you know, as a rule of thumb you generally want about eight physical CPUs per GPU. And so it was spending all of its time just reading the data. Now, the data was 640 by 480, and we were ending up with only 128-pixel-sized bits for speed. So there’s no point doing that every epoch. So step one was to make my Kaggle iteration faster as well.

Resizing Images

And so a very simple thing to do is resize the images. So Fast.ai has a function called resize_images. And you say, okay, take all the training images and stick them in the destination, making them this size, recursively. And it will recreate the same folder structure over here. And so that’s why I call this the training path, because this is now my training data.
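A sketch of that call; the 256-pixel maximum size and the folder names are assumptions, but the shape of the call is fastai’s resize_images:

```python
from fastai.vision.all import *

# Recursively shrink every training image to at most 256px on its longest side,
# writing the results (same folder structure) into trn_path, so the CPU isn't
# decoding full 640x480 JPEGs on every epoch.
trn_path = Path('sml')
resize_images(path/'train_images', dest=trn_path, max_size=256, recurse=True)
```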

[01:26:00]

And so when I then trained on that on Kaggle, it went about four times faster, with no loss of accuracy. So that was kind of step one: actually getting my fast iteration working. Now, still, I mean, it’s a long time. And on Kaggle, you can actually see this little graph showing how much the CPU is being used, how much the GPU is being used. On your own home machine, there are free tools to do the same thing. I saw that the GPU was still hardly being used, so the CPU was still being driven pretty hard. I wanted to use a better model anyway to move up the leaderboard.

ConvNeXt Tiny Model

So I moved from a... oh, by the way, this graph is very useful.

Speed vs. Error Rate

So this is speed versus error rate by family.

[01:27:05]

And so we’re about to be looking at these ConvNeXt models. So we’re going to be looking at this one, ConvNeXt Tiny. Here it is, ConvNeXt Tiny. So we were looking at ResNet26D, which took this long on this dataset. But this one here is nearly the best. It’s third best. But it’s still very fast. And so it’s the best overall score. So let’s use this, particularly because, you know, we’re still spending all of our time waiting for the CPU anyway. So it turned out that when I switched my architecture to ConvNeXt, it basically ran just as fast on Kaggle.

Training the ConvNeXt Model

So we can then train that. Let me switch to the Kaggle version because my outputs are missing for some reason.

[01:28:03]

So, yeah, so I started out by running the ResNet26D on the resized images and got a similar error rate. But I ran a few more epochs and got a 12% error rate. And so then I did exactly the same thing, but with ConvNeXt Small, and got a 4.5% error rate.
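Swapping the architecture is basically just a change of string; the exact timm model name below is an assumption and may differ between timm versions:

```python
# Same training recipe as before, but with a ConvNeXt Small pretrained model.
# The timm model name here is an assumption; several convnext_small variants exist.
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate).to_fp16()
learn.fine_tune(12, 0.01)
```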

ConvNeXt Model

So don’t think that different architectures are just tiny little differences. This is over twice as good. And a lot of folks you talk to will never have heard of this, ConvNeXt, because it’s very new. And I’ve noticed a lot of people tend not to keep up to date with new things. They kind of learn something at university and then they stop learning. So if somebody’s still just using ResNets all the time, you know, you can tell them we’ve actually moved on, you know.

[01:29:00]

ResNets are still probably the fastest. But for the mix of speed and performance, you know, not so much.

ConvNeXt: Rules of Thumb

ConvNeXt, you know, again, you want these rules of thumb, right? If you’re not sure what to do, use ConvNeXt. And then like most things, there’s different sizes. There’s a tiny, there’s a small, there’s a base, there’s a large, there’s an extra large. And, you know, it’s just, well, let’s look at the picture. This is it here, right? Large takes longer, but has lower error. Tiny takes less time, but has higher error, right? So you pick the speed versus accuracy tradeoff that’s right for you.

Speed vs. Accuracy Tradeoff

So for us, small is great. And so, yeah, now we’ve got a 4.5% error.

[01:30:00]

That’s terrific.

Iterating Further

Now let’s iterate. On Kaggle, this is taking about a minute per epoch. On my computer, it’s probably taking about 20 seconds per epoch, so not too bad. So, you know, one thing we could try is instead of using Squish as our preprocessing, let’s try using crop.

Cropping Images

So that will randomly crop out an area. And that’s the default. So if I remove the method equals Squish, that will crop. So you see how I’ve tried to get everything into a single function, right? This single function, I can tell it, let’s go and find the definition: what architecture do I want to train, how do I want to transform the items, how do I want to transform the batches, and how many epochs do I want to do? That’s basically it, right? So this time, I want to use the same architecture, ConvNeXt, resize using crop instead of squish, and then use the same data augmentation.
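That wrapper might look roughly like this; the function name, arguments, and defaults here are assumptions sketching the idea rather than the lesson’s exact code, and trn_path and the timm model name are assumed from earlier:

```python
from fastai.vision.all import *

def train(arch, item, batch, epochs=5):
    "Train a model given an architecture, item transforms, batch transforms and epoch count."
    dls = ImageDataLoaders.from_folder(
        trn_path, valid_pct=0.2, seed=42,
        item_tfms=item, batch_tfms=batch)
    learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
    learn.fine_tune(epochs, 0.01)
    return learn

# Resize's default method is crop, so leaving out method='squish' gives a random crop.
learn = train('convnext_small_in22k',
              item=Resize(192),
              batch=aug_transforms(size=128, min_scale=0.75),
              epochs=12)
```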

[01:31:01]

And, okay, the error rate’s about the same. It’s a tiny bit worse, but not enough to be interesting. Instead of cropping, we can pad.

Padding Images

Now padding’s interesting. Do you see how these are all square? Right? But they’ve got black borders. Padding is interesting because it’s the only way of preprocessing images that doesn’t distort them and doesn’t lose anything. If you crop, you lose things. If you squish, you distort things. This does neither. Now, of course, the downside is that there are pixels that are literally pointless. They contain zeros. So every way of getting this working has its compromises. But this approach of resizing where we pad with zeros is not used enough. And it can actually often work quite well. And in this case, it was about as good as our best so far. But no, not huge differences yet.
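Assuming the train() sketch above, padding would look something like:

```python
# Resize by padding with zeros instead of cropping or squishing, so nothing is
# lost or distorted; the black borders are the zero-padded pixels.
learn = train('convnext_small_in22k',
              item=Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros),
              batch=aug_transforms(size=128, min_scale=0.75),
              epochs=12)
```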

[01:32:02]

Test Time Augmentation (TTA)

What else could we do? Well, what we could do is, see these pictures? This is all the same picture. But it’s gone through our data augmentation. So sometimes it’s a bit darker. Sometimes it’s flipped horizontally. Sometimes it’s slightly rotated. Sometimes it’s slightly warped. Sometimes it’s zooming into a slightly different section. But this is all the same picture. Maybe our model would like some of these versions better than others. So what we can do is we can pass all of these to our model. Get predictions for all of them. And take the average. So it’s our own kind of like little mini bagging approach. And this is called test time augmentation. Fast.ai is very unusual in making that available in a single method.

[01:33:00]

You just call tta. And it will pass multiple augmented versions of the image and average them for you. And so this is the same model as before, which had 4.5%. So instead, we get TTA predictions and then get the error rate. Wait, why does it say 4.8? Last time I did this it was way better. That’s messing things up, isn’t it? So when I did this originally on my home computer, it went from like 4.5 to 3.9. So possibly I got very bad luck this time. This is the first time I’ve actually ever seen TTA give a worse result. So that’s very weird. I wonder if I should do something other than the crop padding.
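In code, TTA over the validation set is just (a sketch):

```python
# Average the model's predictions over several augmented versions of each
# validation image, then measure the error rate of those averaged predictions.
tta_preds, targs = learn.tta(dl=learn.dls.valid)
error_rate(tta_preds, targs)
```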

[01:34:04]

All right. I’ll have to check that out and I’ll try and come back to you and find out why in this case this one was worse.

TTA: Improving Results

Anyway. Take my word for it. Every other time I’ve tried it, TTA has been better. So then, you know, now that we’ve got a pretty good way of resizing, we’ve got TTA, we’ve got a good training process, let’s just make bigger images.

Rectangular Images

And something that’s really interesting that a lot of people don’t realize is your images don’t have to be square. They just all have to be the same size. And given that nearly all of our images are 640 by 480, we can just pick, you know, that aspect ratio. So for example, 256 by 192, and we’ll resize everything to a rectangle with the same aspect ratio. And that should work even better still. So if we do that, we’ll do 12 epochs.
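A sketch, again using the train() wrapper above; the 256 by 192 final size comes from the lesson, while the intermediate item size is an assumption:

```python
# Keep the images' own 4:3 aspect ratio instead of squaring them: resize items
# to a 4:3 rectangle, then train on 256-by-192 rectangles. The order of the
# dimensions in these size tuples is an assumption; check the height/width
# convention of your fastai version against your own data.
learn = train('convnext_small_in22k',
              item=Resize((480, 360)),
              batch=aug_transforms(size=(256, 192), min_scale=0.75),
              epochs=12)

# Then TTA on top of the rectangular training.
tta_preds, targs = learn.tta(dl=learn.dls.valid)
error_rate(tta_preds, targs)
```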

[01:35:01]

Resizing to Rectangular Images

Okay. Now our error rate’s down to 2.2%. And then we’ll do TTA. Okay. This time you can see it’s actually improving, down to under 2%. So that’s pretty cool, right? At the start of this notebook our error rate was 12%, and by the time we’ve got through our little experiments, we’re down to under 2%. And nothing about this is in any way specific to rice or this competition. You know, it’s a very mechanistic, standardized approach, which you can use for certainly any kind of computer vision competition of this type, almost any computer vision dataset.

[01:36:00]

Standardized Approach

But, you know, it would look very similar for a collaborative filtering model, a tabular model, NLP model. So of course, again, I want to submit as soon as I can.

Submitting to Kaggle

So just copy and paste the exact same steps I took last time basically for creating a submission. So as I said, last time we did it using pandas, but there’s actually an easier way.

Mapping Numbers to Strings

So the step where I’ve got the numbers from nought to nine, which is, which rice disease is it? So here’s a cute idea. We can take our vocab and make it an array. So that’s going to be a list of 10 things. And then we can index into that vocab with our indices, which is kind of weird. This is a list of 10 things. This is a list of, I don’t know, four or five thousand things. So this will give me four or five thousand results, which is the vocab item for each thing. So this is another way of doing the same mapping.
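A sketch of that indexing trick, reusing `idxs` from the predictions above:

```python
import numpy as np
import pandas as pd

# vocab is a 10-element array of disease names; indexing it with the few-thousand
# predicted indices returns the disease-name string for every test image at once.
vocab = np.array(learn.dls.vocab)
results = pd.Series(vocab[idxs], name='label')
```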

[01:37:00]

And I would spend time playing with this code to understand what it does, because it’s the kind of code that’s very fast, not just in terms of writing, but it also optimizes very, very well on the CPU. So this is the kind of coding you want to get used to.

Submitting to Kaggle

This kind of indexing. Anyway, so then we can submit it just like last time. And when I did that, I got in the top 25%. And that’s where you want to be, right? Generally speaking, I find in Kaggle competitions the top 25% is kind of the solid, competent level. That’s not to say it’s easy. You’ve got to know what you’re doing. But if you get in the top 25%, I think you can really feel like, yeah, this is a very reasonable attempt.

[01:38:03]

And so, yeah, I think this is a very reasonable attempt.

Wrapping Up

Okay. Before we wrap up, John, any last questions?

[John]

TTA During Training

Yeah, there are two, I think, that would be good if we could touch on quickly before you wrap up. One from Victor asking about TTA: when I use TTA during my training process, do I need to do something special during inference? Or is this something you use only on validation data?

[Jeremy Howard]

Okay. So just to explain, TTA means test time augmentation. So specifically, it means inference. So I think you mean augmentation during training. So yeah. So during training, you basically always do augmentation, which means you’re varying each image slightly so that the model never sees exactly the same image twice, and so it can’t memorize it. In fast.ai, and as I say, I don’t think anybody else does this as far as I know, if you call tta, it will use the exact same augmentation approach on whatever dataset you pass it, multiple times on the same image, and average out the predictions.

[01:39:16]

So you don’t have to do anything different, but if you didn’t have any data augmentation in training, you can’t use TTA.

Data Augmentation

It uses the same by default, the same data augmentation you use for training.

[John]

Great. Thank you. And the other one is about how, you know, when you first started this example, you squared the images, and you talked about squashing versus cropping versus clipping and scaling and so on. But then you went on to say that these models can actually take rectangular inputs. So there’s a question that’s kind of probing at that, you know: if the models can take rectangular inputs, why would you ever even care, as long as they’re all the same size?

[01:40:02]

[Jeremy Howard]

Rectangular Inputs

So I find most of the time data sets tend to have a wide variety of input sizes and aspect ratios. So, you know, if there’s just as many tall skinny ones as wide short ones, you know, it doesn’t make sense to create a rectangle because some of them you’re going to really destroy them. So a square is the kind of best compromise in some ways. There are better things we can do, which we don’t have any off the shelf library support for yet. And I don’t think, I don’t know that anybody else has even published about this, but we’ve experimented with kind of trying to batch things that are similar aspect ratios together and use the kind of median rectangle for those and have had some good results with that. But honestly, 99.999% of people, given a wide variety of aspect ratios, chuck everything into a square.

[01:41:07]

[John]

Padding with Black Pixels

A follow-up; this is my own interest. Have you ever looked at, you know, so the issue with padding, as you say, is that you’re putting black pixels there. Those are not NaNs; those are black pixels. And so there’s something problematic to me, you know, conceptually about that. You know, when you see, for example, four-to-three aspect ratio footage presented for broadcast on sixteen-to-nine, you get the kind of blurred, stretched kind of stuff.

[Jeremy Howard]

Reflection Padding

No, we’ve played with that a lot. I used to be really into it, actually. And fast.ai still by default uses reflection padding, which means if this is, I don’t know, let’s say this is a 20-pixel-wide thing, it takes the 20 pixels next to it and flips it over and sticks it here. And it looks pretty good. You know, another one is copy, which simply repeats the outside pixel, and that’s a bit more like what TV does.

[01:42:04]

Padding: Impact on Results

You know, much to my chagrin, it turns out none of them really help. You know, if anything, they make it worse. Because in the end, the computer wants to know, no, this is the end of the image. There’s nothing else here. And if you reflect it, for example, then you’re kind of creating weird spikes that didn’t exist. And the computer’s going to be like, oh, I wonder what that spike is. So, yeah, it’s a great question. And I obviously spent like a couple of years assuming that we should be doing things that look more image like. But actually, the computer likes things to be presented to it in as straightforward a way as possible. All right. Thanks, everybody. And hope to see some of you in the walkthroughs. Otherwise, see you next time.