Lesson 7: Practical Deep Learning for Coders 2022

[Jeremy Howard]

Lesson 7: Inside a Neural Net

Right, welcome to lesson seven, the penultimate lesson of Practical Deep Learning for Coders part one. And today we’re going to be digging into what’s inside a neural net. We’ve already seen what’s inside kind of the most basic possible neural net, which is a sandwich of fully connected layers, or linear layers, and ReLUs. And so we built that from scratch. But there are a lot of tweaks that we can do. And most of the tweaks that we probably care about are to the very first layer or the very last layer. So that’s where we’ll focus. But over the next couple of weeks, we’ll look at some of the tweaks we can do inside as well.

[00:01:05]

Paddy, Rice Paddy Competition

So I’m going to do this through the lens of the paddy, rice paddy competition we’ve been talking about. And we got to a point where, let’s have a look. So we created a ConvNeXt model. We tried a few different types of basic pre-processing. We added test time augmentation. And then we scaled that up to larger images and rectangular images. And that got us into the top 25% of the competition.

[00:02:03]

So that’s part two of the so-called road to the top series, which is increasingly misnamed. Since we’ve been presenting these notebooks, more and more of our students have been passing me on the leaderboard. So currently, first and second place are both people from this class, Kurian and Nick. Go to hell, the two of you, you’re in my sights, and leave my class immediately. No, congratulations, and good luck to you.

Scaling Up Models

So in part three, I’m going to show you a really interesting trick, a very simple trick, for scaling up these models further. If you’ve tried to use larger models, so replacing the word small with the word large in those architecture names and trying to train a larger model, here’s what you’ll discover.

[00:03:09]

A larger model has more parameters. More parameters means it can find more tricky little features. And broadly speaking, models with more parameters therefore ought to be more accurate. The problem is that those activations, or more specifically the gradients that have to be calculated, chew up memory on your GPU. And your GPU is not as clever as your CPU at kind of sticking stuff it doesn’t need right now into virtual memory on the hard drive. When it runs out of memory, it runs out of memory. And it also doesn’t do such a good job as your CPU at kind of shuffling things around to try and find memory. It just allocates blocks of memory, and they stay allocated until you remove them. So if you try to scale up your models to bigger models, unless you have very expensive GPUs, you will run out of space.

[00:04:07]

And you’ll get an error, something like a “CUDA out of memory” error.

CUDA Out of Memory Error

So if that happens, the first thing I’ll mention is it’s not a bad idea to restart your notebook, because these errors can be a bit tricky to recover from otherwise. And then I’ll show you how you can use as large a model as you like. Basically, you’ll be able to use an extra-large model on Kaggle. So let me explain. Now, when you run something on Kaggle, like actually on Kaggle, you’re generally going to be on a 16 gig GPU. And you don’t have to run stuff on Kaggle. You can run stuff on your home computer or Paperspace or whatever. But sometimes, if you want to do Kaggle competitions, you’ll have to run stuff on Kaggle, because a lot of competitions are what they call code competitions, which is where the only way to submit is from a notebook that you’re running on Kaggle.

[00:05:10]

And then a second reason to run stuff on Kaggle is that, you know, your notebooks will appear, you know, with the leaderboard score on them, and so people can see which notebooks are actually good. And I kind of like, even in things that aren’t code competitions, I love trying to be the person who’s number one on the notebook score leaderboard, because that’s something which, you know, you can’t just work at NVIDIA and use a thousand GPUs and win a competition through a combination of skill and brute force. Everybody has the same nine hour timeout to work with. So I think it’s a good way of keeping the, you know, things a bit more fair. Now, so my home GPU has 24 gig.

[00:06:02]

So I wanted to find out what I can get away with, you know, in 16 gig. And the way I did that is, I think, a useful thing to discuss, because again, it’s all about fast iteration. So I wanted to really quickly find out how much memory a model will use. So there’s a really quick, hacky way I can do that, which is to say, okay, for the training set, let’s not use, so here’s the value counts of labels, so the number of each disease. Let’s not look at all the diseases. Let’s just pick one, the smallest one, right? And let’s make that our training set. Our training set is the bacterial panicle blight images. And now I can train a model with just 337 images without changing anything else. Not that I care about that model, but then I can see how much memory it used. It’s important to realize that, you know, each image you pass through is the same size, and each batch is the same size.

[00:07:00]

So training for longer won’t use more memory. So that’ll tell us how much memory we’re going to need. So what I then did was I tried training different models to see how much memory they used up. Now, what happens if we train a model? So obviously ConvNeXt Small doesn’t use too much memory. So here’s something that reports the amount of GPU memory used, just by basically printing out CUDA’s GPU processes. And you can see ConvNeXt Small took up four gig. And also, this might be interesting to you: if you then call Python’s garbage collection, gc.collect, and then call PyTorch’s empty cache, that should basically get your GPU back to a clean state of not using any more memory than it needs, so you can start training the next model without restarting the kernel.
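Here’s a minimal sketch of that memory-reporting and cleanup helper. The function name is just illustrative, but the calls themselves are standard Python and PyTorch:

```python
import gc
import torch

def report_gpu():
    # Show which processes are using the GPU and how much memory each holds
    print(torch.cuda.list_gpu_processes())
    # Drop Python-level references, then hand cached blocks back to the driver
    gc.collect()
    torch.cuda.empty_cache()
```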

[00:08:04]

So what would happen if we tried to train this little model and it crashed with a CUDA out of memory error? What do we do?

Gradient Accumulation

We can use a cool little trick called gradient accumulation. What’s gradient accumulation? So what’s gradient accumulation? Well, I added this parameter to my train method here. So my train method creates my data loaders, creates my learner, and then depending on whether I’m fine-tuning or not, either fits or fine-tunes it. But there’s one other thing it does. It does this gradient accumulation thing. What’s that about? Well, the key step is here. I set my batch size, so that’s the number of images that I pass through to the GPU all at once, to 64, which is my default, divided by, slash slash means integer divide in Python, divided by this number.

[00:09:06]

So if I pass two, it’s going to use a batch size of 32. If I pass four, it’ll use a batch size of 16. Now, that obviously should let me cure any memory problems, use a smaller batch size, but the problem is that now the dynamics of my training are different, right? The smaller your batch size, the more volatility there is from batch to batch. So now your learning rates are all messed up. You don’t want to be messing around with trying to, you know, find a different set of kind of optimal parameters for every batch size for every architecture. So what we want to do is find a way to run just, let’s say, accum is two, accumulate equals two. Let’s say we just want to run 32 images at a time through. How do we make it behave as if it was 64 images?

[00:10:00]

Well, the solution to that problem is to consider our training loop. This is basically the training loop we used from a couple of lessons ago, the one we created manually. We go through each x, y pair in the data loader. We calculate the loss using some coefficients based on that x, y pair, and then we call backward on that loss to calculate the gradients, and then we subtract from the coefficients the gradients times the learning rate, and then we zero out the gradients. I’ve skipped a bit of stuff, like the with torch.no_grad() thing. Actually, no, I don’t need that because I’ve got .data. No, that’s it. That should all work fine. I’ve skipped out printing the loss. That’s about it. So here is a variation of that loop where I do not always subtract the gradient times the learning rate. Instead, I go through each x, y pair in the data loader.

[00:11:03]

I calculate the loss. I look at how many images are in this batch. So initially, count starts at zero, and it’s going to go up to 32, say, if I’ve divided the batch size by two. And then, if count is greater than 64, I do my coefficients update. Well, it’s not yet. So I skip back to here, and I do this again. And if you remember, there was this interesting subtlety in PyTorch, which is if you call backward again without zeroing out the gradients, then it adds this new set of gradients to the old gradients. So by doing these two half-size batches without zeroing out the gradients between them, it’s adding them up. So I’m going to end up with the total gradient of a 64-image batch size, but passing only 32 at a time.
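As a rough sketch, assuming the calc_loss, coeffs, lr and dl names from the earlier from-scratch lessons, the two loops look something like this:

```python
# Plain training loop, roughly as we built it by hand in earlier lessons
for x, y in dl:
    calc_loss(coeffs, x, y).backward()
    coeffs.data.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

# Same loop with gradient accumulation: only update once `count` reaches the
# effective batch size, so two half-size batches behave like one 64-image batch
count = 0
for x, y in dl:
    count += len(x)                     # images seen since the last update
    calc_loss(coeffs, x, y).backward()  # gradients add up across calls to backward
    if count > 64:                      # (or >=, as discussed in the questions below)
        coeffs.data.sub_(coeffs.grad * lr)
        coeffs.grad.zero_()
        count = 0
```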

[00:12:07]

If I used accumulate equals four, it would go through this four times, adding them up, before it subtracted out coefficients.grad times the learning rate and zeroed it out. If I put in accum equals 64, it would go through a single image at a time. And after 64 passes through, eventually count would be greater than 64, and we would do the update. So that’s gradient accumulation. It’s a very simple idea, which is that you don’t have to actually update your weights every loop through for every mini-batch. You can just do it from time to time. But it has quite significant implications, which I find most people seem not to realize, which is if you look on, like, Twitter or Reddit or whatever, people say, oh, I need to buy a bigger GPU to train bigger models.

[00:13:11]

But they don’t. They could just use gradient accumulation. And so given the huge price differential between, say, an RTX 3080 and an RTX 3090 Ti, huge price differential, the performance is not that different. The big difference is the memory. So what? Just put in a bit smaller batch size and do gradient accumulation. So there’s actually not that much reason to buy giant GPUs. John?

[John]

Are the results with gradient accumulation numerically identical?

[Jeremy Howard]

They’re numerically identical for this particular architecture. There is something called batch

[00:14:01]

normalization, which we will look at in part two of the course, which keeps track of the moving average of standard deviations and averages, and does it in a mathematically slightly incorrect way, as a result of which, if you’ve got batch normalization, it basically will introduce more volatility, which is not necessarily a bad thing, but because it’s not mathematically identical, you won’t necessarily get the same results. ConvNeXt doesn’t use batch normalization, so it is the same. And in fact, a lot of the models people want to use really big versions of, which is NLP ones, transformers, tend not to use batch normalization, but instead they use something called layer normalization, which, yeah, doesn’t have the same issue. I think that’s probably fair to say. I haven’t thought about it that deeply.

[00:15:01]

In practice, I found adding gradient accumulation for ConvNext has not caused any issues for me. I don’t have to change any parameters when I do it. Any other questions on the forum, John?

[John]

Gradient Accumulation Clarifications

Tamori is asking, shouldn’t it be count greater than or equal to 64, if BS equals 64?

[Jeremy Howard]

No, I don’t think so. Oh, yeah. So we start at zero, then it’s going to be 32, then it’s going to be, yeah, yeah, probably. You can probably tell I didn’t actually run this code.

[John]

Madhav is asking, does this mean that lr_find is based on the batch size set during the data block?

[Jeremy Howard]

Yeah, so lr_find just uses your data loader’s batch size.

[John]

Edward is asking, why do we need gradient accumulation rather than just using a smaller batch size?

[00:16:02]

And follows up with, how would we pick a good batch size?

[Jeremy Howard]

Well, here’s the thing about just using a smaller batch size, right? Different architectures take up different amounts of memory. And so you’ll end up with different batch sizes for different architectures, which is not necessarily a bad thing, but each of them is going to then need a different learning rate and maybe even different weight decay or whatever. Like, the settings that work really well for batch size 64 won’t necessarily work really well for batch size 32. And, you know, you want to be able to experiment as easily and quickly as possible. I think the second part of your question was how do you pick an optimal batch size? Honestly, the standard approach is to pick the largest one you can, just because it’s faster that way; you’re getting more parallel processing going on.

[00:17:04]

Although to be honest, I quite often use batch sizes that are quite a bit smaller than I need, because quite often it doesn’t make that much difference. But yeah, the rule of thumb would be, you know, pick a batch size that fits in your GPU. And for performance reasons, I think it’s generally a good idea to have it be a multiple of eight. Everybody seems to always use powers of two, I don’t know, like, I don’t think it actually matters.

[John]

Learning Rate Scaling

And look, there’s one other, just a clarification, or a check: should the learning rate be scaled according to the batch size?

[Jeremy Howard]

Yeah, so generally speaking, the rule of thumb is that if you divide the batch size by two, you divide the learning rate by two. But unfortunately, it’s not quite perfect. Did you have a question, Nick? If you do, you can. Okay, cool.

[John]

Yeah. Now that’s us all caught up.

Gradient Accumulation in Fast AI

Thanks, Jeremy.

[Jeremy Howard]

Good questions. Thank you.

[00:18:02]

So gradient accumulation in fastai is very straightforward. You just divide the batch size by however much you want to divide it by. And then you add something called a callback. And a callback is something which changes the way the model trains. This callback is called GradientAccumulation. And you pass in the effective batch size you want. And then, when you create the learner, you say, these are the callbacks I want, and so it’s going to pass in the GradientAccumulation callback. So it’s going to only update the weights once it’s got 64 images. So if we pass in accum equals one, it won’t do any gradient accumulation, and that uses four gig. If we use accum equals two, about three gig. Accum equals four, about two and a half gig.

[00:19:01]

And generally, the bigger the model, the closer you’ll get to a kind of a linear scaling, because models have a kind of a bit of overhead that they have anyway.
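So the train function being described looks roughly like this. This is a sketch rather than the notebook’s exact code: trn_path and tst_files stand in for the competition’s training folder and test file list, and the transforms and training schedule are just plausible defaults.

```python
from fastai.vision.all import *

def train(arch, size, item=Resize(480, method='squish'), accum=1, epochs=12):
    # Shrink the per-step batch, but accumulate gradients so the effective batch size stays 64
    dls = ImageDataLoaders.from_folder(
        trn_path, valid_pct=0.2,              # note: no seed=, so each run gets a different split
        item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75),
        bs=64 // accum)
    cbs = GradientAccumulation(64) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(epochs, 0.01)
    # Return test-time-augmentation predictions on the test set, for ensembling later
    return learn.tta(dl=dls.test_dl(tst_files))
```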

Training Different Models

So what I then did was I just went through all the different models I wanted to try. So I wanted to try ConvNeXt Large at 320 by 240, ViT Large, SwinV2 Large, Swin Large. And on each of these, I just tried running it with accum equals one. And actually, every single time, for all of these, I got an out-of-memory error. And then I tried each of them independently with accum equals two. And so it turns out that all of these worked with accum equals two. And it only took me 12 seconds each time. So that was a very quick thing for me to then know, okay, I now know how to train all of these models on a 16 gigabyte card. So I can check here, they’re all in less than 16 gig. So then I just created a little dictionary of all the architectures I wanted.

[00:20:03]

And for each architecture, all of the resize methods I wanted and final sizes I wanted. Now, these models, ViT, SwinV2 and Swin, are all transformer models, which means that, well, most transformer models, nearly all of them, have a fixed size. This one’s 192, this one’s 224. So I have to make sure that my final size is a square of the required size. Otherwise, I get an error. There is a way of working around this, but I haven’t experimented with it enough to know when it works well and when it doesn’t. So we’ll probably come back to that in part two. So for now, we’re just going to use the size that they ask us to use. So with this dictionary of architectures, and for each architecture, kind of pre-processing details, we switch the training path back to using all of our images.

[00:21:03]

And then we can loop through each architecture and loop through each item transforms and sizes and train the model. And then the training script, if you’re fine-tuning, returns the TTA predictions. So I append all those TTA predictions for each model, for each type, into a list. And after each one, it’s a good idea to do this garbage collection and empty cache, because otherwise I find what happens is your GPU memory kind of, I don’t know, I think gets fragmented or something. And after a while, it runs out of memory, even when you thought it wouldn’t. So this way, you can really do as much as you like without running out of memory.
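That loop might be sketched like this, reusing the hypothetical train function and cleanup calls from above; the timm model names and the transform details are illustrative of the kind being described, not the notebook’s exact choices:

```python
import gc
import torch
from fastai.vision.all import *

# Architectures to try, each with its item transform and final size
models = {
    'convnext_large_in22k': [(Resize((640, 480)), (320, 240))],
    'vit_large_patch16_224': [(Resize(480, method='squish'), 224)],
    'swinv2_large_window12_192_22k': [(Resize(480, method='squish'), 192)],
    'swin_large_patch4_window7_224': [(Resize(480, method='squish'), 224)],
}

tta_res = []
for arch, details in models.items():
    for item, size in details:
        print('---', arch, size)
        tta_res.append(train(arch, size, item=item, accum=2))  # returns TTA predictions
        gc.collect()
        torch.cuda.empty_cache()  # avoid slowly running out of GPU memory between runs
```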

[00:22:03]

Ensemble of Models

So they all train, train, train, train. And one key thing to note here is that in my train script, my data loaders does not have the seed equals parameter. So I’m using a different training set every time. So that means that for each of these different runs, they’re using also different validation sets. So they’re not directly comparable, but you can kind of see they’re all doing pretty well, 2.1 percent, 2.3 percent, 1.7 percent, and so forth. So why am I using different training and validation sets for each of these? That’s because I want to ensemble them. So I’m going to use bagging, which is I am going to take the average of their predictions.

[00:23:06]

Now, I mean, really, when we talked about random forest bagging, we were taking the average of, like, intentionally weak models. These are not intentionally weak models. They’re meant to be good models, but they’re all different. They’re using different architectures and pre-processing approaches. And so in general, we would hope that these different approaches, some might work well for some images and some might work well for other images. And so when we average them out, hopefully we’ll get a good blend of kind of different ideas, which is kind of what you want in bagging. So we can stack up that list of different, of all the different probabilities and take their mean. And so that’s going to give us 3,469 predictions. That’s our test set size. And each one has 10 probabilities, the probability of each disease.

[00:24:00]

And so then we can use argmax to find which probability index is the highest. So that’s going to give us our list of indexes. So this is basically the same steps as we used before to create our CSV submission file. So at the time of creating this analysis, that got me to the top of the leaderboard. And in fact, these are my four submissions and you can see each one got better. Now you’re not always going to get this nice monotonic improvement, right? But you want to be trying to submit something every day to kind of like try out something new, right? And the more you practice, the more you’ll get a good intuition of what’s going to help, right? So partly I’m showing you this to say, it’s not like purely random as to whether things work or don’t. Once you’ve been doing this for a while, you know, you will generally be improving things most of the time.
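Put together, the ensembling and submission step might look roughly like this. The sample submission filename and label column are assumptions based on the usual Kaggle format, and dls and path stand for the data loaders and competition path from earlier (in practice you’d grab the vocab from one of the trained learners):

```python
import numpy as np
import pandas as pd
import torch

# Each entry of tta_res is (probabilities, targets); keep just the probabilities
all_probs = [probs for probs, _ in tta_res]

# Bagging: average the probabilities across models, then take the most likely class
avg_probs = torch.stack(all_probs).mean(0)   # shape: (3469 test images, 10 diseases)
idxs = avg_probs.argmax(dim=1)

# Map class indices back to disease names and write the submission CSV
vocab = np.array(dls.vocab)
ss = pd.read_csv(path/'sample_submission.csv')
ss['label'] = vocab[idxs]
ss.to_csv('subm.csv', index=False)
```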

[00:25:01]

So as you can see from the descriptions, my first submission was our ConvNeXt Small, the 12 epochs with TTA. And then an ensemble of ConvNeXts. So it’s basically this exact same thing, but just retraining a few with different training subsets. And then this is the same thing again. This is the thing we just saw, basically, the ensemble of large models with TTA. And then the last one was something I skipped over, which was that the ViT models were the best in my testing. So I basically weighted them as double in the ensemble. Pretty unscientific, but again, it gave it another boost. And so that was it.

[John]

K-Fold Cross-Validation

All right, John. Uh, yes, thanks, Jeremy. Uh, so in no particular order, Kurian is asking, would trying out cross-validation with k-folds with the same architecture make sense?

[00:26:04]

[Jeremy Howard]

Okay, so a popular thing is to do k-fold cross-validation. So k-fold cross-validation is something very, very similar to what I’ve done here. So what I’ve done here is I’ve trained a bunch of models with different training sets. Each one is a different random 80% of the data. Five-fold cross-validation does something similar, but rather than picking, say, five random subsets, instead it first does all except the first 20% of the data, then all but the second 20%, then all but the third 20%, and so forth. And so you end up with five subsets, each of which has a non-overlapping validation set. And then you’ll ensemble those.

[00:27:00]

You know, in theory, maybe that could be slightly better, because you’re kind of guaranteed that every row appears in four of the training sets, effectively. It also has the benefit that you could average those five validation sets, because there’s no overlap between them, to get a cross-validation score. Personally, I generally don’t bother. And the reason I don’t is because this way I can add and remove models very easily. You know, I can just add another architecture or whatever to my ensemble without trying to find a different non-overlapping subset. So, yeah, cross-validation is therefore something that I use probably less than most people, or almost never.

[00:28:01]

[John]

Drawbacks of Gradient Accumulation

Awesome. Thank you. Just going back to gradient accumulation, are there any other kind of drawbacks or potential gotchas with it?

[Jeremy Howard]

No, not really. Yeah, like, amazingly, it doesn’t even really slow things down much, you know, going from a batch size of 64 to a batch size of 32. By definition, you had to do it because your GPU’s full, so you’re obviously giving it a lot of data, so it’s probably going to be using its processing speed pretty effectively. So yeah, no, it’s just a good technique, and it means we should all be buying cheaper graphics cards with less memory in them. You know, I don’t know the prices, but I suspect you could probably buy, like, two 3080s for the price of one 3090 Ti or something. That would be a very good deal.

[John]

GPU Recommendations

Uh, yes, clearly you’re not on the Nvidia payroll.

[00:29:01]

So look, this is a good segue then. We did have a question about sort of GPU recommendations and there’s been a bit of chat on that as well. I bet. Um, so any, any, you know, commentary, any additional commentary around GPU recommendations?

[Jeremy Howard]

No, not really. I mean, obviously, at the moment, Nvidia is the only game in town. You know, if you’re trying to use an Apple M1 or M2, or an AMD card, you’re basically in for a world of pain in terms of compatibility and stuff and unoptimized libraries and whatever. The Nvidia consumer cards, so the ones that start with RTX, are much cheaper, but are just as good as the expensive enterprise cards. So you might be wondering why anybody would buy the expensive enterprise cards.

[00:30:00]

And the reason is that there’s a licensing issue: Nvidia will not allow you to use an RTX consumer card in a data center, which is also why cloud computing is more expensive than it kind of ought to be, because everybody selling cloud computing GPUs is selling these cards that are, I can’t remember, I think, like three times more expensive for kind of the same features. So yeah, if you do get serious about deep learning to the point that you’re prepared to invest a few days in administering a box and, I guess it depends, prices hopefully will start to come down, but currently a thousand or two thousand dollars on buying a GPU, then, you know, that’ll probably pay you back pretty quickly.

[John]

Great. Thank you. Um, let’s see, another one’s come in.

Teacher-Student Models

Uh, if you have a, back on models, not hardware, if you have a well-functioning but large model, can it make sense to train a smaller model to produce the same final activations as the larger model?

[00:31:09]

[Jeremy Howard]

Oh yeah, absolutely. I’m not sure we’ll get into that this time around, but we’ll cover that in part two, I think. But yeah, basically there’s teacher-student models and model distillation, which, broadly speaking, are ways to make inference faster by training small models that work the same way as large models. Great. Thank you.

Road to the Top Conclusion

All right. So that is the actual real end of road to the top, because beyond that, we don’t actually cover how to get closer to the top.

Multi-Target Model

You’d have to ask Kurian to share his techniques to find out that, or Nick, who’s in second place. Part four is actually something that I think is very useful to know about for learning.

[00:32:01]

And it’s going to teach us a whole lot about how the last layer of a neural network works. And specifically, what we’re going to try to do is build a model that doesn’t just predict the disease, but also predicts the type of rice. So how would you do that?

Data Loader with Two Dependent Variables

So here’s the data loader we’re going to try to build. It’s going to be something that for each image, it tells us the disease and the type of rice. I say disease, sometimes normal, I guess some of them are not diseased. So to build a model that can predict two things. The first thing is you’re going to need data loaders that have two dependent variables. And that is shockingly easy to do in fast AI, um, thanks to the data block. So we’ve seen the data block before.

[00:33:02]

We haven’t been using it for the paddy competition so far because we haven’t needed it. We could just use ImageDataLoaders.from_folder. So that’s like the highest level API, the simplest API. If we go down a level deeper into the data block, we have a lot more flexibility. So if you’ve been following the walkthroughs, you’ll know that as I built this, the first thing I actually did was to simply replicate the previous notebook, but replace the ImageDataLoaders.from_folder with a DataBlock, to try to do, first of all, exactly the same thing. And then I added the second dependent variable. So if we look at the previous ImageDataLoaders.from_folder thingy, here it is: we are passing in some item transforms and some batch transforms, and we had something saying what percentage should be the validation set.

[00:34:06]

So in a DataBlock, if you remember, we have to pass in a blocks argument saying what kind of data the independent variable is and what the dependent variable is. So to replicate what we had before, we would just pass in ImageBlock comma CategoryBlock, because we’ve got an image as our independent variable and a category, the disease, as the dependent variable. So the new thing I’m going to show you here is that you don’t have to put in only two things. You can put in as many as you like. So if you put in three things, we’re going to generate one image and two categories. Now, if you’re saying I want three things, fast.ai doesn’t know which of those is the independent variable and which is the dependent variable. So the next thing you have to tell it is how many inputs there are, n_inp, the number of inputs. And so here I’ve said there’s one input. So that means this is the input, and therefore, by definition, the two categories will be the output.

[00:35:01]

Because remember, we’re trying to predict two things: the type of rice and the disease. Okay, this is the same as what we’ve seen before. To get our list of items, we’ll call get_image_files. Now here’s something we haven’t seen before: get_y is our labeling function. Normally we pass to get_y a single thing, such as the parent_label function, which looks at the name of the parent directory, which, remember, is how these images are structured, and that would tell us the label. But get_y can also take an array. And in this case, we want two different labels. One is the name of the parent directory, because that’s the disease. The second is the variety.

get_variety Function

So what’s get_variety? get_variety is a function. So let me explain how this function works. So we can create a data frame containing our training data that came from Kaggle.

[00:36:00]

So for each image, it tells us the disease and the variety. And what I did is something I haven’t shown before. In pandas, you can set one column to be the index. And when you do that, in this case image_id, it makes this data frame kind of like a dictionary. I can index into it by saying, tell me the row for this image. And to do that, you use the loc attribute, the location. So we want, in the data frame, the location of this image, and then you can also optionally say what column you want: this column. And so here’s this image, and here’s this column, and as you can see, it returns that thing. So hopefully now you can see it’s pretty easy for us to create a function that takes a path and returns the location in the data frame of the name of that file.
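Put together, the pieces being described look roughly like this sketch. The CSV column names follow the competition’s train.csv, path and trn_path stand for the competition and training paths, and the sizes are just the ones mentioned for this example:

```python
from fastai.vision.all import *
import pandas as pd

df = pd.read_csv(path/'train.csv', index_col='image_id')

def get_variety(p):
    # Look up the variety for an image file by its filename
    return df.loc[p.name, 'variety']

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock, CategoryBlock),  # one image in, two categories out
    n_inp=1,                                   # the first block is the (only) input
    get_items=get_image_files,
    get_y=[parent_label, get_variety],         # disease from the folder name, variety from the CSV
    splitter=RandomSplitter(0.2, seed=42),
    item_tfms=Resize(192, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75),
).dataloaders(trn_path)
```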

[00:37:10]

Because remember, these are the names of the files, for the variety column. So that’s our second get_y function. Okay. And then we’ve seen this before: randomly split the data into 20% and 80%. And so we just resize them all to 192, just for this example, and then use data augmentation to get us down to 128-square images, just for this example. And so that’s what we get when we say show_batch. We get what we just discussed. So now we need a model that predicts two things.

Model Predicting Two Things

How do we create a model that predicts two things?

[00:38:02]

Well, the key thing to realize is we never actually had a model that predicts two things. We had a model that predicts 10 things before. The 10 things we predicted is the probability of each disease. So we don’t actually now want a model that predicts two things. We want a model that predicts 20 things. The probability of each of the 10 diseases and the probability of each of the 10 varieties. So how could we do that? Well, let’s first of all try to just create the same disease model we had before with our new data loader. So this is going to be reasonably straightforward. The key thing to know is that since we told Fast.ai that there’s one input, and therefore by definition there’s two outputs, it’s going to pass to our metrics and to our loss functions three things instead of two.

[00:39:09]

Metrics and Loss Functions

The predictions from the model, and the disease, and the variety. So we can’t just use error_rate as our metric anymore, because error_rate takes two things. Instead, we have to create a function that takes three things and returns error_rate on the two things we want, which are the predictions from the model and the disease. Okay, so these are the predictions from the model, and this is the target. So that’s actually all we need to do to define a metric that’s going to work with our new data loader. This is not going to actually tell us anything about variety. First, we’re going to try to replicate something that can do just disease. So when we create our learner, we’ll pass in this new disease error function.
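As a sketch, that metric looks like this (error_rate is fastai’s built-in):

```python
def disease_err(inp, disease, variety):
    # Metrics now receive (predictions, disease target, variety target);
    # score the predictions against the disease target and ignore the variety
    return error_rate(inp, disease)
```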

[00:40:00]

Okay, so we’re halfway there. The other thing we’re going to need is to change our loss function. Now, we never actually talked about what loss function to use, and that’s because vision_learner guessed what loss function to use. vision_learner saw that our dependent variable was a single category, and it knows the best loss function that’s probably going to be the case for things with a single category, and it knows how big the category is. So it just didn’t bother us at all. It just said, okay, I’ll figure it out for you. So the only time we’ve provided our own loss function is when we were kind of doing linear models and neural nets from scratch. And we did, I think, mean squared error. We might also have done mean absolute error. Neither of those works when the dependent variable is a category. How would you use mean squared error or mean absolute error to say how close these 10 probability predictions were to this one correct answer?

[00:41:10]

So in this case, we have to use a different loss function.

Cross-Entropy Loss

We have to use something called cross-entropy loss. And this is actually the loss function that fast.ai picked for us before without us knowing. But now that we are having to pick it manually, I’m going to explain to you exactly what cross-entropy loss does. Okay? And, you know, these details are very important indeed. Like, remember I said at the start of this class, the stuff that happens in the middle of the model, you’re not going to have to care about much in your life, if ever. But the stuff that happens in the first layer and the last layer, including the loss function that sits between the last layer and the targets, you’re going to have to care about a lot. Right? This stuff comes up all the time. So you definitely want to know about cross-entropy loss.

[00:42:00]

And so I’m going to explain it using a spreadsheet. This spreadsheet’s in the course repo. And so let’s say you were predicting something like a kind of mini ImageNet thing, where you’re trying to predict whether an image is a cat, a dog, a plane, a fish, or a building. So you set up some model, whatever it is, a ConvNeXt model, or just a big bunch of linear layers connected up, or whatever. And initially you’ve got some random weights, and it spits out at the end five predictions. Right? So remember, to predict something with five categories, your model will spit out five probabilities. Now, it doesn’t initially spit out probabilities. There’s nothing making them probabilities. It just spits out five numbers. Could be negative, could be positive. Okay? So here’s the output of the model. So what we want to do is convert these into probabilities.

[00:43:07]

And so we do that in two steps. The first thing we do is we go exp, that’s e to the power of. We go e to the power of each of those things. Like so. Okay? And so here’s the mathematical formula we’re using. This is called the softmax, what we’re working through. We’re going to go through each of the categories. So these are our five categories. So here k is five. We’re going to go through each of our categories. And we’re going to go e to the power of the output. So zj is the output for the jth category. So here’s that. And then we’re going to sum them all together. Here it is, sum up together. Okay? So this is the denominator.

[00:44:01]

And then the numerator is just e to the power of the thing we care about. So this row. So the numerator is e to the power of cat on this row, e to the power of dog on this row, and so forth. Now if you think about it, since the denominator adds up all the e to the power ofs, then when we do each one divided by the sum, that means the sum of these will equal one, by definition. Right? And so now we have things that can be treated as probabilities. They’re all numbers between zero and one. Numbers that were bigger in the output will be bigger here. But there’s something else interesting, which is because we did e to the power of, it means that the bigger numbers will be like pushed up to numbers closer to one.

[00:45:00]

Like we’re saying, like, oh, really try to pick one thing as having most of the probability. Because we are trying to predict, you know, one thing. We’re trying to predict which one is it. And so this is called softmax. So sometimes you’ll see people complaining about the fact that their model, which they said, let’s say, is it a teddy bear or a grizzly bear or a black bear? And they feed it a picture of a cat. And they say, oh, the model’s wrong, because it predicted grizzly bear. But it’s not a grizzly bear. As you can see, there’s no way for this to predict anything other than the categories we’re giving it. We’re forcing it to that. Now we don’t, if you want, like, there’s something else you could do, which is you could actually have them not add up to one. Right? You could instead have something which simply says, what’s the probability it’s a cat? What’s the probability it’s a dog? What’s the totally separately? And they could add up to less than one.

[00:46:01]

And in that situation, you can, you know, or more than one, in which case you could have, like, more than one thing being true or zero things being true. But in this particular case, where we want to predict one and one thing only, we use softmax.

Softmax

The first part of the cross-entropy formula, the first part of the cross-entropy formula, in fact, let’s look it up: nn.CrossEntropyLoss. The first part of what cross-entropy loss in PyTorch does is to calculate the softmax. It’s actually the log of the softmax, but don’t worry about that too much. It’s just slightly faster to take the log. Okay. So now, for each one of our five things, we’ve got a probability.
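For example, here’s that same softmax calculation for one row of raw outputs, written in PyTorch rather than a spreadsheet (the numbers are made up):

```python
import torch

z = torch.tensor([-4.89, 2.60, 0.59, -2.07, -4.57])  # raw model outputs for cat, dog, plane, fish, building
probs = torch.exp(z) / torch.exp(z).sum()             # softmax: e to the power of each, divided by the sum
print(probs, probs.sum())                             # all between 0 and 1, and they sum to 1
print(torch.softmax(z, dim=0))                        # PyTorch's built-in gives the same thing
```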

[00:47:03]

The next step is the cross-entropy calculation, which is we take our five things, we’ve got our five probabilities, and then we’ve got our actuals. Now, the truth is the actual, you know, the five things would have indices, right? Zero, one, two, three, or four. And the actual turned out to be the number one. But what we tend to do is we think of it as being one-hot encoded, which is we put a one next to the thing for which it’s true, and a zero everywhere else. And so now we can compare these five numbers to these five numbers, and we would expect to have a smaller loss if the softmax was high where the actual is high. Okay. And so here’s how we calculate, this is the formula, the cross-entropy loss.

[00:48:00]

We sum up, so they switch to m this time for some reason, but it’s the same thing. We sum up across the five categories, so m is five. And for each one, we multiply the actual target value, so that’s zero. So here it is here, the actual target value. And we multiply that by the log of the predicted probability. And so, of course, for four of these, that value is zero, because, see here, yj equals zero by definition for all but one of them, because it’s one-hot encoded. So for the one where it’s not zero, we’ve got our actual times the log softmax. Okay.

[00:49:00]

Cross-Entropy Loss Calculation

And so now actually you can see why PyTorch prefers to use log softmax, because then it kind of skips over having to do this log at all. So this equation looks slightly frightening, but when you think about it, all it’s actually doing is finding the probability for the one that is one, and taking its log. Right? It’s kind of weird doing it as a sum, but in math it can be a little bit tricky to say, oh, look this up in an array, which is basically all it’s doing. But yeah, basically, at least in this case of a single result with softmax, this is all it’s doing. It’s finding the 0.87 where the one-hot target is 1, taking its log, and then finally the negative. So that is what cross-entropy loss does. We add that together for every row.
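Continuing the same made-up numbers, the cross-entropy loss for one row whose true class is, say, index 1 (dog) is just minus the log of that class’s softmax probability, which is exactly what PyTorch computes:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-4.89, 2.60, 0.59, -2.07, -4.57])   # raw outputs from the softmax example
probs = torch.softmax(z, dim=0)
target = torch.tensor(1)                               # the actual class index ("dog")

loss = -torch.log(probs[target])                       # minus the log of the true class's probability
print(loss, F.cross_entropy(z[None], target[None]))    # same number from PyTorch's built-in
```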

[00:50:03]

So here’s what it looks like if we add it together over every row. Right? So n is the number of rows. And here’s a special case. This is called binary cross-entropy. What happens if we’re not predicting which of five things it is, but we’re just predicting, is it a cat? So in that case, if you look at this approach, you end up with this formula, which is identical to this formula, but for just two cases: you either are a cat or you’re not a cat. Right? And so if you’re not a cat, it’s one minus you are a cat. And same with the probability: you’ve got the probability you are a cat, and then not a cat is one minus that. So here’s this special case of binary cross-entropy. And now our rows represent rows of data. Okay? So each one of these is a different image, a different prediction.

[00:51:02]

And so for each one, I’m just predicting, are you a cat? And this is the actual. And so the actual “are you not a cat” is just one minus that. And so then these are the predictions that came out of the model. Again, we can use softmax or its binary equivalent. And so that will give you a prediction that you’re a cat, and the prediction that it’s not a cat is one minus that. And so here is each of the parts: yi times the log of p(yi). And here is, why did I subtract? That’s weird. Oh, because I’ve got minus on both, so I’ll just do it this way, it avoids parentheses. Yeah, minus the “are you not a cat” times the log of the prediction of “are you not a cat”.

[00:52:01]

And then we can add those together. And so that would be the binary cross-entropy loss of this data set of five cat or not cat images.
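The binary special case can be sketched the same way. The numbers here are made up, and note that the spreadsheet adds the per-row losses together while PyTorch’s built-in averages them by default:

```python
import torch
import torch.nn.functional as F

preds   = torch.tensor([0.98, 0.35, 0.71, 0.09, 0.60])  # predicted probability "is a cat" for five images
targets = torch.tensor([1.,   0.,   1.,   0.,   1.  ])  # actuals: cat (1) or not cat (0)

# -(y*log(p) + (1-y)*log(1-p)), averaged over the rows
bce = -(targets * preds.log() + (1 - targets) * (1 - preds).log()).mean()
print(bce, F.binary_cross_entropy(preds, targets))       # matches PyTorch's built-in
```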

Binary Cross-Entropy

Now, if you’ve got an eagle eye, you may have noticed that I am currently looking at the documentation for something called nn.CrossEntropyLoss, but over here I had something called F.cross_entropy. Basically, it turns out that all of the loss functions in PyTorch have two versions. There’s a version which is a class. This is a class, which you can instantiate, passing in various tweaks you might want. And there’s also a version which is just a function. And so if you don’t need any of these tweaks, you can just use the function.

[00:53:04]

Loss Function Versions

The functions live in a submodule, I can’t quite remember what it’s called, I think it might be torch.nn.functional, but everybody, including the PyTorch official docs, just calls it capital F. So that’s what this capital F refers to. So our loss, if we just care about disease: we’re going to be passed the three things, and we’re just going to calculate cross-entropy on our input versus disease. All right. So that’s all fine. So now, when we create a vision learner, you can’t rely on fast.ai to know what loss function to use, because we’ve got multiple targets. So you have to say, this is the loss function I want to use, these are the metrics I want to use. And the other thing you can’t rely on is that fast.ai no longer knows how many activations to create, because again, there’s more than one target. So you have to say the number of outputs to create at the last layer is 10.

[00:54:00]

So this is just saying, what’s the size of the last matrix? And once we’ve done that, we can train it, and we get basically the same kind of result as we always get, because this model at this point is identical to our previous ConvNeXt Small model. We’ve just done it in a slightly more roundabout way.
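That replication step might be sketched like this; the architecture name and training schedule are placeholders, and disease_err and dls are the metric and data loaders from above:

```python
from fastai.vision.all import *   # provides vision_learner, F (torch.nn.functional), etc.

def disease_loss(inp, disease, variety):
    # Ignore the variety target for now; score the predictions against the disease only
    return F.cross_entropy(inp, disease)

learn = vision_learner(dls, 'convnext_small_in22k',
                       loss_func=disease_loss,   # fastai can't guess the loss with two targets
                       metrics=disease_err,
                       n_out=10)                 # size of the final layer: 10 disease classes
learn.fine_tune(5, 0.01)
```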

Multi-Target Model

So finally, before our break, I’ll show you how to expand this now into a multi-target model. And the trick is actually very simple. And you might have almost got the idea of it when I talked about it earlier. Our vision learner now requires 20 outputs. We now need that last matrix to produce 20 activations, not 10. 10 of those activations are going to predict the disease, and 10 of the activations are going to predict the variety.

[00:55:00]

So you might then be asking, well, how does the model know what it’s meant to be predicting? And the answer is, with the loss function, you’re going to have to tell it. So, for example, disease loss: remember, it’s going to get the input, the disease, and the variety. The input is now going to have 20 columns in it. So we’re just going to decide that the first 10 columns are the prediction of what the disease is, the probability of each disease. So we can now pass to cross-entropy the first 10 columns and the disease target. So the way you read this: colon means every row, and then colon 10 means every column up to the 10th. So these are the first 10 columns. And that’s a loss function that just works on predicting disease using the first 10 columns.

[00:56:03]

For variety, we’ll use cross-entropy loss with the target of variety, and this time we’ll use the second 10 columns. So here’s column 10 onwards. So then the overall loss function is the sum of those two things: disease loss plus variety loss. And that’s actually it. That’s all the model needs. If you kind of think through the manual neural nets we’ve created, this loss function will be reduced when the first 10 columns are doing a good job of predicting the disease probabilities, and the second 10 columns are doing a good job of predicting the variety probabilities. And therefore the gradients will point in an appropriate direction, so that the coefficients will get better and better at using those columns for those purposes.
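A minimal sketch of those three loss functions:

```python
def disease_loss(inp, disease, variety):
    return F.cross_entropy(inp[:, :10], disease)   # every row, first 10 columns -> disease

def variety_loss(inp, disease, variety):
    return F.cross_entropy(inp[:, 10:], variety)   # every row, column 10 onwards -> variety

def combine_loss(inp, disease, variety):
    # Gets smaller only when both halves of the 20 outputs are doing their job
    return disease_loss(inp, disease, variety) + variety_loss(inp, disease, variety)
```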

[00:57:03]

Error Rate for Disease and Variety

It would be nice to see the error rate as well, for each of disease and variety. So we can call error_rate, passing in the first 10 columns and disease, and then for variety, the second 10 columns and variety. And we may as well also add the losses to the metrics. And so now, when we create our learner, we’re going to pass in as the loss function the combined loss, and as the metrics our list of all the metrics, and n_out equals 20. And now look what happens when we train. As well as telling us the overall training and valid loss, it also tells us the disease and variety error, and the disease and variety loss. And you can see our disease error is getting down to similar levels it was at before. It’s slightly less good, but it’s similar. It’s not surprising it’s slightly less good, because we’ve only given it the same number of epochs, and we’re now asking it to do more stuff, which is to learn to recognize what the rice variety looks like and also to recognize what the disease looks like.
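A sketch of those metrics and the multi-target learner, again with a placeholder architecture and schedule, and with disease_err redefined to slice out its columns:

```python
def disease_err(inp, disease, variety): return error_rate(inp[:, :10], disease)
def variety_err(inp, disease, variety): return error_rate(inp[:, 10:], variety)

metrics = [disease_err, variety_err, disease_loss, variety_loss]

learn = vision_learner(dls, 'convnext_small_in22k',
                       loss_func=combine_loss, metrics=metrics,
                       n_out=20)   # 10 disease activations plus 10 variety activations
learn.fine_tune(5, 0.01)
```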

[00:58:15]

Multi-Target Model Performance

Here’s the counterintuitive thing though. If we train it for longer, it may well turn out that this model, which is trying to predict two things, actually gets better at predicting disease than our disease-specific model. Why is that? Like, that sounds weird, right? Because we’re trying to get it to do more stuff, and it’s the same size model. Well, the reason is that quite often it’ll turn out that the kinds of features that help you recognize a variety of rice are also useful for recognizing the disease. You know, maybe there are certain textures, right? Or maybe some diseases impact different varieties in different ways.

[00:59:01]

So it’d be really helpful to know what variety it was. I haven’t tried training this for a long time, and I don’t know the answer. In this particular case, does a multi-target model do better than a single-target model at predicting disease? But I just wanted to let you know sometimes it does. For example, a few years ago, there was a Kaggle competition for recognizing the kinds of fish on a boat, and I remember we ended up doing a multi-target model where we tried to predict a second thing. I can’t even remember what it was. Maybe it was a type of boat or something, and it definitely turned out in that Kaggle competition that predicting two things helped you predict the type of fish better than predicting just the type of fish. So there’s at least, you know, there’s two reasons to learn about multi-target models. One is that sometimes you just want to be able to predict more than one thing. So this is useful. And the second is that sometimes this will actually be better at predicting just one thing than a just-one-thing model.

[01:00:03]

Reasons to Learn Multi-Target Models

And of course, the third reason is it really forced us to dig quite deeply into these loss functions and activations in a way we haven’t quite done before. So it’s okay. It’s absolutely okay if this is confusing. The way to make it not confusing is, well, the first thing I do is, like, go back to our earlier models where we did stuff by hand on, like, the Titanic data set and built our own architectures. And maybe you could try to build a model that predicts two things in the Titanic data set. Maybe you could try to predict both sex and survival or something like that, or class and survival. Because that’s going to kind of force you to look at it on very small data sets. And then the other thing I’d say is run this notebook and really experiment at trying to see what kind of outputs you get.

[01:01:08]

Like, actually look at the inputs and look at the outputs and look at the data loaders and so forth.

Break

All right. Let’s have a six-minute break. So I’ll see you back here at ten past seven.

Collaborative Filtering Deep Dive

Okay. Welcome back. Oh, before I continue, I very rudely forgot to mention that this very nice equation image here is from an article by Chris Said called Things That Confused Me About Cross-Entropy. It’s a very good article, so I recommend you check it out if you want to go a bit deeper there. There’s a link to it inside the spreadsheet. So the next notebook we’re going to be looking at is this one, called Collaborative Filtering Deep Dive.

[01:02:02]

Movie Lens Data Set

And this is going to cover our last of the four major application areas, collaborative filtering. And this is actually the first time I’m going to be presenting a chapter of the book largely without variation. Because this is one where I looked back at the chapter and I was like, oh, I can’t think of any way to improve this. So I thought I’ll just leave it as is. But we have put the whole chapter up on Kaggle. So that’s the way I’m going to be showing it to you. And so we’re going to be looking at a data set called the Movie Lens data set, which is a data set of movie ratings. And we’re going to grab a smaller version of it, 100,000 record version of it.

[01:03:03]

And it comes as a CSV file, which we can read in. But it’s not really a CSV file, it’s a TSV file. This here means a tab in Python. And these are the names of the columns. So here’s what it looks like. It’s got a user, a movie, a rating, and a timestamp. We’re not going to use the timestamp at all. So basically three columns we care about. This is a user ID. So maybe 196 is Jeremy, and maybe 186 is Rachel, and 22 is John, I don’t know. Maybe this movie is Return of the Jedi, and this one’s Casablanca, this one’s LA Confidential. And then this rating says, how did Jeremy feel about Return of the Jedi? He gave it a three out of five. That’s how we can read this data set. This kind of data is very common.

[01:04:04]

Collaborative Filtering Data

Any time you’ve got a user and a product or service, and you might not even have ratings, maybe just the fact that they bought that product. You could have a similar table with zeros and ones. So for example, Radek, who’s in the audience here, is now at NVIDIA doing basically just this, right? Recommendation systems. So recommendation systems, it’s a huge industry. And so what we’re learning today is a really key foundation of it. So these are the first few rows. This is not a particularly great way to see it. I prefer to kind of cross-tabulate it like that, like this. This is the same information. So for each movie, for each user, here’s the rating. So user 212, never watched movie 49.

[01:05:00]

Now if you’re wondering why there’s so few empty cells here, I actually grabbed the most watched movies and the most movie watching users for this particular sample matrix. So that’s why it’s particularly full. So yeah, so this is what kind of a collaborative filtering data set looks like when we cross-tabulate it.

Filling in the Gap

So how do we fill in this gap? So maybe user 212 is Nick and movie 49. What’s a movie you haven’t seen, Nick, and you’d quite like to, maybe not sure about it? The new Elvis movie. Baz Luhrmann, good choice. Australian director. Filmed in Queensland. Yeah. Okay. So that’s movie number 49. So is Nick going to like the new Elvis movie?

[01:06:01]

Predicting User Preferences

Well, to figure this out, what we could do, ideally, is we’d like to know for each movie, what kind of movie is it? Like, what are the kind of features of it? Is it like action-y, science fiction-y, dialogue-driven, critically acclaimed, you know? So let’s say, for example, we were trying to look at The Last Skywalker. Maybe that was the movie that Nick’s wondering about watching. And so if we had, like, three categories, being science fiction, action, or kind of classic old movies, we’d say The Last Skywalker is very science fiction. Let’s see, this is from like negative one to one. Pretty action, definitely not an old classic, or at least not yet. And so then maybe we could say, like, okay, well, maybe Nick’s tastes in movies are that he really likes science fiction, quite likes action movies, and doesn’t really like old classics.

[01:07:13]

Right? So then we could kind of match these up to see how much we think this user might like this movie. To calculate the match, we could just multiply the corresponding values, user one times The Last Skywalker, and add them up: 0.9 times 0.98, plus 0.8 times 0.9, plus negative 0.6 times negative 0.9. That’s going to give us a pretty high number, right? With a maximum of three. So that would suggest Nick probably would like The Last Skywalker. On the other hand, the movie Casablanca, we would say, is definitely not very science fiction, not really very action, definitely very much an old classic. So then we’d do exactly the same calculation and get this negative result here.
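In code, that matching calculation is just multiplying the corresponding values and adding them up. The Last Skywalker numbers are the ones just mentioned; the Casablanca ones are made up for illustration:

```python
import torch

last_skywalker = torch.tensor([0.98, 0.9, -0.9])   # science fiction, action, old classic
user1          = torch.tensor([0.9,  0.8, -0.6])   # how much this user likes each of those
print((user1 * last_skywalker).sum())              # about 2.1, out of a maximum of 3: a good match

casablanca = torch.tensor([-0.99, -0.3, 0.8])      # made-up scores for Casablanca
print((user1 * casablanca).sum())                  # negative: probably not a good match
```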

[01:08:07]

So you probably wouldn’t like Casablanca. This thing here, when we multiply the corresponding parts of a vector together and add them up, is called a dot product in math. So this is the dot product of the user’s preferences and the type of movie. Now the problem is, we weren’t given that information. We know nothing about these users or about the movies. So what are we going to do?

Latent Factors

We want to try to create these factors without knowing ahead of time what they are. We wouldn’t even know what factors to create. What are the things that really matter when people decide what movies they want to watch? What we can do is create things called latent factors. Latent factors is this weird idea that we can say, I don’t know what things about movies matter to people, but there’s probably something.

[01:09:10]

And let’s just try like using SGD to find them.

Latent Factors in Excel

And we can do it in everybody’s favorite mathematical optimization software, Microsoft Excel. So here is that table. And what we can do, let’s head over here, actually, here’s that table. So what we could do is we could say, for each of those movies, so let’s say for movie 27, let’s assume there are five latent factors. I don’t know what they are. They’re just five latent factors. We’ll figure them out later. And for now, I certainly don’t know what the values of those five latent factors for movie 27 should be.

[01:10:05]

So we’re going to just chuck some random numbers in them. And we’re going to do the same thing for movie 49, pick another five random numbers, and the same thing for movie 57, pick another five numbers. And you might not be surprised to hear we’re going to do the same thing for each user. So for user 14, we’re going to pick five random numbers for them, and for user 29, we’ll pick five random numbers for them. And so the idea is that this number here, 0.19, is saying, if it was true, how strongly user 14 feels about whatever feature it is that movie 27 scores 0.71 on. So therefore, in here, we do the dot product. The details of why don’t matter too much, but, well, actually, you can figure this out from what we’ve said so far.

[01:11:01]

Matrix Product and Dot Product

If you go back to our definition of matrix product, you might notice that the matrix product of a row with a column is the same thing as a dot product. And so here in Excel, I have a row and a column, so therefore I say matrix multiply that by that. That gives us the dot product. So here’s the dot product of that by that, or the matrix multiply, given that row and column. The only other slight quirk here is that if the actual rating is empty, I’m just going to leave it blank, I’m going to set it to zero, actually. So here is everybody’s predicted rating for every movie.

Stochastic Gradient Descent

I say predicted, of course, these are currently random numbers, so they are terrible predictions. But when we have some way to predict things, and we start with terrible random predictions, we know how to make them better, don’t we?

[01:12:06]

We use stochastic gradient descent. Now to do that, we’re going to need a loss function. So that’s easy enough. We can just calculate the sum of x minus y squared divided by the count. That is the mean squared error. And if we take the square root, that is the root mean squared error. So here is the root mean squared error in Excel between these predictions and these actuals. And so now that we have a loss function, we can optimize it.
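
(As a quick sketch, that loss is just the following; the tensors here are made-up example ratings:)

```python
import torch

def rmse(preds, actuals):
    # mean of the squared differences, then the square root
    return ((preds - actuals) ** 2).mean().sqrt()

rmse(torch.tensor([3.5, 4.0]), torch.tensor([3.0, 5.0]))   # tensor(0.7906)
```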

Optimizing the Loss Function

Data, solver, set objective, this one here, by changing cells, these ones here, and these ones here, solve.

[01:13:00]

Okay, and initially our loss is 2.81, so we hope it's going to go down. And as it solves, not a great choice of background color, but it says 0.68. So this number is going down. Now, actually, in Excel it's not quite using stochastic gradient descent, because Excel doesn't know how to calculate gradients. There are optimization techniques that don't need analytic gradients; they approximate them numerically as they go, but that's a minor quirk. One thing you'll notice is it's doing it very, very slowly. There's not much data here, and it's still going. One reason is that, because it's not using analytic gradients, it's much slower; and the second is that Excel is much slower than PyTorch. Anyway, it's come up with an answer, and look at that, it's got to 0.42. So it's got a pretty good prediction. And so we can kind of get a sense of this.

[01:14:03]

For example, looking at the last three, user 14 likes, dislikes, likes. Let's see if somebody else is like that. Here's somebody else. This person likes, dislikes, likes. So based on our kind of approach, we're saying, okay, since they have the same feeling about these three movies, maybe they'll feel the same about these three movies. So this person likes all three of those movies, and this person likes two out of three of them. So, you know, this is the idea, right? It's as if somebody says to you, I like this movie, this movie, this movie, and you're like, oh, they like those movies too. What other movies do you like? And they'll say, oh, how about this? There's a good chance that you're going to like the same thing. That's the basis of collaborative filtering, okay? And mathematically, we call this matrix completion.

[01:15:02]

Matrix Completion

So this matrix is missing values. We just want to complete them. So the core of collaborative filtering is, it’s a matrix completion exercise. Can you grab a microphone?

[Audience Member]

Cosine Similarity and Correlation

My question is with the dot products, right? So if we think about the math of that for a minute: if we think about the cosine of the angle between the two vectors, that's going to roughly approximate the correlation. Is that essentially what's going on here, in one sense?

[Jeremy Howard]

So is the cosine of the angle between the vectors much the same thing as the dot product? The answer is yes. They’re the same once you normalize them. So, yep. Is that still on?

[Audience Member]

It’s correlation, what we’re doing here at scale as well.

[01:16:00]

[Jeremy Howard]

Yeah, you can, yeah, you can think of it that way.

PyTorch Implementation

Now, this looks pretty different to how PyTorch looks. PyTorch has things in rows, right? We've got a user, a movie, a rating; user, movie, rating, right? So how do we do the same kind of thing in PyTorch? So let's do the same kind of thing in Excel, but using the table in the same format that PyTorch has it, okay?

Excel Implementation with PyTorch Format

So to do that in Excel, the first thing I'm going to do is say, okay, I've got to look at user number 14, and I want to know what index, like how far down this list, 14 is, okay? So we'll just use MATCH; match means find the index. So this is user index one. And then what I'm going to do is say, for these five numbers, I basically want to find row one over here.

[01:17:05]

And in Excel, that's called OFFSET. So we're going to offset from here by one row. And so you can see here it is: 0.19, 0.63, et cetera, right? So here's the second user: 0.25, 0.83, et cetera. And we can do the same thing for movies, right? So movie 417 is index 14. That's going to be 0.75, 0.47, et cetera. And so same thing, right? But now we're going to offset from here by 14 to get this row, which is 0.75, 0.47, et cetera. And so the prediction now is the dot product, which is called SUMPRODUCT in Excel.

Dot Product in Excel

This is SUMPRODUCT of those two things.
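
(The MATCH / OFFSET / SUMPRODUCT steps map pretty directly onto array operations. A rough sketch with made-up IDs and random factors:)

```python
import torch

n_factors = 5
user_ids  = [14, 29, 72]          # raw user IDs, in the order they appear
movie_ids = [27, 49, 57, 417]     # raw movie IDs
user_factors  = torch.randn(len(user_ids),  n_factors)
movie_factors = torch.randn(len(movie_ids), n_factors)

u_idx = user_ids.index(14)        # MATCH: how far down the list is this ID?
m_idx = movie_ids.index(417)

u = user_factors[u_idx]           # OFFSET: grab that row of latent factors
m = movie_factors[m_idx]

pred = (u * m).sum()              # SUMPRODUCT: the dot product is the prediction
```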

[01:18:01]

So this is exactly the same as we had before, right? But when we put everything next to each other like this, we have to manually look up the index. And so then for each one, we can calculate the squared error: prediction minus rating, squared. And then we can add those all up. And if you remember, this is actually the same root mean squared error we had before we optimized, 2.81, because we've got the same numbers as before. So this is mathematically identical. So what's this weird word up here?

Embedding

Embedding. You've probably heard it before, and you might have come away with the impression that it's some very complex, fancy mathematical thing. But actually it turns out that it is just looking something up in an array. That is what an embedding is. So we call this an embedding matrix.

[01:19:03]

And these are our user embeddings and our movie embeddings. So let’s take a look at that in PyTorch.

Embedding in PyTorch

And you know, at this point, if you've heard about embeddings before, you might be thinking, that can't be it. And yeah, it's just as complex as the rectified linear unit, which turned out to be: replace negatives with zeros. Embedding actually means: look something up in an array. So there's a lot of jargon that we use as deep learning practitioners to try to make you as intimidated as possible, so that you don't wander into our territory and start winning our Kaggle competitions. And unfortunately, once you discover the simplicity of it, you might start to think that you can do it yourself. And then it turns out you can. So yeah, basically, that's what pretty much all of this jargon turns out to be.

[01:20:03]

Learning Latent Factors in PyTorch

So we’re going to try to learn these latent factors, which is exactly what we just did in Excel. We just learned the latent factors.

Data Loaders

All right. So if we're going to learn things in PyTorch, we're going to need data loaders. One thing I did is, there is actually a movies table as well with the names of the movies, so I merged that together with the ratings so that we've now got the user ID and the actual name of the movie. We don't need that for the model, obviously, but it's just going to make it a bit more fun to interpret later. So this is called ratings. We have something called CollabDataLoaders, so collaborative filtering data loaders, and we can get that from a data frame by passing in the data frame. And it expects a user column and an item column. So the user column is what it sounds like: the person that is rating this thing.

[01:21:03]

And the item column is the product or service that they're rating. In our case, the user column is called user, so we don't have to pass that in. And the item column is called title, so we do have to pass this in, because by default, the user column should be called user, and the item column should be called item. Give it a batch size. And as usual, we can call show_batch. And so here's a batch of our data loaders, or at least a bit of it. And so now that we've dealt with the names, we actually get to see the names, which is nice.
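
(Roughly what that looks like, assuming ratings is the merged DataFrame with user, title and rating columns as in the lecture's notebook:)

```python
from fastai.collab import CollabDataLoaders

# 'user' is already the default user column name; the item column is 'title'
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
```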

User and Movie Factors

All right, so now we’re going to create the user factors and movie factors, i.e. this one and this one.

[01:22:01]

So the number of rows of the movie factors will be equal to the number of movies, and the number of rows of the user factors will be equal to the number of users. And the number of columns will be whatever we want, however many factors we want to create. John?

[John]

Choosing the Number of Factors

This might be a pertinent time to jump in with a question. Any comments about choosing the number of factors?

[Jeremy Howard]

Not really. We have defaults that we use for embeddings in Fast.ai. It’s a very obscure formula, and people often ask me for the mathematical derivation of where it came from. But what actually happened is I wrote down how many factors I think is appropriate for different size categories on a piece of paper at a table, or actually in Excel, and then I fitted a function to that, and that’s the function. So it’s basically a mathematical function that fits my intuition about what works well.

[01:23:04]

But it seems to work pretty well. I’ve seen it used in lots of other places now. Lots of papers will be like, using Fast.ai’s rule of thumb for embedding sizes, here’s the formula.

[John]

Cool. Thank you.

[Jeremy Howard]

Training Speed

It's pretty fast to train these things, so you can try a few. So we've got to create these: the number of users is just the length of how many users there are, and the number of movies is the length of how many titles there are. So create a matrix of random numbers of users by five for the users, and movies by five for the movies. And now we need to look up the index of the movie in our movie latent factor matrix.
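
(A sketch of that setup, assuming dls is the CollabDataLoaders from the earlier sketch; dls.classes holds the distinct users and titles:)

```python
import torch

n_users   = len(dls.classes['user'])
n_movies  = len(dls.classes['title'])
n_factors = 5

# one row of random latent factors per user and per movie
user_factors  = torch.randn(n_users,  n_factors)
movie_factors = torch.randn(n_movies, n_factors)
```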

Embedding as Matrix Multiplication

The thing is, when we learned about deep learning, we learned that we do matrix multiplications, not look something up in an array.

[01:24:03]

So in Excel, we were saying OFFSET, which is to say, find element number 14 in the table. That's not a matrix multiply, so how does that work? Well, actually, it is. It actually is, for the same reason we talked about here, which is that finding the element number one thing in this list is actually the same as multiplying by a one-hot encoded matrix. So remember how, if we, let's just take off the log for a moment. Look, this has returned 0.87. And particularly if I take the negative off here, if I add this up, this is 0.87, which is the result of finding the index number one thing in this list.

[01:25:14]

But we didn't do it that way. We did this by taking the dot product of this and this. But that's actually the same thing, right? Taking the dot product of a one-hot encoded vector with something is the same as looking up that index in the vector. So that means that this exercise here of looking up the 14th thing is the same as doing a matrix multiply with a one-hot encoded vector.

One-Hot Encoded Vector

And we can see that here. This is how we create a one-hot encoded vector of length n users, in which the third element is set to 1 and everything else is 0.

[01:26:07]

And if we multiply that, so @ means, do you remember, matrix multiply in Python. So if we multiply that by our user factors, we get back this answer. And if we just ask for user factors number three, we get back the exact same answer. They're the same thing.
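
(Here's that equivalence as a little check you can run, reusing the n_users and user_factors names from the sketch above:)

```python
import torch

# a one-hot encoded vector of length n_users, with element 3 set to 1
one_hot_3 = torch.zeros(n_users)
one_hot_3[3] = 1.

via_matmul = one_hot_3 @ user_factors   # @ is matrix multiply
via_lookup = user_factors[3]            # just index into the array

assert torch.allclose(via_matmul, via_lookup)   # same answer either way
```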

Embedding as a Computational Shortcut

So you can think of an embedding as being a computational shortcut for multiplying something by a one-hot encoded vector. And so if you think back to what we did with dummy variables, this basically means embeddings are like a cool math trick for speeding up matrix multiplies with dummy variables. Not just speeding up: we never even have to create the dummy variables. We never have to create the one-hot encoded vectors. We can just look up in an array.

[01:27:12]

Collaborative Filtering Model

All right, so we’re now ready to build a collaborative filtering model.

Creating a Model from Scratch

And we’re going to create one from scratch. And as we’ve discussed before, in PyTorch, a model is a class. And so we briefly touched on this, but I’m going to touch on it again.

Creating a Class in Python

This is how we create a class in Python. You give it a name, and then you say how to initialize it, how to construct it. So in Python, remember, they call these things dunder, whatever, this is dunder init. These are magic methods that Python will call for you at certain times.

[01:28:03]

Magic Methods

The method called dunder init is called when you create an object of this class. So we could pass it a value, and now we set the attribute called a equal to that value. And so then later on, we could call a method called say, which will say hello to whatever you passed in here, and this is what it will say. So, for example, if you construct an object of type Example, passing in Sylvain, self.a now equals Sylvain. So if you then use the .say method, passing in "nice to meet you", x is now "nice to meet you", so it will say, hello Sylvain, nice to meet you. So that's kind of all you need to know about object-oriented programming in PyTorch to create a model.
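
(The class being described is roughly this; the name passed in and the exact greeting text are just illustrative:)

```python
class Example:
    def __init__(self, a): self.a = a
    def say(self, x): return f'Hello {self.a}, {x}.'

ex = Example('Sylvain')
ex.say('nice to meet you')   # 'Hello Sylvain, nice to meet you.'
```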

[01:29:02]

Object-Oriented Programming in PyTorch

Oh, there is one more thing we need to know, sorry, which is you can put something in parentheses after your class name, and that’s called the super class.

Super Class

It's basically going to give you some functionality for free. And if you create a model in PyTorch, you have to make Module your superclass.

Module Super Class

This is actually Fast.ai's version of Module, but it's nearly the same as PyTorch's. So when we create this DotProduct object, it's going to call dunder init.

dunder init Method

And we have to say, well, how many users are going to be in our model? And how many movies? And how many factors? And so we can now create an embedding of users by factors for the users, and an embedding of movies by factors for the movies. And so then PyTorch does something quite magic, which is that if you create a DotProduct object like so, then you can treat it like a function.

[01:30:11]

Treating a Model as a Function

You can call it and calculate values on it. And when you do that, this is really important to know, PyTorch is going to call a method called forward in your class.

forward Method

So this is where you put the calculation of your model. It has to be called forward. And it's going to be passed the object itself and the thing you're calculating on; in this case, the user and movie for a batch. So this is your batch of data. Each row will be one user and movie combination, and the columns will be users and movies. So we can grab the first column, right? So this is every row of the first column, and look it up in the user factors embedding to get our user embeddings.

[01:31:07]

So that is the same as doing this. Let’s say this is one mini batch. And then we do exactly the same thing for the second column, passing it into our movie factors to look up the movie embeddings. And then take the dot product. Dim equals one, because we’re summing across the columns for each row. We’re calculating a prediction for each row.
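
(Putting those pieces together, the model described here looks something like this sketch, using fastai's Module and Embedding:)

```python
from fastai.collab import *   # brings in Module, Embedding, Learner, etc.

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        # x is a batch: column 0 holds user indices, column 1 holds movie indices
        users  = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        # dot product per row: multiply and sum across the factor columns
        return (users * movies).sum(dim=1)
```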

Training the Model

So once we've got that, we can pass it to a Learner, passing in our data loaders and our model, and our loss function, mean squared error. And we can call fit, and away it goes.
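
(Something along these lines; the number of factors and the learning rate here are illustrative choices:)

```python
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
```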

[01:32:00]

And this, by the way, is running on CPU. These are very fast to run. So this is doing 100,000 rows in 10 seconds, which is a whole lot faster than our few dozen rows in Excel. And so you can see the loss going down. And so we’ve trained a model.

Model Limitations

It’s not going to be a great model. And one of the problems is that, let’s see if we can see this in our Excel one. Look at this one here. This prediction’s bigger than five. But nothing’s bigger than five. So that seems like a problem. We’re predicting things that are bigger than the highest possible number. And in fact, these are very much movie enthusiasts.

[01:33:01]

Movie Enthusiasts

Nobody gave anything a one. Yeah, nobody even gave anything a one here. So do you remember when we learned about sigmoid, the idea of squishing things between zero and one?

Sigmoid Function

We could still do stuff without a sigmoid, but when we added a sigmoid, it trained better, because the model didn't have to work so hard to get things into the right zone. Now, if you think about it, if you take something and put it through a sigmoid, and then multiply it by five, now you've got something that's going to be between zero and five, where you used to have something between zero and one. So we could do that. In fact, we could do that in Excel; I'll leave that as an exercise for the reader. Let's do it over here in PyTorch.

Sigmoid Range

So if we take the exact same class as before, and this time we call sigmoid range. And so sigmoid range is something which will take our prediction and then squash it into our range.

[01:34:08]

And by default, we'll use a range of zero through to 5.5. So it can't be smaller than zero, and can't be bigger than 5.5. Why didn't I use five? That's because a sigmoid can never hit one, right? And a sigmoid times five can never hit five. But some people do give movies a five, so you want to make it a bit bigger than our highest rating. So this one got a loss of 0.8628. Oh, it's not better. Isn't that always the way? All right, it didn't actually help; it doesn't always, so be it.
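
(The sigmoid-range version is the same class with the output squashed in forward, something like:)

```python
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        # squash the raw dot product into (0, 5.5) with a scaled, shifted sigmoid
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
```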

Improving the Model

Let’s keep trying to improve it. Let me show you something I noticed. Some of the users, like this one, this person here just loved movies.

[01:35:06]

User Bias

They give nearly everything a four or five. Their worst score is a three, right? This person, oh here’s a one, this person’s got much more range. Some things are twos, some ones, some fives. This person doesn’t seem to like movies very much considering how many they watch. Nothing gets a five. They’ve got discerning tastes, I guess. At the moment, we don’t have any way in our kind of formulation of this model to say this user tends to give low scores and this user tends to give high scores. There’s just nothing like that, right? But that would be very easy to add. Let’s add one more number to our five factors, just here, for each user.

Adding User Bias to the Model

And now, rather than doing just the matrix multiply, let’s add, oh it’s actually the top one, let’s add this number to it, h19.

[01:36:17]

And so for this one, let's add I19 to it. Yeah, so I've got it wrong. This one here, so this row here, we're going to add to each rating. And then we're going to do the same thing here.

Movie Bias

Each movie's now got an extra number here that, again, we're going to add, at 26. So it's our matrix multiplication plus what we call the bias: the user bias plus the movie bias. So effectively, that's like making it so we don't have an intercept of zero anymore.

[01:37:02]

Training with Bias

And so if we now train this model: Data, Solver, Solve. So previously we got to 0.42, okay, and we're going to let that go along for a while. And then let's also go back and look at the PyTorch version. So for PyTorch now, we're going to have a user bias, which is an embedding of n users by one, right; remember, there was just one number for each user. And the movie bias is an embedding of n movies, also by one. And so we can now look up the user embedding and the movie embedding, do the dot product, then look up the user bias and the movie bias and add them, and chuck that through the sigmoid.
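
(With the biases added, the model looks roughly like this:)

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.user_bias     = Embedding(n_users,  1)   # one extra number per user
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)   # one extra number per movie
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
```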

[01:38:03]

Let’s train that, see if we beat 0.865. Wow, we’re not training very well, are we?

Overfitting

Still not too great, 0.894. I think Excel normally does do better though. Let’s see. Okay, Excel. Oh, Excel’s done a lot better. It’s gone from 0.42 to 0.35. Okay, so what happened here? Why did it get worse? Well, look at this. The valid loss got better, and then it started getting worse again. So we think we might be overfitting, which, you know, we have got a lot of parameters in our embeddings.

Weight Decay

So how do we avoid overfitting?

[01:39:01]

So a classic way to avoid overfitting is to use something called weight decay, also known as L2 regularization, which sounds much more fancy.

Weight Decay in the Loss Function

What we're going to do is, when we compute the gradients, we're going to first add to our loss function the sum of the weights squared. Now this is something you should go back and add to your Titanic model, not that it's overfitting, but just to try it, right? So previously, our loss function has just been about the difference between our predictions and our actuals, and our gradients were based on the derivative of that with respect to the coefficients. But we're saying now, let's add the sum of the squares of the weights, times some small number.

[01:40:05]

So what would make that loss function go down? That loss function would go down if we reduced our weights, or I should say, if we reduced the magnitude of our weights. For example, if we reduced them all to zero, that part of the loss function would be zero, because the sum of zero squared is zero. Now the problem is, if our weights are all zero, our model doesn't do anything, right? So we'd have crappy predictions. So it would want to increase the weights so that it's actually predicting something useful. But if it increases the weights too much, then it starts overfitting. So how is it going to actually get the lowest possible value of the loss function? By finding the right mix: weights not too high, right?

[01:41:02]

But high enough to be useful at predicting. If there’s some parameter that’s not useful, for example, say we asked for five factors and we only need four, it can just set the weights for the fifth factor to zero, right? And then problem solved, right? It won’t be used to predict anything, but it also won’t contribute to our weight decay part.

Weight Decay in PyTorch

So previously, we had something calculated in the loss function, so now we’re going to do exactly the same thing, but we’re going to square the parameters, we’re going to sum them up, and we’re going to multiply them by some small number, like 0.01 or 0.001. And in fact, we don’t even need to do this, because remember, the whole purpose of the loss is to take its gradient, right?

[01:42:06]

The gradient of parameters squared is two times parameters. It's okay if you don't remember that from high school, but you can take my word for it: the gradient of y equals x squared is 2x. So actually, all we need to do is take our gradient and add the weight decay coefficient, 0.01 or whatever, times two times the parameters. And given this is just some number we get to pick, we may as well fold the two into it and just get rid of it. So when you call fit, you can pass in a wd parameter, which adds this times the parameters to the gradients for you. And so that's going to say to the model, please don't make the weights any bigger than they have to be.
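
(In code terms, the idea is something like this sketch, reusing the names from the earlier sketches; the wd and learning-rate values are just examples:)

```python
# What weight decay conceptually adds to the loss, for a list of parameter tensors:
def loss_with_wd(loss, params, wd):
    return loss + wd * sum((p ** 2).sum() for p in params)

# Since the gradient of p**2 is 2*p, this is the same as adding wd * 2 * p to each
# parameter's gradient -- and the 2 just gets folded into whatever wd you pick.

# In fastai you simply pass wd when you fit:
learn = Learner(dls, DotProductBias(n_users, n_movies, 50), loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```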

[01:43:06]

Reducing Overfitting

And yay, finally, our loss actually improved. Okay, and you can see it getting better and better. In fastai applications like vision, we try to set this for you appropriately, and we generally do a reasonably good job; the defaults are normally fine. But in things like tabular and collaborative filtering, we don't really know enough about your data to know what to use here, so you should just try a few things. Try a few multiples of 10: start at 0.1, and then divide by 10 a few times, you know, and just see which one gives you the best result.

Regularization

So this is called regularization. Regularization is about making your model no more complex than it has to be, right? It has a lower capacity.

[01:44:00]

And so the higher the weights, the more they're moving the model around, right? So we want to keep the weights down, but not so far down that they don't make good predictions. And so if the value of this is higher, it will keep the weights down more; it will reduce overfitting, but it will also reduce the capacity of your model to make good predictions. And if it's lower, it increases the capacity of the model and increases overfitting.

Next Time

All right. I’m going to take this bit for next time. Before we wrap up, John, are there any more questions?

[John]

Questions

Yeah, there are. There’s some from back at the start of the collaborative filtering. So we had a bit of a conversation a while back about the size of the embedding vectors.

[01:45:00]

Hyperparameter Search

And you talked about your fastai rule of thumb. So there was a question about whether anyone has ever done a kind of hyperparameter search and exploration.

[Jeremy Howard]

I mean, people will often do a hyperparameter search for their own model, for sure, but I haven't seen anybody come up with general rules other than my rule of thumb.

[John]

Right. So not productively to your knowledge.

[Jeremy Howard]

Productively for an individual model that somebody’s built.

[John]

And then there’s a question here from Zakir, which I didn’t quite wrap my head around.

Recommendation Systems Based on Averages

So Zakir, if you want to, maybe clarify in the chat as well. But can recommendation systems be built based on average ratings of users' experience rather than collaborative filtering?

[Jeremy Howard]

Not really. Right. I mean, if you’ve got lots of metadata, you could. Right. So if you’ve got lots of information about demographic data, about where the user’s from and what loyalty scheme results they’ve had and blah, blah, blah.

[01:46:06]

And then for products, there's metadata about that as well. Then sure, averages would be fine. But if all you've got is purchasing history, then you really want granular data. Otherwise, how could you say, they like this movie, this movie and this movie, therefore they might also like that movie? All you've got is, oh, they kind of like movies. There's just not enough information there. Yep. Great. That's about it.

Conclusion

Thanks. Okay, great. All right. Thanks everybody. See you next time for our last lesson.