Lesson 8 – Practical Deep Learning for Coders 2022

[Jeremy Howard]

Welcome to the last lesson of Part 1

So welcome to the last lesson of Part 1 of Practical Deep Learning for Coders. It's been a really fun time doing this course. And depending on when you're watching and listening to this, you may want to check the forums or the fast.ai website to see whether we have a Part 2 planned, which is going to be sometime towards the end of 2022. Or if it's already past that, then maybe there's even a Part 2 already on the website. So Part 2 goes a lot deeper than Part 1 technically, in terms of getting to the point that you should be able to read and implement research papers and deploy models in a very real-life situation.

[00:01:09]

Collaborative Filtering Notebook

So, yeah, last lesson, we started on the collaborative filtering notebook.

Creating your own embedding module

And we were looking at collaborative filtering. And this is where we got to, which is creating your own embedding module. And this is a very cool place to start the lesson, because you're going to learn a lot about what's really going on.

Importance of understanding the "Linear model and neural net from scratch" notebook

And it's really important, before you dig into this, to make sure that you're really comfortable with the "Linear model and neural net from scratch" notebook. So if parts of this are not totally clear, put it aside and redo that notebook.

[00:02:02]

Because what we’re looking at from here are kind of the abstractions that PyTorch and fast.ai add on top of functionality that we’ve built ourselves from scratch.

PyTorch’s handling of parameters

So if you remember, in the neural network from scratch that we built, we initialized a number of coefficients for a couple of different layers, and a bias term. And then, as the model trained, we updated those coefficients by going through each layer and subtracting the gradients times the learning rate. You probably noticed that in PyTorch we don't have to go to all that trouble, and I wanted to show you how PyTorch does this. In PyTorch, we don't have to keep track of what our coefficients or parameters or weights are.

[00:03:03]

PyTorch does that for us.

PyTorch’s parameter tracking mechanism

And the way it does that is it looks inside our module and tries to find anything that looks like a neural network parameter, or a tensor of neural network parameters, and it keeps track of them. And so here is a class we've created called T, which is a subclass of Module. And I've created one thing inside it, an attribute a, which just contains three ones. And so the idea is, maybe we're creating a module and we're initializing some parameter that we want to train. Now we can find out what trainable parameters, or just what parameters in general, PyTorch knows about in our model, by instantiating our model and then asking for its parameters, which we then turn into a list; or, in fastcore, we have a thing called L, which is like a fancy list that prints out the number of items in the list and shows you those items.

[00:04:12]

Now, in this case, when we create our object of type T and ask for its parameters, we get told there are zero tensors of parameters and a list with nothing in it. Now, why is that? We actually said we wanted to create a tensor with three ones in it. How would we make those parameters?
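Here's that experiment as a minimal sketch (plain PyTorch shown here; the notebook uses fastai's Module and L helpers):

```python
import torch
from torch import nn

class T(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.ones(3)  # a plain tensor attribute

t = T()
list(t.parameters())  # [] -- PyTorch didn't register any parameters
```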

Creating parameters using nn.Parameter

Well, the answer is that the way you tell PyTorch what your parameters are is you actually just have to put them inside a special object called an nn.Parameter. This thing almost doesn't do anything. In fact, last time I checked, it quite literally had almost no code in it. Sometimes these things change, but let's take a look.

[00:05:01]

Yeah, okay, so it's about a dozen or 20 lines of code, which does almost nothing. It's got a way of being copied, it's got a way of printing itself, it's got a way of saving itself, and it's got a way of being initialized. So a parameter hardly does anything. The key thing is, though, that when PyTorch checks to see which parameters it should update when it optimizes, it just looks for anything that's been wrapped in this parameter class. So if we do exactly the same thing as before, which is to set an attribute containing a tensor with three ones in it, but this time we wrap it in a parameter, we now get told: okay, there's one parameter tensor in this model, and it contains a tensor with three ones. And you can see it also, by default, assumes that we're going to require a gradient.
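The same module with the tensor wrapped in nn.Parameter (a sketch):

```python
class T(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(3))  # wrapped, so PyTorch tracks it

t = T()
list(t.parameters())
# [Parameter containing: tensor([1., 1., 1.], requires_grad=True)]
```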

[00:06:02]

PyTorch’s automatic gradient calculation

It's assuming that anything that's a parameter is something that you want to calculate gradients for. Now, most of the time we don't have to do this, because PyTorch provides lots of convenient things for us, such as what you've seen before: nn.Linear, which is something that also creates a tensor.

PyTorch's nn.Linear layer

So this would create a tensor of one by three, without a bias term in it. This is not being wrapped in an nn.Parameter, but that's okay: PyTorch knows that anything which is basically a layer in a neural net is going to be a parameter, so it automatically considers this a parameter. So here's exactly the same thing again: I construct my object of type T, I check for its parameters, and I can see there's one tensor of parameters, and there's our three things. And you'll notice that it's also automatically randomly initialized them, which, again, is generally what we want.

[00:07:04]

So PyTorch does go to some effort to try to make things easy for you.

Linear layer attributes

So this attribute a is a linear layer, and it's got a bunch of things in it. One of the things in it is the weights, and that's where you'll actually find the parameters: the weight attribute is of type Parameter. So a linear layer is something that contains attributes of type Parameter. Okay, so what we want to do is create something that works just like this did, which is something that creates a matrix which will be trained as we train the model.
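And the nn.Linear version of the same experiment (a sketch):

```python
class T(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(1, 3, bias=False)  # a layer: registered automatically

t = T()
list(t.parameters())  # one randomly initialized weight tensor
type(t.a.weight)      # torch.nn.parameter.Parameter
```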

[00:08:02]

Creating an embedding module from scratch

Okay, so an embedding is something which, yeah, it’s going to create a matrix of this by this, and it will be a parameter, and it’s something that, yeah, we need to be able to index into as we did here. And so, yeah, what is happening behind the scenes, you know, in PyTorch? It’s nice to be able to create these things ourselves from scratch because it means we really understand it.

Understanding PyTorch’s embedding layer

And so let's create that exact same module that we did last time, but this time we're going to use a function I've created called create_params. You pass in a size, such as, in this case, n_users by n_factors, and it's going to call torch.zeros to create a tensor of zeros of the size that you request, and then it's going to apply a normal random distribution,

[00:09:07]

so a Gaussian distribution of mean zero, standard deviation 0.01, to randomly initialize those, and it'll put the whole thing into an nn.Parameter. So this here is going to create an attribute called user_factors, which will be a parameter containing a tensor of normally distributed random numbers of this size.
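Here's create_params, as in the notebook (or very close to it):

```python
def create_params(size):
    # zeros of the requested size, filled in place with N(0, 0.01) noise,
    # then wrapped in nn.Parameter so it will be tracked and trained
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
```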

Initializing parameters with random numbers

And because that's a parameter, it's going to be available among the parameters of the module. So user_bias will be a vector of parameters, user_factors will be a matrix of parameters, movie_factors will be a matrix, n_movies by n_factors, movie_bias will be a vector of length n_movies, and this is the same as before.

[00:10:08]

Indexing into parameters

So now in the forward, we can do exactly what we did before. The thing is, when you put a tensor inside a parameter, it has all the exact same features that a tensor has. So for example, we can index into it. So this whole thing is identical to what we had before.
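Putting it together, the module looks roughly like this (following the notebook; Module and sigmoid_range are fastai's, which is why there's no explicit super().__init__ call):

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = create_params([n_users, n_factors])
        self.user_bias     = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias    = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors[x[:, 0]]   # index straight into the parameter
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)
```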

Replicating PyTorch’s embedding layer from scratch

And so that's actually, believe it or not, all that's required to replicate PyTorch's embedding layer from scratch. So let's run those and see if it works. And there it is, it's training. So we'll be able to have a look when this is done at, for example, model.movie_bias. And here it is, right?

[00:11:10]

It's a parameter containing a bunch of numbers that have been trained. And as we'd expect, it's got 1,665 things in it, because that's how many movies we have. So a question from Jonah Raphael was: does torch.zeros not produce all zeros?

torch.zeros and torch.normal_

Yes, torch.zeros does produce all zeros. But remember, a method that ends in an underscore modifies the tensor it's applied to in place. And so if you look up PyTorch's normal_, you'll see it fills the tensor with elements sampled from the normal distribution.

[00:12:03]

So this is actually modifying this tensor in place, and that's why we end up with something which isn't just zeros.

Interpreting the trained movie_bias parameter

Now, this is the bit I find really fun: we trained this model, but what did it do? How is it going about predicting who's going to like what movie? Well, one of the things that's happened is we've created this movie_bias parameter, which has been optimized. And what we could do is find which movie IDs have the highest numbers here, and the lowest numbers.

[00:13:00]

So I think this is going to start with the lowest. And then we can look inside our data loaders and grab the names of the movies for each of those five lowest numbers.
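That lookup is just a couple of lines (a sketch, assuming the learn and dls from the notebook):

```python
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]          # the five lowest bias values
[dls.classes['title'][i] for i in idxs]  # the corresponding movie titles
```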

Identifying movies with low and high movie_bias

And what's happened here? Well, we can see, broadly speaking, that it has printed out some pretty crappy movies. And why is that? Well, that's because when it does that matrix product that we saw in the Excel spreadsheet last week, it's trying to figure out who's going to like what movie based on previous movies people have enjoyed or not. And then it adds movie_bias, which can be positive or negative; that's a different number for each movie. So in order to do a good job of predicting whether you're going to like a movie or not, it has to know which movies are crap. And so the crap movies are going to end up with a very low movie_bias parameter.

[00:14:01]

And so we can actually find out not only which movies do people really not like, but which movies do people like less than one would expect, given the kind of movie that it is.

Understanding the meaning of movie_bias

So Lawnmower Man 2, for example, apparently is not only a crappy movie, but based on the kind of movie it is (a kind of high-tech pop sci-fi movie), people who like those kinds of movies still don't like Lawnmower Man 2. So that's what this means. So it's kind of nice that we can use a model not just to predict things, but to understand things about the data. So if we sort by descending, it'll give us the exact opposite. So here are movies that people enjoy, even when they don't normally enjoy that kind of movie.

[00:15:00]

So for example, L.A. Confidential, a classic kind of film noir detective movie with the Aussie Guy Pearce. Even if you don’t really like film noir detective movies, you might like this one. You know, Silence of the Lambs, classic kind of, I guess you’d say like horror kind of, not horror, is it a suspense movie? Even people who don’t normally like kind of serial killer suspense movies tend to like this one. Now, the other thing we can do is not just look at what’s happening in the bias.

Analyzing user bias

Oh, and by the way, we could do the same thing with users and find out like which user just loves movies, even the crappy ones, you know, just likes all movies and vice versa. But what about the other thing? We didn’t just have bias, we also had movie factors, which has got the number of movies as one axis and the number of factors as the other, and we passed in 50.

[00:16:07]

Visualizing movie factors using PCA

What’s in that huge matrix? Well, pretty hard to visualize such a huge matrix. And we’re not going to talk about the details, but you can do something called PCA, which stands for Principal Component Analysis, and that basically tries to compress those 50 columns down into three columns. And then we can draw a chart of the top two. And so this is PCA component number one, and this is PCA component number two, and here’s a bunch of movies. And this is a compressed view of these latent factors that it created. And you can see that they obviously have some kind of meaning, right? So over here towards the right, we’ve got kind of, you know, very pop mainstream kind of movies.

[00:17:04]

And over here on the left, we’ve got more of the kind of critically acclaimed gritty kind of movies. And then towards the top, we’ve got very kind of action-oriented and sci-fi movies. And then down towards the bottom, we’ve got very dialogue-driven movies. So remember, we didn’t program in any of these things, and we don’t have any data at all about what movie is what kind of movie.
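We're not covering how PCA works here, but as a sketch of the idea, using torch.pca_lowrank (the notebook uses fastai's pca helper instead):

```python
movie_w = learn.model.movie_factors.detach().cpu()  # (n_movies, 50)
U, S, V = torch.pca_lowrank(movie_w, q=3)           # top three principal components
coords = movie_w @ V[:, :2]                         # project onto the top two
# scatter-plot coords[:, 0] against coords[:, 1], labeled with movie titles
```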

The power of SGD in learning latent factors

But thanks to the magic of SGD, we just told it to please try and optimize these parameters. And the way it was able to predict who would like what movie was it had to figure out what kinds of movies are there or what kind of taste is there for each movie. So I think that’s pretty interesting. So this is called visualizing embeddings, and then this is visualizing the bias.

[00:18:11]

Using Fast.ai’s collaborative learner

We obviously would rather not do everything by hand like this. And fast.ai provides an application for this: collab_learner. And so we can create one, and this is going to look much the same as what we just had: we say how many latent factors we want, and what the y range is, to do the sigmoid and the multiply. And then we can do fit, and away it goes.
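In code (per the notebook, assuming from fastai.collab import * and the dls from earlier):

```python
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```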

Comparing manual and Fast.ai models

So let’s see how it does. All right, so it’s done a bit better than our manual one.

[00:19:03]

Let’s take a look at the model it created. The model looks very similar to what we created in terms of the parameters. You can see these are the two embeddings, and these are the two biases. And we can do exactly the same thing.

Analyzing item bias in the Fast.ai model

We can look in that model, and we can find the… You'll see it's not called movies; it's i for items. It's users and items. This is the item bias. So we can look at the item bias, grab the weights, sort, and we get a very similar result. In this case, it's even more confident that L.A. Confidential is a movie that you should probably try watching, even if you don't like those kinds of movies. And Titanic's right up there as well: even if you don't really like romance-y kinds of movies, you might like this one. Even if you don't like classic detective, you might like this one. We can have a look at the source code for collab_learner, and we can see that use_nn is false by default.
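The item-bias lookup just described might look like this (a sketch, using the attribute names from fastai's model):

```python
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]  # the highest item biases
[dls.classes['title'][i] for i in idxs]
```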

[00:20:14]

Examining the source code of collab_learner

So our model is going to be of this type, EmbeddingDotBias. So we can take a look at that. Here it is. And look, this does look very similar. Okay, it's creating an embedding, using the sizes we requested, for each of users by factors and items by factors, plus one each for the user and item biases. And then, in the forward, it's grabbing each thing from the embedding, doing the multiply, adding it up, and doing the sigmoid. So yeah, it looks exactly the same. Isn't that neat?

[00:21:00]

So you can see that what’s actually happening in real models is not, yeah, it’s not that weird or magic.

The usefulness of PCA in other areas

So Kurian is asking, is PCA useful in any other areas? And the answer is absolutely. And what I suggest you do, if you’re interested, is check out our Computational Linear Algebra course. It’s five years old now, but I mean, this is stuff which hasn’t changed for decades, really. And this will teach you all about things like PCA and stuff like that. It’s not nearly as directly practical as practical deep learning for coders, but it’s definitely very interesting.

[00:22:02]

And it’s the kind of thing which, if you want to go deeper, it can become pretty useful later along your path. Okay.

Finding similar movies based on embeddings

So here's something else interesting we can do. Let's grab the movie factors. So that's in our model: it's the item embedding, and it's the weight attribute that PyTorch creates. Okay. And now we can convert the movie Silence of the Lambs into its class ID. And we can do that with o2i, object-to-ID, on the titles. And so that's the movie index of Silence of the Lambs. And what we can do now is look through all of the movies in our latent factors and calculate how far apart each embedding vector is from this one.

[00:23:00]

And this cosine similarity is very similar to the Euclidean distance (the root sum squared of the differences), but it normalizes it. So it's basically the angle between the vectors. So this is going to calculate how similar each movie is to Silence of the Lambs based on these latent factors. And so then we can find which ID is the closest. Yeah. So based on this embedding distance, the closest is Dial M for Murder, which makes a lot of sense.
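Here's that calculation, close to the notebook's version:

```python
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
sims = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
dls.classes['title'][sims.argsort(descending=True)[1]]  # [0] is the movie itself
```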

The bootstrapping problem in collaborative filtering

I’m not going to discuss it today, but in the book, there’s also some discussion about what’s called the bootstrapping problem, which is the question of like, if you’ve got a new company or a new product, how would you get started with making recommendations given that you don’t have any previous history with which to make recommendations?

[00:24:16]

And that's a very interesting problem that you can read about in the book. Now, that's one way to do collaborative filtering, which is where we do that matrix-completion exercise using all those dot products.

Collaborative filtering using deep learning

There's a different way, however, which is we can use deep learning. And to do it with deep learning, what we could do is basically create our user and item embeddings as per usual.

[00:25:00]

And then we could create a sequential model.

Creating a sequential model for collaborative filtering

So the sequential model is just layers of a deep learning neural network, in order. And what we could do is, in forward, just concatenate the user and item embeddings together and then do a ReLU. So this is basically a single-hidden-layer neural network, and then a linear layer at the end to create a single output.

A simple neural network for collaborative filtering

So this is, well, just about the most simple neural net, exactly the same style as the one we created back in our neural net from scratch. This is exactly the same, but we're using PyTorch's functionality to do it more easily. So in the forward here, in exactly the same way as we have before, we'll look up the user embeddings and we'll look up the item embeddings.

[00:26:05]

And then this is new. This is where we concatenate those two things together and put it through our neural network and then finally do our sigmoid.
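Here's that model, close to the notebook's version (Module, Embedding, and sigmoid_range are fastai's):

```python
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),  # one hidden layer...
            nn.ReLU(),
            nn.Linear(n_act, 1))                        # ...then a single output
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        x = self.layers(torch.cat(embs, dim=1))  # concatenate, then the neural net
        return sigmoid_range(x, *self.y_range)
```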

Using fast.ai's get_emb_sz

Now, one thing different this time is that we're going to ask fast.ai to figure out how big our embeddings should be. And so fast.ai has something called get_emb_sz, and it just uses a rule of thumb that says that for 944 users, we recommend 74-factor embeddings, and for 1,665 movies (or is it the other way around? I can't remember), we recommend 102 factors for your embeddings. So that's what those sizes are. So now we can create that model, and we can pop it into a learner and fit in the usual way.
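In code (a sketch along the lines of the notebook):

```python
embs = get_emb_sz(dls)
embs                     # [(944, 74), (1665, 102)]
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
```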

[00:27:06]

Training a deep learning model for collaborative filtering

And so, rather than doing all that from scratch, what you can do is exactly the same thing that we've done before, which is to call collab_learner, but you can pass in the parameter use_nn=True.

Using collab_learner with use_nn=True

And you can then say how big you want each layer to be. So this is going to create a deep learning neural net with two hidden layers: the first will have 100 and the second will have 50. And then you can say fit, and away it goes. Okay. So here we've got 0.87, so this is doing less well than our dot-product version, which is not too surprising, because the dot-product version is really trying to take advantage of our understanding of the problem domain.
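In code (the layer sizes here are the book's example values):

```python
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```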

[00:28:05]

Combining dot product and neural network components

In practice nowadays, a lot of companies create a combined model that has a dot product component and also has a neural net component.

Incorporating metadata into collaborative filtering

The neural net component is particularly helpful if you've got metadata, for example, information about your users: when did they sign up? How old are they? What sex are they? Where are they from? Those are all things that you could concatenate in with your embeddings, and ditto with metadata about the movie: how old is it, what genre is it, and so forth. All right. So we've got a question from Jonah, which I think is interesting.

The issue of bias in collaborative filtering

And the question is, is there an issue where the bias components are overwhelmingly determined by the non-experts in a genre?

[00:29:08]

In general, actually, there’s a more general issue, which is in collaborative filtering recommendation systems, very often a small number of users or a small number of movies overwhelm everybody else. And the classic one is anime.

The anime example in collaborative filtering

A relatively small number of people watch anime and those groups of people watch a lot of anime. So in movie recommendations, like there’s a classic problem, which is every time people try to make a list of well-loved movies, all the top ones tend to be anime. And so you can imagine what’s happening in the matrix completion exercise is that there are, yeah, some users that just, you know, really watch this one genre of movie and they watch an awful lot of them.

[00:30:04]

So in general, you actually do have to be pretty careful about these subtle kinds of issues. And yeah, I won't go into details about how to deal with them, but they generally involve taking various kinds of ratios, or normalizing things, and so forth. All right. So that's collaborative filtering.

Embeddings beyond collaborative filtering

And I wanted to show you something interesting then about embeddings, which is that embeddings are not just for collaborative filtering. And in fact, if you’ve heard about embeddings before, you’ve probably heard about them in the context of natural language processing.

Embeddings in natural language processing

So you might’ve been wondering back when we did the hugging face transformers stuff, how did we go about, you know, using text as inputs to models?

[00:31:05]

Turning words into integers

And we talked about how you can turn words into integers. We make a list. So here's the poem: I am Sam. I am Daniel. I am Sam. Sam I am. That Sam I am, et cetera, et cetera. We can find a list of all the unique words in that poem and make this list here. And then we can give each of those words a unique ID, just arbitrarily. Well, actually, in this case it's alphabetical order, but it doesn't have to be. And so we kind of talked about that, and that's what we do with categories in general. But how do we turn those into lists of random numbers?

Creating an embedding matrix for NLP

And you might not be surprised to hear what we do is we create an embedding matrix. So here’s an embedding matrix containing four latent factors for each word in the vocab.

[00:32:03]

So here’s each word in the vocab and here’s the embedding matrix. So if we then want to present this poem to a neural net, then what we do is we list out our poem.

Using an embedding matrix to represent a poem

I do not like that Sam I am, do you like green eggs and ham, et cetera. Then for each word, we look it up. So in Excel, for example, we use MATCH. So that will find this word over here and find that it is word ID eight. And then we look up the eighth word's embedding.

[00:33:01]

So it's going to be 0.22, then 0.1, then 0.01. And here it is: 0.22, 0.1, 0.01, et cetera. So this is the embedding matrix we end up with for this poem.
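Here's the same lookup idea in code rather than Excel (a hypothetical sketch; the variable names are illustrative):

```python
import torch

text = 'i am sam i am daniel do you like green eggs and ham'
vocab = sorted(set(text.split()))               # unique words, alphabetical
word2id = {w: i for i, w in enumerate(vocab)}   # each word gets an integer ID
emb = torch.randn(len(vocab), 4) * 0.01         # 4 latent factors per word

tokens = 'do you like green eggs and ham'.split()
ids = torch.tensor([word2id[w] for w in tokens])
emb[ids]  # one 4-factor row per word: the poem as a matrix of numbers
```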

Interpreting embeddings in NLP models

And so, if you wanted to train or use a trained neural network on this poem, you'd basically turn it into this matrix of numbers. And so this is what an embedding matrix looks like in an NLP model. And you can do exactly the same things in terms of interpretation of an NLP model, by looking at both the bias factors and the latent factors in a word embedding matrix.

[00:34:00]

Common principles in neural network inputs

So hopefully you're getting the idea here that, for our different models, the inputs to them are based on a relatively small number of basic principles. And these principles are generally things like: look up something in an array. And then we know that inside the model, we're basically multiplying things together, adding them up, and replacing the negatives with zeros.

The simplicity of neural network operations

So hopefully you’re getting the idea that what’s going on inside a neural network is generally not that complicated. But it happens very quickly and at scale.

Embeddings in tabular analysis

Now, it’s not just collaborative filtering and NLP, but also tabular analysis.

[00:35:01]

Using neural networks for tabular data

So in chapter nine of the book, we talked about how random forests can be used for this; this is the thing where we're predicting the auction sale price of industrial heavy equipment like bulldozers. Instead of using a random forest, we can use a neural net. Now, in this data set, there are some continuous columns.

Separating continuous and categorical columns

And there are some categorical columns. Now, I'm not going to go into the details too much. But in short, we can separate out the continuous columns and categorical columns using cont_cat_split. And that will automatically find which is which based on their data types. And so in this case, it looks like, okay, the continuous column is the elapsed sale date.
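That one call does the split (a sketch, assuming the df and dep_var from the chapter 9 notebook):

```python
from fastai.tabular.all import *

cont, cat = cont_cat_split(df, max_card=9000, dep_var=dep_var)
```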

[00:36:06]

So I think that's the number of seconds or years or something since the start of the data set; it's a continuous variable. And then here are the categorical variables. So, for example, there are six different product sizes, two coupler systems, 5,059 model descriptions, six enclosures, 17 tire sizes, and so forth. So we can use fast.ai to basically say: okay, take that data frame, pass in the categorical and continuous variables, and create some random splits.

Creating a tabular learner

And we say what the dependent variable is. And we can create data loaders from that. And from that, we can create a tabular learner.
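Roughly like this, following the chapter 9 notebook (the y_range here is the notebook's range of log sale prices):

```python
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs, cat, cont, splits=splits, y_names=dep_var)
dls = to.dataloaders(1024)
learn = tabular_learner(dls, y_range=(8, 12), layers=[500, 250],
                        n_out=1, loss_func=F.mse_loss)
```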

[00:37:06]

The tabular model’s structure

And basically, what that's going to do is create a pretty regular multi-layer neural network, not that different to this one that we created by hand. And for each of the categorical variables, it's going to create an embedding. And so I can actually show you this, right.

Examining the tabular model’s code

So we're going to use tabular_learner to create the learner. And tabular_learner is 1, 2, 3, 4, 5, 6, 7, 8, 9 lines of code. And basically the main thing it does is create a TabularModel. And so then TabularModel, you're not going to understand all of it, but you might be surprised at how much. So a TabularModel is a module. We're going to be passing in how big each embedding is going to be.

[00:38:01]

Automatic embedding size calculation in tabular learner

And tabular_learner, what's it passing in? It's going to call get_emb_sz, just like we did manually before, but automatically. So that's how it gets its embedding sizes. And then it's going to create an embedding for each of those embedding sizes, from number of input categories to number of factors. Dropout, we're going to come back to later. BatchNorm, we won't do till part two. So then it's going to create a layer for each of the layers we want, which is going to contain a linear layer, followed by batchnorm, followed by dropout. It's going to add the sigmoid range we've talked about at the very end. And so the forward, this is the entire thing.

The forward pass in the tabular model

If there’s some embeddings, it’ll go through and get each of the embeddings using the same indexing approach we’ve used before.

[00:39:01]

It'll concatenate them all together, and then it'll run it through the layers of the neural net, which are these. So yeah, we don't know all of those details yet, but we know quite a few of them. So that's encouraging, hopefully. And once we've got that, we can do the standard lr_find and fit.
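That forward pass looks roughly like this in fastai's source (lightly simplified):

```python
def forward(self, x_cat, x_cont=None):
    if self.n_emb != 0:
        # index into each embedding, exactly as we've done by hand before
        x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]
        x = torch.cat(x, 1)            # concatenate all the embeddings
        x = self.emb_drop(x)
    if self.n_cont != 0:
        x_cont = self.bn_cont(x_cont)  # batchnorm the continuous variables
        x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
    return self.layers(x)              # the linear/batchnorm/dropout layers
```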

Training a tabular learner

Now, this exact dataset was used in a Kaggle competition.

Kaggle competition and the tabular model

And the third-place getter published a paper about their technique, and it's almost the exact one I'm showing you here.

[00:40:01]

Sorry, it actually wasn't this dataset; it was a different one. It was about predicting the amount of sales in different stores. But they used this same basic kind of technique.

The use of embeddings in a Kaggle competition

And one of the interesting things is that they used a lot less manual feature engineering than the other high-placed entries; they had a much simpler approach. And they published a paper about their approach. So this is the team, from this company. And they basically describe here exactly what I just showed you: these different embedding layers being concatenated together and then going through a couple of layers of a neural network.

[00:41:03]

Embedding layers as linear layers on one-hot encoded inputs

And it points out in the paper exactly what we learned in the last lesson, which is that embedding layers are exactly equivalent to linear layers on top of a one-hot encoded input. And yeah, they found that their technique worked really well.

Combining embeddings with other models

One of the interesting things they also showed is that you can create your neural net, get your trained embeddings, and then you can put those embeddings into a random forest or gradient boosted trees, and your mean absolute percentage error will dramatically improve. So you can actually combine random forests and embeddings, or gradient boosted trees and embeddings, which is really interesting. Now, what I really wanted to show you, though, is what they then did.

Visualizing embeddings for German regions

So as I said, this was about predicting the amount that different products would sell for at different shops around Germany.

[00:42:07]

And what they did was: one of their embedding matrices was embeddings by region. And then they did, I think, a PCA, a principal component analysis, of the embeddings for their German regions. And when they create a chart of them, you can see that the locations that are close together in the embedding matrix are the same locations that are close together in Germany.

Reconstructing geography through embeddings

So you can see here’s the blue ones, and here’s the blue ones. And again, it’s important to recognize that the data that they used had no information about the location of these places. The fact that they are close together geographically is something that was figured out as being something that actually helped it to predict sales.

[00:43:03]

And so in fact, they then did a plot showing each of these dots is a store. And it’s showing for each pair of stores, how far away is it in real life, in metric space? And then how far away is it in embedding space? And there’s this very strong correlation, right? So it’s, you know, it’s kind of reconstructed somehow, this kind of the kind of the geography of Germany by figuring out how people shop.

Visualizing embeddings for days of the week and months of the year

And similar for days of the week. So there was no information really about days of the week, but when they put it on the embedding matrix, the days of the week, Monday, Tuesday, Wednesday, close to each other, Thursday, Friday, close to each other, as you can see, Saturday and Sunday, close to each other. And ditto for months of the year, January, February, March, April, May, June.

[00:44:00]

So yeah, really interesting, cool stuff, I think.

Understanding the inner workings of neural networks

So that's what's actually going on inside a neural network.

Break

All right, let's take a 10-minute break, and I will see you back here at 7:10. All right, folks, this is something I think is really fun. We've looked at what goes into the start of a model, the input.

Reviewing the components of a neural network

We’ve learned about how they can be categories or embeddings. And embeddings are basically kind of one-hot encoded categories with a little compute trick, or they can just be continuous numbers.

[00:45:01]

We've learned about what comes out the other side, which is a bunch of activations: just a tensor of numbers, which we can use things like softmax to constrain to add up to one, and so forth. And we've looked at what can go in the middle, which is the matrix multipliers sandwiched together with rectified linear units. And I mentioned that there are other things that can go in the middle as well, but we haven't really talked about what those other things are.

Introducing convolutions

So I thought we might look at one of the most important and interesting versions of things that can go in the middle. What you'll see is that it turns out to actually just be another kind of matrix multiplication, which might not be obvious at first, but I'll explain. We're going to look at something called a convolution.

Convolutional neural networks

And convolutions are at the heart of a convolutional neural network.

[00:46:02]

So the first thing to realize is that a convolutional neural network is very, very similar to the neural networks we've seen so far. It's got inputs, and it's got things that are a lot like (or actually are) a form of matrix multiplication, sandwiched with activation functions, which can be rectified linear. But there's a particular thing which makes them very useful for computer vision.

Convolutions for computer vision

And I’m going to show you using this Excel spreadsheet that’s in our repo called conv-example.

Using MNIST for convolution example

And we're going to look at it using an image from MNIST. So MNIST is kind of the world's most famous computer vision data set, I think, because it was the first one which really showed image recognition being cracked. It's pretty small by today's standards. It's a data set of handwritten digits. Each one is 28 by 28 pixels.

[00:47:01]

But back in the mid-90s, Yann LeCun showed really practically useful performance on this data set, and as a result, convnets ended up being used in the American banking system for reading checks. So here's an example of one of those digits.

Recognizing horizontal and vertical edges in an image

This is a 7 that somebody drew, one of those ones with a stroke through it. And this is what it looks like. This is the image. And so this is just one of the images from MNIST, which I put into Excel. And what you see in the next column is a version of the image where the horizontal lines are being recognized, and another one where the vertical lines are being recognized. And if you think back to that Zeiler and Fergus paper that talked about what the layers of a neural net do, this is absolutely an example of something that we know the first layer of a neural network tends to learn how to do.

[00:48:12]

Now how did I do this?

Convolution as a sliding window operation

I did this using something called a convolution. And so what we’re going to do now is we’re going to zoom in to this Excel notebook. We’re going to keep zooming in, we’re going to keep zooming in. So take a look, keep an eye on this image, and you’ll see that once we zoom in enough, it’s actually just made of numbers, which as we discussed in the very first lesson, we saw how images are made of numbers. So here they are, right?

Representing an image as numbers

Here are the numbers between 0 and 1. And what I just did is I used a little trick: I used Microsoft Excel's conditional formatting to basically make the higher numbers more red.

[00:49:04]

So that's how I turned this Excel sheet into that picture. And I've just rounded them off to one decimal place, but they actually have more precision than that. And so, yeah, here is the image as numbers. And so let me show you how we went about creating this top edge detector.

Creating an edge detector using convolution

What we did was we created this formula. Don’t worry about the max, let’s focus on this. What it’s doing is, have a look at the colored in areas, it’s taking each of these cells and multiplying them by each of these cells, and then adding them up.

Convolution as a dot product

And then we do the rectified linear part, which is if that ends up less than zero, then make it zero.

[00:50:06]

So this is like a rectified linear unit, but it’s not doing the normal matrix product. It’s doing the equivalent of a dot product, but just on these nine cells and with just these nine weights. So you might not be surprised to hear that if I move now one to the right, then now it’s using the next nine cells. So if I move like to the right quite a bit and down quite a bit to here, it’s using these nine cells. So it’s still doing a dot product, right? Which as we know is a form of matrix multiplication. But it’s doing it in this way where it’s kind of taking advantage of the geometry of this situation, that the things that are close to each other are being multiplied by this consistent group of the same nine weights each time.

[00:51:06]

Convolution as a sliding window of dot products

Because there are actually 28 by 28 numbers here, right? 28 times 28 is 784. But we don't have 784 parameters; we only have nine parameters. And so this is called a convolution.

Kernel size in convolution

So a convolution is where you basically slide this kind of little three by three matrix across a bigger matrix and at each location you do a dot product of the corresponding elements of that three by three with the corresponding elements of this three by three matrix of coefficients. Now why does that create something that finds, as you see, top edges? Well it’s because of the particular way I constructed this three by three matrix. What I said was that all of the rows just above, so these ones, are going to get a one.

[00:52:09]

And all of the ones just below are going to get a minus one, and all of the ones in the middle are going to get a zero. So let's think about what happens somewhere like here, right? Here we're going to get 1 times 1, plus 1 times 1, plus 1 times 1, minus 1 times 1, minus 1 times 1, minus 1 times 1: we're going to get zero. But what about up here? Here we're going to get 1 times 1, plus 1 times 1, plus 1 times 1, and the minus ones do nothing, because the pixels under them are zero.

[00:53:01]

So we're going to get 3, the highest possible number, in the situation where these are all as black as possible (or in this case, as red as possible) and these are all white. And so that's only going to happen at a horizontal edge. The one underneath does exactly the same thing, with exactly the same formulas: a three-by-three sliding window, but this time with a different little mini matrix of coefficients, which is all ones going down and all minus ones going down. And so, for exactly the same reason, this will only be three in situations where these are all one and these are all zero. So you can think of a convolution as being a sliding window of little mini dot products of these little three-by-three matrices.
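Here's the whole sliding-window computation as a short sketch (stride 1, no padding), with the top-edge kernel just described:

```python
import torch

def conv2d_manual(img, kernel):
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = torch.zeros(oh, ow)
    for i in range(oh):
        for j in range(ow):
            # dot product of the kernel with the patch under the window
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out.clamp(min=0)  # the rectified linear part: max(0, x)

top_edge = torch.tensor([[ 1.,  1.,  1.],
                         [ 0.,  0.,  0.],
                         [-1., -1., -1.]])
# conv2d_manual(img, top_edge) on a 28x28 image gives a 26x26 map of top edges
```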

[00:54:12]

Varying kernel sizes

And they don't have to be three by three, right? We could just as easily have done five by five, and then we'd have a five-by-five matrix of coefficients, or whatever size you like. The size of this is called its kernel size; this is a three-by-three kernel for this convolution. So then, because this is deep learning, we just repeat these steps again and again and again.

Multiple convolutional layers

So this layer I'm calling conv1; it's the first convolutional layer. Conv2 is going to be a little bit different, because in conv1 we only had a single-channel input: just black-and-white grayscale, one channel.

[00:55:03]

Convolution with multiple channels

But now we've got two channels. Let's make it a little smaller so we can see better: we've got the horizontal edges channel and the vertical edges channel. And we'd have a similar thing in the first layer if it's color: we'd have a red channel, a green channel, and a blue channel. So now, this little mini matrix is called the filter.

Filter in convolution

Our filter now contains three by three by a depth of two; or, if you want to think of it another way, two three-by-three kernels, or one three-by-three-by-two kernel. And we basically do exactly the same thing, which is we're going to multiply each of these by each of these and sum them up.

[00:56:07]

Combining features from multiple channels

But then we do it for the second bit as well: we multiply each of these by each of these and sum them up. (I think I just picked some random numbers for the coefficients here.) So it's each of the red ones times each of the blue ones, that's here, plus each of the green ones times each of the mauve ones, that's here. So this first filter is being applied to the horizontal edge detector, and the second filter is being applied to the vertical edge detector. And as a result, we can end up with something that combines features of the two. And so then we can have a second channel over here, which is just a different bunch of convolutions for each of the two input channels.

[00:57:02]

Multiple convolutional channels

This one times this one; again, you can see the colors. So, once we get to the end (as I'll show you how in a moment), we'll end up with a single set of 10 activations, one per digit we're recognizing, zero to nine.

Final activations in a convolutional network

Or in this case, I think we could just create one, you know, maybe we’re just trying to recognize nothing but the number seven or not the number seven. So we could just have one activation. And then we would back propagate through this using SGD in the usual way. And that is going to end up optimizing these numbers. So in this case, I manually put in the numbers I knew would create edge detectors. In real life, you start with random numbers, and then you use SGD to optimize these parameters.

[00:58:00]

Okay, so there’s a few things we can do next.

Max pooling in convolutional networks

And I'm going to show you the way that was more common a few years ago, and then I'll explain some changes that have been made more recently. What happened a few years ago was we would then take these activations, which, as you can see, are now in a grid pattern, and we would do something called max pooling. And max pooling is kind of like a convolution; it's a sliding window. But this time, as the sliding window goes across (see here, we're up to here), we don't do a dot product over a filter.

Max pooling as a sliding window operation

But instead, we just take a maximum. See here: this is the maximum of these four numbers. And if we go across a little bit, this is the maximum of these four numbers; go across a bit, go across a bit, and so forth (oh, that goes off the edge). And you can see what happens; this is called two-by-two max pooling.

[00:59:08]

Two by two max pooling

So you can see what happens with a two by two max pooling, we end up losing half of our activations on each dimension. So we’re going to end up with only one quarter of the number of activations we used to have.

The purpose of max pooling

And that’s actually a good thing. Because if we keep on doing convolution, max pool, convolution, max pool, we’re going to get fewer and fewer and fewer activations until eventually, we’ll just have one left, which is what we want. That’s effectively what we used to do. But the other thing I mentioned is we didn’t normally keep going until there’s only one left.
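In PyTorch, that's one call (a sketch):

```python
import torch
import torch.nn.functional as F

acts = torch.rand(1, 1, 26, 26)  # batch x channels x height x width
pooled = F.max_pool2d(acts, 2)   # the max of each 2x2 window
pooled.shape                     # torch.Size([1, 1, 13, 13]): a quarter of the activations
```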

Dense layer in convolutional networks

What we used to then do is we’d basically say, okay, at some point, we’re going to take all of the activations that are left.

[01:00:02]

Dot product of max pooled activations

And we’re going to basically just do a dot product of those with a bunch of coefficients, not as a convolution, but just as a normal linear layer. And this is called the dense layer. And then we would add them all up. So we basically end up with our final big dot product of all of the max pooled activations by all of the weights. And we do that for each channel. And so that would give us our final activation. And as I say here, MNIST would actually have 10 activations. So you’d have a separate set of weights for each of the digits you’re predicting, and then softmax after that.

Modern convolutional network architecture

Okay, nowadays, we do things very slightly differently.

Stride two convolutions

Nowadays, we normally don't have max pool layers. But instead, what we normally do is, when we do our sliding window, like this one here, we don't step through every single position. Let's go back to see.

[01:01:09]

So when I go one to the right: currently we're starting in column G. If I go one to the right, the next one is column H. And if I go one to the right, the next one starts in column I. So you can see it's sliding the three-by-three window across one column at a time.

Skipping activations in stride two convolutions

Nowadays, what we tend to do instead is we generally skip one. So we would normally only look at every second. So we would, after doing column I, we would skip column J and we’d go straight to column K.

Reducing feature size with stride two convolutions

And that’s called a stride two convolution. We do that both across the rows and down the columns. And what that means is every time we do a convolution, we reduce our effective kind of feature size, grid size, by two on each axis. So it reduces it by four in total.
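A sketch of the same thing in PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 2, 26, 26)    # two input channels
w = torch.rand(4, 2, 3, 3)      # four 3x3 filters spanning both channels
out = F.conv2d(x, w, stride=2)  # slide the window two cells at a time
out.shape                       # torch.Size([1, 4, 12, 12]): grid roughly halved per axis
```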

[01:02:01]

So that’s basically instead of doing max pooling.

Stride two convolutions as an alternative to max pooling

And then the other thing that we do differently is nowadays, we don’t normally have a single dense layer at the end, a single matrix multiplier at the end.

Multiple stride two convolutions

But instead what we do, we generally keep doing stride two convolutions. So each one’s going to reduce the grid size by two by two. We keep going down until we’ve got about a seven by seven grid.

Average pooling in convolutional networks

And then we do a single pooling at the end. And we don’t normally do max pool nowadays.

Average pooling as a global prediction

Instead, we do an average pool. So we average the activations of each one of the seven by seven features. This is actually quite important to know, because if you think about what that means, it means that something like an ImageNet style image detector is going to end up with a seven by seven grid.

[01:03:01]

ImageNet style image detection

And let’s say it’s trying to say, is this a bear? And in each of the parts of the seven by seven grid, it’s basically saying, is there a bear in this part of the photo? Is there a bear in this part of the photo? Is there a bear in this part of the photo? And then it takes the average of those 49 seven by seven predictions to decide whether there’s a bear in the photo.

Average pooling for large objects

That works very well if it’s basically a photo of a bear, right? Because most, you know, if it’s, if the bear is big and takes up most of the frame, then most of those seven by seven bits are bits of a bear. On the other hand, if it’s a teeny tiny bear in the corner, then potentially only one of those 49 squares has a bear in it.

Average pooling for small objects

And even worse, if it’s like a picture of lots and lots of different things, only one of which is a bear. It could end up not being a great bear detector. And so this is where, like, the details of how we construct our model turn out to be important.

[01:04:00]

Choosing between max pooling and average pooling

And so, if you're trying to find just one part of a photo that has a small bear in it, you might decide to use maximum pooling instead of average pooling. Because max pooling will say: I think this is a picture of a bear if any one of those 49 bits of my grid has something that looks like a bear in it.

The importance of model details

So these are, you know, these are potentially important details which often get hand waved over. Although, you know, again, like, the key thing here is that this is happening right at the very end, right? That max pool or that average pool.

Concat pooling in Fast.ai

And actually, Fast.ai handles this for you. We do a special thing which we kind of independently invented. I think we did it first, which is we do both max pool and average pool and we concatenate them together. We call that concat pooling.

[01:05:01]

And that has since been reinvented in at least one paper. And so that means that you don’t have to think too much about it, because we’re going to try both for you, basically.
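fastai's version is essentially this (close to the actual AdaptiveConcatPool2d source):

```python
import torch
from torch import nn

class AdaptiveConcatPool2d(nn.Module):
    "Concatenate max pooling and average pooling over the final grid."
    def __init__(self, size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(size)
        self.mp = nn.AdaptiveMaxPool2d(size)

    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], dim=1)  # doubles the channels
```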

Convolution as matrix multiplication

So I mentioned that this is actually really just matrix multiplication. And to show you that, I'm going to show you some images created by Matt Kleinsmith, who did this back in, I think, our very first ever course; it might have been the first part two course.

Matt Kleinsmith's visualization of convolution

And he basically pointed out that in a certain way of thinking about it, it turns out that convolution is the same thing as a matrix multiplier. So I want to show you how he shows this. He basically says, okay, let’s take this 3×3 image and a 2×2 kernel containing the coefficients alpha, beta, gamma, delta.

[01:06:05]

Convolution as a sliding window operation

And so in this, as we slide the window over, each of the colors are multiplied together: red by red, plus green by green, plus orange by orange, plus blue by blue, gives you this. And so, to put it another way algebraically, p equals alpha times a, plus beta times b, etc. And so then, as we slide to this part, we're multiplying again: red by red, green by green, and so forth. So we can say q equals alpha times b, plus beta times c, etc. And so this is how we calculate a convolution using the approach we just described as a sliding window.

Convolution as a matrix multiplication

But here’s another way of thinking about it.

[01:07:02]

We could say, okay, we’ve got all these different things, a, b, c, d, e, f, g, h, j. Let’s put them all into a single vector. And then let’s create a single matrix that has alpha, alpha, alpha, alpha, beta, beta, beta, etc. And then if we do this matrix multiplied by this vector, we get this, with these gray zeros in the appropriate places, which gives us this, which is the same as this. And so this shows that a convolution is actually a special kind of matrix multiplication.
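Here's that construction numerically (a sketch with arbitrary coefficients), checked against a real convolution:

```python
import torch
import torch.nn.functional as F

a, b, g, d = 1., 2., 3., 4.           # the 2x2 kernel: alpha, beta, gamma, delta
img = torch.arange(9.).reshape(3, 3)  # the 3x3 image, flattened row-wise below

M = torch.tensor([[a, b, 0, g, d, 0, 0, 0, 0],   # fixed zeros, shared weights
                  [0, a, b, 0, g, d, 0, 0, 0],
                  [0, 0, 0, a, b, 0, g, d, 0],
                  [0, 0, 0, 0, a, b, 0, g, d]])

M @ img.flatten()                     # the four outputs p, q, r, s
k = torch.tensor([[a, b], [g, d]])
F.conv2d(img[None, None], k[None, None]).flatten()  # the same four numbers
```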

Convolution as a special case of matrix multiplication

It’s a matrix multiplication where there are some zeros that are fixed and some numbers that are forced to be the same. Now in practice, it’s going to be faster to do it this way, but it’s a useful kind of thing to think about, I think, that just to realize like, oh, it’s just another of these special types of matrix multiplications.

[01:08:22]

Dropout in convolutional networks

Okay. I think, well, let’s look at one more thing. Because there was one other thing that we saw, and I mentioned we would look at in the tabular model, which is called dropout. And I actually have this in my Excel spreadsheet.

Dropout in the conv-example spreadsheet

If you go to the conv example dropout page, you’ll see we’ve actually got a little bit more stuff here. We’ve got the same input as before, and the same first convolution as before, and the same second convolution as before.

[01:09:01]

Randomly deleting activations with dropout

And then we’ve got a bunch of random numbers.

Dropout mask

They're showing as zero or one, but that's just because they're rounded off; they're actually floats between zero and one. Then, way up here (I'll zoom in a bit), I've got a dropout factor. Let's change this, say, to 0.5. There we go. So over here, this is something that says: if the random number in the equivalent place is greater than 0.5, then one, otherwise zero.

[01:10:08]

And so here’s a whole bunch of ones and zeros. Now, this thing here is called a dropout mask.

Applying the dropout mask

Now, what happens is, over here, we multiply the dropout mask by our filtered image. And what that means is we end up with the same image we started with, but corrupted.

Corrupting activations with dropout

Random bits of it have been deleted. And based on the amount of dropout we use, so if we change it to, say, 0.2, not very much of it’s deleted at all. So it’s still very easy to recognize. Whereas if we use lots of dropouts, say 0.8, it’s almost impossible to see what the number was.

[01:11:03]

And then we use this as the input to the next layer.
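Here's the mask trick as a sketch in PyTorch:

```python
import torch

p = 0.5
acts = torch.rand(26, 26)                   # activations after a conv layer
mask = (torch.rand_like(acts) > p).float()  # the dropout mask: ones and zeros
dropped = acts * mask                       # randomly corrupted activations
# nn.Dropout(p) does the same thing at training time, and also divides by
# (1 - p) so the expected scale of the activations is unchanged
```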

The purpose of dropout

So that seems weird. Why would we delete some data at random from our processed image, from our activations after a layer of the convolutions? Well, the reason is that a human is able to look at this corrupted image and still recognize it’s a 7.

Dropout as data augmentation for activations

And the idea is that a computer should be able to as well. And if we randomly delete different bits of the activations each time, then the computer is forced to learn the underlying real representation, rather than overfitting. You can think of this as data augmentation, but it’s data augmentation not for the inputs, but data augmentation for the activations.

[01:12:04]

So this is called a dropout layer.

Dropout for avoiding overfitting

And so dropout layers are really helpful for avoiding overfitting. And you can decide how much you want to compromise between good generalization, so avoiding overfitting, versus getting something that works really well on the training data. And so the more dropout you use, the less good it’s going to be on the training data, but the better it ought to generalize.

The dropout paper by Geoffrey Hinton’s group

And so this comes from a paper by Geoffrey Hinton's group, quite a few years ago now. Ruslan Salakhutdinov is now at Apple, I think. And then Krizhevsky and Hinton went on to Google Brain.

[01:13:03]

And so you can see here they’ve got this picture of a fully connected neural network with two layers, just like the one we built. And here, look, they’re randomly deleting some of the activations, and all that’s left is these connections. A different bunch gets deleted for each batch. I thought this was an interesting point.

Dropout’s origin in a master’s thesis

So dropout, which is super important, was actually developed in a master’s thesis. And it was rejected from the main neural networks conference, then called NIPS, now called NeurIPS. So it ended up being disseminated through arXiv, which is a preprint server. And yes, it’s just been pointed out in our chat that Ilya (Sutskever) was one of the founders of OpenAI. I don’t know what happened to Nitish.

[01:14:01]

The importance of preprint servers

I think he went to Google Brain as well, maybe. Yeah, so, you know, peer review is a very fallible thing in both directions. And it’s great that we have preprint servers, so we can read stuff like this, even if reviewers decide it’s not worthy. It’s been one of the most important papers ever. Okay.

Reviewing the components of a neural network

Now, I think that’s given us a good tour. We’ve seen quite a few ways of dealing with the input to a neural network, and quite a few of the things that can happen in the middle of a neural network. For non-linearities, we’ve only talked about rectified linear units, which is this one here: zero if x is less than zero, or x otherwise.
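As code, that definition is a one-liner. A tiny sketch, assuming PyTorch (torch.relu and torch.nn.functional.relu are the built-in equivalents):

```python
import torch

def relu(x):
    # zero if x is less than zero, or x otherwise
    return torch.clamp(x, min=0.0)

print(relu(torch.tensor([-2.0, 3.0])))   # tensor([0., 3.])
```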

Different activation functions

These are some of the other activations you can use.

[01:15:00]

Don’t use the identity, the linear one, of course, because then you just end up with a linear model. But they’re all just different functions. I should mention: it turns out these don’t matter very much.

The importance of non-linearity in activation functions

Basically, pretty much any non-linearity works fine. So we don’t spend much time talking about activation functions, even in part two of the course, just a little bit. So, yeah, we understand the inputs.

Summarizing the components of a neural network

They can be one-hot encoded, or embeddings, which are a computational shortcut for one-hot encodings. Then there are sandwiched layers of matrix multiplications and activation functions. The matrix multiplications can sometimes be special cases, such as convolutions or embeddings. The output can go through some tweaking, such as the softmax. And then, of course, you’ve got the loss function, such as cross-entropy loss, or mean squared error, or mean absolute error.
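To tie that summary together, here is a compact sketch (assuming PyTorch; all the sizes are made up for illustration) of exactly those pieces: an embedding input, a sandwich of matrix multiplications and activation functions, and a cross-entropy loss, which applies the softmax for you:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Embedding(1000, 16),   # input: a lookup, i.e. a one-hot shortcut
    nn.Flatten(),             # flatten the 4 embedded tokens into one vector
    nn.Linear(16 * 4, 32),    # matrix multiplication...
    nn.ReLU(),                # ...activation function: the "sandwich"
    nn.Linear(32, 10),        # final layer: one output per class
)

x = torch.randint(0, 1000, (8, 4))     # a batch of 8 inputs, 4 tokens each
targets = torch.randint(0, 10, (8,))   # one class label per input
loss = nn.functional.cross_entropy(model(x), targets)  # softmax + NLL
```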

[01:16:03]

The simplicity of neural network operations

But, you know, there’s nothing too crazy going on in there. So I feel like we’ve got a good sense now of what goes inside a wide range of neural nets.

Understanding the inner workings of neural networks

You’re not going to see anything too weird from here. And we’ve also seen a wide range of applications.

What to do after completing Part 1

So, before you come back to do part two, you know, what now? And we’re going to have a little AMA session here.

AMA session

And, in fact, one of the questions was exactly that: what now? So this is quite good.

Radek’s book on meta-learning

One thing I strongly suggest is, if you’ve got this far, it’s probably worth investing your time in reading Radek’s book, Meta Learning.

[01:17:09]

And so meta-learning is very heavily based on the kind of teachings of fast.ai over the last few years, and is all about how to learn deep learning and learn pretty much anything.

The importance of practice and writing

Yeah, because, you know, you’ve got to this point, so you may as well know how to get to the next point as effectively as possible. And one of the main things you’ll see Radek talk about is practicing and writing. So if you’ve zipped through the videos on 2x and haven’t done any exercises, go back and watch the videos again.

[01:18:06]

Rewatching videos and coding along

You know, a lot of the best students end up watching them two or three times, probably more like three times. And actually go through and code as you watch, you know, and experiment.

Writing blog posts and participating in forums

You know, write posts, blog posts about what you’re doing. Spend time on the forum, both helping others and seeing other people’s answers to questions. Read the success stories on the forum and of people’s projects to get inspiration for things you could try.

The importance of community and study groups

One of the most important things to do is to get together with other people. For example, you could do a Zoom study group. In fact, on our Discord, which you can find through our forum, there are always study groups going on, or you can create your own, say a study group to go through the book together.

[01:19:03]

Building projects

Yeah, and of course, you know, build stuff. And sometimes it’s tricky to always be able to build stuff for work, because maybe you’re not quite in the right area, or they’re not quite ready to try out deep learning yet. But that’s okay. You know, build some hobby projects, build some stuff just for fun, or build some stuff that you’re passionate about. Yeah, so it’s really important to, to not just put the videos away and go away and do something else, because you’ll forget everything you’ve learned, and you won’t have practiced. So one of our community members went on to create an activation function, for example.

The Mish activation function

Which is Mish, which, as Tanishq has just reminded me on our forums, is now used in many of the state-of-the-art networks around the world, which is pretty cool.

[01:20:15]

And he’s now at Mila, I think, one of the top research labs in the world. I wonder how that’s doing.

Google Scholar citations for Mish

Let’s have a look; go to Google Scholar. Nice, 486 citations. It’s doing great. All right, let’s have a look at how our AMA topic is going, and pick out some of the highest-ranked questions.

AMA questions

Okay, so the first one is from Lucas. Actually, let me switch our view here.

[01:21:12]

So our first AMA is from Lucas, and Lucas asks, how do you stay motivated?

Staying motivated in the field

I often find myself overwhelmed in this field. There are so many new things coming up that I feel like I have to put so much energy just to keep my head above the waterline. Yeah, that’s a very interesting question. I mean, I think, Lucas, the important thing is to realize you don’t have to know everything, you know. In fact, nobody knows everything, and that’s okay. What people do is they take an interest in some area, and they follow that, and they try and do the best job they can of keeping up with some little sub-area.

[01:22:06]

Focusing on specific sub-areas

And if your little sub-area is too much to keep up with, pick a sub-sub-area. Yeah, there’s no need for it to be demotivating that there are a lot of people doing a lot of interesting work in a lot of different sub-fields. That’s cool, you know. It used to be kind of dull when there were basically only five labs in the world working on neural nets. And yeah, from time to time, take a dip into other areas that maybe you’re not following as closely. But when you’re just starting out, you’ll find that things are not changing that fast at all, really.

The pace of change in the field

It can kind of look that way because people are always putting out press releases about their new tweaks. But fundamentally, the stuff that is in the course now is not that different to what was in the course five years ago.

[01:23:01]

The foundations haven’t changed.

The enduring nature of fundamental concepts

And it’s not that different, in fact, to the convolutional neural network that Yann LeCun used on MNIST back in the 1990s. You know, the basic ideas I’ve described are forever: the way the inputs work, the sandwiches of matrix multiplications and activation functions, and the stuff you do to the final layer. Everything else is tweaks. And the more you learn about those basic ideas, the more you’ll recognize those tweaks as simple little tweaks that you’ll be able to quickly get your head around. So then Lucas goes on to ask, or rather to comment: another thing that constantly bothers me is that I feel the field is getting more and more skewed towards bigger and more computationally expensive models and huge amounts of data.

The trend towards larger models and data

I keep wondering if in some years from now, I would still be able to train reasonable models with a single GPU, or if everything is going to require a compute cluster.

[01:24:01]

Yeah, that’s a great question. I get that a lot. But interestingly, you know, I’ve been teaching people machine learning and data science stuff for nearly 30 years.

The history of this question in machine learning

And I’ve had a variation of this question throughout. And the reason is that engineers always want to push the envelope on the biggest computers they can find.

Engineers’ desire to push the envelope

You know, that’s just this, like, fun thing engineers love to do. And by definition, they’re going to get slightly better results than people doing exactly the same thing on smaller computers. So it always looks like, oh, you need big computers to be state of the art. But that’s actually never true, right?

Smarter solutions over bigger solutions

Because there’s always smarter ways to do things, not just bigger ways to do things.

[01:25:01]

Fast.ai’s DAWNBench success

And so, you know, when you look at fast.ai’s DAWNBench success, when we trained ImageNet faster than anybody had trained it before, on standard GPUs, you know, me and a bunch of students, that was not meant to happen. Google was working very hard with their TPU introduction to try to show how good they were. Intel was using, like, 256 PCs in parallel or something. But yeah, we used common sense and smarts and showed what can be done. You know, it’s also a case of picking the problems you solve.

Choosing the right problems to solve

So I probably would not go head to head against Codex and try to create code from English descriptions, because that’s a problem that probably does require very large neural nets and very large amounts of data.

[01:26:03]

But if you pick areas in different domains, you know, there’s still huge areas where much smaller models are still going to be state of the art.

Answering Lucas’s question

So hopefully that helped answer your question. Let’s see what else we’ve got here.

Daniel’s question about homeschooling

So Daniel has obviously been following my journey with teaching my daughter math; I homeschool my daughter. Daniel asks: how do you homeschool young children in science in general, and math in particular? Would you share your experiences by blogging or in lectures someday?

Using computers and tablets in homeschooling

Yeah, I could do that. So I actually spent quite a few months recently just reading research papers about education. So I probably do have a lot to talk about at some stage.

[01:27:04]

But yeah, broadly speaking, I lean into using computers and tablets a lot more than most people.

The benefits of educational apps

Because actually, there’s an awful lot of really great apps that are super compelling. They’re adaptive, so they go at the right speed for the student. And they’re fun. And I really like my daughter to have fun. You know, I really don’t like to force her to do things.

The importance of fun in learning

And for example, there’s a really cool app called DragonBox Algebra 5+, which teaches algebra to five-year-olds by using a really fun computer game involving helping dragon eggs to hatch.

DragonBox Algebra 5+ app

And it turns out that, yeah, the basic ideas of algebra are no more complex than the other basic ideas we cover in kindergarten math.

[01:28:00]

And all the parents I know of who have given their kids DragonBox Algebra 5+, their kids have successfully learned algebra.

The accessibility of algebra for young children

So that would be an example. But yeah, we should talk about this more at some point.

Discussing homeschooling further

All right, let’s see what else we’ve got here.

Farah’s question about walkthroughs

So Farah says, the walkthroughs have been a game changer for me. The knowledge and tips you shared in those sessions are skills required to become an effective machine learning practitioner and utilize fast.ai more effectively. Have you considered making the walkthroughs a more formal part of the course, doing a separate software engineering course, or continuing live coding sessions between part one and two?

Continuing live coding sessions

So yes, I am going to keep doing live coding sessions. At the moment, we’ve switched them to focusing specifically on APL. And then in a couple of weeks, they’re going to become fast.ai study groups.

[01:29:02]

And then after that, they’ll gradually turn back into more live coding sessions.

The focus of live coding sessions

But yeah, the thing I try to do in my live coding sessions or study groups, whatever they are, is definitely to show the foundational techniques that just make life easier as a coder or a data scientist.

Foundational techniques for coders and data scientists

When I say foundational, I mean, yeah, the stuff which you can reuse again and again and again, like learning regular expressions really well, or knowing how to use a VM, or understanding how to use the terminal and command line, you know, all that kind of stuff.

Planning a software engineering course

That stuff never goes out of style; it never gets old. And yeah, I do plan, at some point, to do a course really all about that stuff specifically. But for now, the best approach is to follow along with the live coding and so on.

[01:30:00]

Wade’s question about turning a model into a business

Okay, wgpubs, which is Wade, asks, how do you turn a model into a business? Specifically, how does a coder with little or no startup experience turn an ML-based Gradio prototype into a legitimate business venture?

Turning a Gradio prototype into a business venture

Okay, I plan to do a course about this at some point as well.

Planning a course on business ventures

So, you know, obviously, there isn’t a two-minute version to this.

The importance of solving a real problem

But the key thing with creating a legitimate business venture is to solve a legitimate problem: a problem that people need solved, and which they will pay you to solve. And so it’s important not to start with your fun Gradio prototype as the basis of your business, but instead to start with: here’s a problem I want to solve.

Choosing a problem you understand well

And generally speaking, you should try to pick a problem that you understand better than most people.

[01:31:04]

So it’s either a problem that you face day-to-day in your work, or in some hobby or passion that you have, or that your club has, or your local school has, or your spouse deals with in their workplace. It’s something where you understand that there’s something that doesn’t work as well as it ought to. In particular, it’s something where you think to yourself: you know, if they just used deep learning here, or some algorithm here, or some better compute here, that problem would go away.

The start of a business

And that’s the start of a business.

The Lean Startup by Eric Reese

And so then my friend Eric Ries wrote a book called The Lean Startup, where he describes what you do next, which is basically: you fake it.

Minimum viable product (MVP)

You create what he calls the minimum viable product.

[01:32:00]

You create something that solves that problem.

Creating a solution that solves the problem

It takes you as little time as possible to create. It could be very manual. It can be loss-making. That’s fine. You know, even the bit in the middle where you’re thinking, oh, there’s going to be a neural net here: it’s fine to launch without the neural net and do everything by hand.

Launching without a neural network

You’re just trying to find out if people are going to pay for this, and whether it’s actually useful. And then once you have, hopefully, confirmed that the need is real, that people will pay for it, and that you can solve the need, you can gradually make it less and less of a fake, getting the product more and more to where you want it to be.

Gradually improving the product

Okay. I don’t know how to pronounce the name.

miwojc’s question about productivity hacks

miwojc, whose name I don’t know how to pronounce, says: Jeremy, can you share some of your productivity hacks?

[01:33:02]

Working 24 hours a day

It may seem you work 24 hours a day.

Not working too hard

Okay. I certainly don’t do that. I think one of my main productivity hacks, actually, is not to work too hard, or rather, not to work too much.

Spending less time working than most people

I probably spend fewer hours a day working than most people, I would guess. But one thing I do is that I’ve spent at least half of every working day since I was about 18 learning or practicing something new.

Spending time learning and practicing

Could be a new language, could be a new algorithm, could be something I read about. And nearly all of that time, therefore, I’ve been doing that thing more slowly than I would if I just used something I already knew, which often drives my co-workers crazy, because they’re like, you know, why aren’t you focusing on getting that thing done?

[01:34:08]

The benefits of slow learning

But in the other 50% of the time, I’m constantly, you know, building up this kind of exponentially improving base of expertise in a wide range of areas.

Building a base of expertise

And so now I do find, you know, I can do things often orders of magnitude faster than people around me, or certainly many multiples faster than people around me, because I, you know, know a whole bunch of tools and skills and ideas which, you know, other people don’t necessarily know.

The importance of sleep, diet, and exercise

So like, I think that’s one thing that’s been helpful. And then another is, yeah, like trying to really not overdo things, like get good sleep and eat well and exercise well. And also I think it’s a case of like tenacity, you know, I’ve noticed a lot of people give up much earlier than I do.

[01:35:08]

Tenacity and finishing things nicely

So yeah, if you just keep going until something’s actually finished, then that’s going to put you in a small minority, to be honest. Most people don’t do that. And when I say finish, I mean finish something really nicely.

Coding related productivity hacks

I particularly like coding, so I try to do a lot of coding-related stuff.

Creating tools to make finishing things easier

So I create things like nbdev, and nbdev makes it much easier for me to finish something nicely. So in my chosen area, I’ve spent quite a bit of time trying to make sure it’s really easy for me to get out a blog post, get out a Python library, get out a notebook analysis, whatever. So yeah, I try to make the things I want to do easier, and then I’ll do them more.

[01:36:04]

Thanking the audience

So well, thank you, everybody. That’s been a lot of fun.

Appreciating the audience’s participation

Really appreciate you taking the time to go through this course with me.

Giving a like on YouTube

Yeah, if you enjoyed it, it would really help if you would give it a like on YouTube, because that helps other people find the course; it goes into the YouTube recommendation system.

Helping beginners on forums.fast.ai

And please do come and help other beginners on forums.fast.ai. Trying to teach other people is a great way to learn yourself.

Joining Part 2

And yeah, I hope you’ll join us in part two.

Farewell

Thanks, everybody, very much. I’ve really enjoyed this process, and I hope to get to meet more of you in person in the future. Bye.