Lesson 4: Practical Deep Learning for Coders 2022

[Jeremy Howard]

Introduction to NLP and Transformers

Hi everybody and welcome to Practical Deep Learning for Coders Lesson 4, which I think is the lesson that a lot of the regulars in the community have been most excited about because it’s where we’re going to get some totally new material, totally new topic we’ve never covered before. We’re going to cover natural language processing, NLP, and you’ll find there is indeed a chapter about that in the book, but we’re going to do it in a totally different way to how it’s done in the book. In the book we do NLP using the Fast.ai library, using recurrent neural networks, RNNs. Today we’re going to do something else, which is we’re going to do transformers, and we’re not even going to use the Fast.ai library at all, in fact. So, what we’re going to be doing today is we’re going to be fine-tuning a pre-trained NLP model using a library called Hugging Face Transformers.

[00:01:05]

Why use Hugging Face Transformers?

Now given this is the Fast.ai course, you might be wondering why we’d be using a different library other than Fast.ai. The reason is that I think that it’s really useful for everybody to have experience and practice of using more than one library, because you’ll get to see the same concepts applied in different ways, and I think that’s great for your understanding of what these concepts are.

Hugging Face Transformers: State of the art in NLP

Also, I really like the Hugging Face Transformers library. It’s absolutely the state of the art in NLP, and it’s well worth knowing. If you’re watching this on video, by the time you’re watching it we will probably have completed our integration of the Transformers library into Fast.ai, so it’s in the process of becoming the main NLP foundation for Fast.ai. So you’ll be able to combine Transformers and Fast.ai together.

[00:02:08]

So I think there’s a lot of benefits to this, and in the end you’re going to know how to do NLP in a really fantastic library.

Hugging Face Transformers: Lower level library

Now the other thing is, Hugging Face Transformers doesn’t have the same layered architecture that Fast.ai has, which means, particularly for beginners, the kind of high level, top tier API that you’ll be using most of the time is not as ready to go for beginners as you’re used to from Fast.ai. And so that’s actually, I think, a good thing. You’re up to Lesson 4, you know the basic idea now of how gradient descent works, and you know how parameters are learned as part of a flexible function. I think you’re ready to try using a somewhat lower level library that does a little bit less for you.

[00:03:02]

So it’s going to be a little bit more work, it’s a very well designed library and it’s still reasonably high level, but you’re going to learn to go a little bit deeper. And that’s kind of how the rest of the course in general is going to be on the whole, is we’re going to get a bit deeper and a bit deeper and a bit deeper.

Fine-tuning a pre-trained model

Now, first of all, let's talk about what we're going to be doing with fine-tuning a pre-trained model. We've talked about that in passing before, but we haven't really been able to describe it in any detail because you haven't had the foundations. Now you do. You played with these sliders last week, and hopefully you've all actually gone into this notebook and dragged them around and tried to get an intuition for how moving them up and down makes the loss go up and down and so forth. So imagine that your job was to move these sliders to get this fit as nice as possible, but when it was given to you, the person who gave it to you said, oh, actually slider A, that should be on 2.0. We know for sure.

[00:04:08]

And slider B, we think it’s like around 2.5. Slider C, we’ve got no idea. Now that would be pretty helpful, wouldn’t it? Because you could immediately start focusing on the one we have no idea about, get that in roughly the right spot, and then the one you’ve got a vague idea about, you could just tune it a little bit. And the one that they said was totally confident, you wouldn’t move at all, you’d probably tune these sliders really quickly. That’s what a pre-trained model is. A pre-trained model is a bunch of parameters that have already been fit, where some of them we’re already pretty confident of what they should be, and some of them we really have no idea at all. And so fine-tuning is the process of taking those ones we have no idea what they should be at all, and trying to get them right, and then moving the other ones a little bit.

[00:05:01]

ULMFiT: Pioneering fine-tuning

The idea of fine-tuning a pre-trained NLP model in this way was pioneered by an algorithm called ULMFiT, which was first presented actually in a fast.ai course, I think the very first fast.ai course. It was later turned into an academic paper by me in conjunction with a then PhD student named Sebastian Ruder, who's now one of the world's top NLP researchers, and went on to help inspire a huge change, a huge step improvement in NLP capabilities around the world, along with a number of other important innovations at the time. This is the basic process that ULMFiT described. Step one was to build something called a language model, using basically nearly all of Wikipedia. And what the language model did was it tried to predict the next word of a Wikipedia article.

[00:06:06]

In fact, every next word of every Wikipedia article. Doing that is very difficult. There are Wikipedia articles which would say things like, you know, the 17th prime number is dot dot dot, or the 40th president of the United States blah said at his residence blah that. Filling in these kinds of things requires understanding a lot about how language is structured and about the world and about math and so forth. So, to get good at being a language model, a neural network has to get good at a lot of things. It has to understand how language works at a reasonably good level and it needs to understand what it’s actually talking about and what is actually true and what is actually not true and the different ways in which things are expressed and so forth.

[00:07:10]

So, this was trained using a very similar approach to what we’ll be looking at for fine-tuning, but it started with random weights and at the end of it there was a model that could predict more than 30% of the time correctly what the next word of a Wikipedia article would be.

ULMFiT: Three steps

So in this particular case for the ULMFiT paper, we then took that, and the first task I did, actually for the fast.ai course back when I invented this, was to try and figure out whether IMDb movie reviews were positive or negative sentiment, did the person like the movie or not. So what I did was I created a second language model, so again the language model here is something that predicts the next word of a sentence, but rather than using Wikipedia, I took this pre-trained model that was trained on Wikipedia and I ran a few more epochs using IMDb movie reviews.

[00:08:09]

So it got very good at predicting the next word of an IMDb movie review. And then finally I took those weights and I fine-tuned them for the task of predicting whether or not a movie review was positive or negative sentiment. So those were the three steps. This is a particularly interesting approach because this very first model, in fact the first two models, if you think about it, don't require any labels. I didn't have to collect any kind of document categories or do any kind of surveys or collect anything; all I needed was the actual text of Wikipedia and the movie reviews themselves, because the label was simply the next word of the sentence. Now, we built ULMFiT using RNNs, recurrent neural networks, but at about the same time that we released it, a new kind of architecture, particularly useful for NLP at the time, was developed, called transformers.

[00:09:17]

Transformers: Advantage of modern accelerators

And transformers were particularly built because they can take really good advantage of modern accelerators like Google’s TPUs. They didn’t really kind of allow you to predict the next word of a sentence. It’s just not how they’re structured, for reasons we’ll talk about probably in part two of the course. So they threw away the idea of predicting the next word of a sentence, and then instead they did something just as good, and pretty clever, they took kind of chunks of Wikipedia or whatever text they’re looking at, and deleted at random a few words, and asked the model to predict what were the words that were deleted, essentially.

[00:10:03]

Masked language models

So it's a pretty similar idea. Other than that, the basic concept was the same as ULMFiT: they replaced our RNN approach with a transformer model, they replaced our language model approach with what's called a masked language model, but other than that the basic idea was the same. So today we're going to be looking at models using what's become the much more popular approach than ULMFiT, which is this transformers masked language model approach. Okay, John, do we have any questions?

Question: How to go from a model trained to predict the next word to a model for classification?

And I should mention we do have a professor from the University of Queensland, John Williams, joining us, who will be asking the highest voted questions from the community. What have you got, John?

[John Williams]

Yeah, thanks, Jeremy, and we might be jumping the gun here, I suspect this is where you’re going tonight, but we’ve got a good question here on the forum, which is, how do you go from a model that’s trained to predict the next word, to a model that can be used for classification?

[00:11:04]

[Jeremy Howard]

Visualizing layers of an ImageNet classification model

Sure. So, yeah, we will be getting into that in more detail, and in fact maybe a good place to start would be the next slide, to kind of give you a sense of this. You might remember in Lesson 1, we looked at this fantastic Zeiler and Fergus paper, where we looked at visualizations of the first layer of an ImageNet classification model. And Layer 1 had sets of weights that found diagonal edges, and here are some examples of bits of photos that those weights successfully matched, and opposite diagonal edges, and kind of color gradients, and here's some examples of bits of pictures that matched those. And then Layer 2 combined those, and now you know how those were combined, right? These were rectified linear units that were added together, okay, and then sets of those rectified linear units, the outputs of those, they're called activations, were then themselves

[00:12:02]

run through a matrix multiply and a rectified linear unit and added together, so that now you don't just have edge detectors: Layer 2 had corner detectors, and here's some examples of some corners that that corner detector successfully found. And remember, these were not engineered in any way, they just evolved from the gradient descent training process. Layer 2 had examples of circle detectors, as it turns out. And skipping a bit, by the time we got to Layer 5, we had bird and lizard eyeball detectors, and dog face detectors, and flower detectors, and so forth. Nowadays you'd have something like a ResNet-50, which is something you'd probably be training pretty regularly in this course, so you've got 50 layers, not just 5. Now, the later layers do things that are much more specific to the training task, which is actually predicting what it is that we're looking at.

[00:13:09]

The early layers, pretty unlikely you're going to need to change them much, as long as you're looking at some kind of natural photos, right, you're going to need edge detectors and gradient detectors. So what we do in the fine-tuning process is: there's actually one extra layer after this, which is the layer that actually says what is this, a dog or a cat or whatever. That last matrix multiply has one output per category you're predicting, and we delete it, we throw it away. So the model now ends with a matrix that's spitting out, it depends on the model, but generally a few hundred activations. And what we do is, as we'll learn more shortly in the coming lessons, we just stick a new random matrix on the end of that.

[00:14:03]

Fine-tuning process: Adding a new random matrix

And that’s what we initially train. So it learns to use these kinds of features to predict whatever it is you’re trying to predict. And then we gradually train all of those layers. So that’s basically how it’s done. It’s a bit hand-wavy, but we’ll, particularly in Part 2, actually build that from scratch ourselves. In fact, in this lesson, time permitting, we’re actually going to start going down the process of actually building a real-world deep neural net in Python. So we’ll be starting to actually make some progress towards that goal. Okay, so let’s jump into the notebook.
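Before we do, here's a rough sketch of that head-replacement idea in plain PyTorch. This is an illustration of the concept only, not fastai's actual fine_tune implementation; the 2-class output is a hypothetical dog-vs-cat example.

```python
from torch import nn
from torchvision.models import resnet50, ResNet50_Weights

# load a model pretrained on ImageNet
model = resnet50(weights=ResNet50_Weights.DEFAULT)

# throw away the old final layer and stick a new random matrix on the end
# (2 outputs here for a hypothetical dog-vs-cat task)
model.fc = nn.Linear(model.fc.in_features, 2)

# initially train only the new head...
for p in model.parameters(): p.requires_grad = False
for p in model.fc.parameters(): p.requires_grad = True
# ...then later unfreeze everything and gradually train all the layers
```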

Kaggle competition: US Patent Phrase-to-Phrase Matching

So we’re going to look at a Kaggle competition that’s actually on as I speak. And I created this notebook called Getting Started with NLP for Absolute Beginners. And so the competition is called the US Patent Phrase-to-Phrase Matching Competition.

[00:15:07]

Kaggle competitions: Real-world data and problems

And so I’m going to take you through a complete submission to this competition. And Kaggle competitions are interesting, particularly the ones that are not playground competitions, but the real competitions with real money applied. They’re interesting because this is an actual project that an actual organisation is prepared to invest money in getting solved using their actual data. So a lot of people are a bit dismissive of Kaggle competitions as being kind of like not very real, and it’s certainly true, you’re not worrying about stuff like productionising the model. But in terms of getting real data about a real problem that real organisations really care about and a very direct way to measure the accuracy of your solution, you can’t really get better than this. So this is a good place, a good competition to experiment with for trying NLP.

[00:16:03]

NLP classification: Sentiment analysis, author identification, legal discovery, etc.

Now as I mentioned here, probably the most widely useful application for NLP is classification. And as we’ve discussed in computer vision, classification refers to taking an object and trying to identify a category that object belongs to. So previously we’ve mainly been looking at images, today we’re going to be looking at documents. Now in NLP, when we say document, we don’t specifically mean a 20 page long essay. A document could be 3 or 4 words, or a document could be the entire encyclopedia. So a document is just an input to an NLP model that contains text. Now, classifying a document, so deciding what category a document belongs to, is a surprisingly rich thing to do.

[00:17:00]

There's all kinds of stuff you could do with that. So for example, we've already mentioned sentiment analysis. That's a classification task, we try to decide on the category, positive or negative sentiment. Author identification would be taking a document and trying to find the category of author. Legal discovery would be taking documents and putting them into categories according to in or out of scope for a court case. Triaging inbound emails would be putting them into categories of, you know, throw away, send to customer service, send to sales, etc. So classification is a very, very rich area. And for people interested in trying out NLP in real life, I would suggest classification would be the place I would start for looking for accessible, real world, useful problems you can solve right away. Now, the Kaggle competition does not immediately look like a classification competition.

[00:18:04]

Kaggle competition data: Anchor, target, context, and score

What it contains, let me show you some data. What it contains is data that looks like this. It has a thing they call anchor, a thing they call target, a thing they call context, and a score. Now, these are, I can’t remember the exact details, but I think these are from patents. And I think on the patents there are various, like, things they have to fill in, in the patent. And one of those things is called anchor, one of those things is called target. And in the competition, the goal is to come up with a model that automatically determines which anchor and target pairs are talking about the same thing. So a score of one here, wood article and wooden article, obviously talking about the same thing. A score of zero here, abatement and forest region, not talking about the same thing.

[00:19:02]

So the basic idea is we’re trying to guess the score. And it’s kind of a classification problem, kind of not.

Turning similarity problem into a classification problem

We’re basically trying to classify things into either these two things are the same or these two things aren’t the same. It’s kind of not, because we have not just one and zero, but also 0.25, 0.5, and 0.75. There’s also a column called context, which is, I believe, the category that this patent was filed in. And my understanding is that whether the anchor and the target count as similar or not depends on what the patent was filed under. So how could we take this and turn it into something like a classification problem? So the suggestion I make here is that we could basically say, okay, let’s put some constant string like text1 or field1 before the first column, and then something else like text2 before the second column.

[00:20:11]

Maybe we should include the context as well, with something like text3 before it. And then try to choose a category of meaning similarity: different, similar, or identical. So we could basically concatenate those three pieces together, call that a document, and then try to train a model that can predict these categories. That would be an example of how we can take this, basically, similarity problem and turn it into something that looks like a classification problem.

Deep learning: Turning novel problems into familiar ones

And we tend to do this a lot in deep learning, is we kind of take problems that look a bit novel and different and turn them into a problem that looks like something we recognise. So on Kaggle, this is a larger data set that you’re going to need a GPU to run.

[00:21:05]

Kaggle: Using GPU

So you can click on the accelerator button and choose GPU to make sure that you’re using a GPU. If you click copy and edit on my document, I think that will happen for you automatically.

Paperspace: Alternative to Kaggle

Personally, I like using things like Paperspace generally better than Kaggle. Like, Kaggle's pretty good, but you only get 30 hours a week of GPU time and the notebook editor for me is not as good as a real JupyterLab environment. So there's some information here I won't go through, but it basically describes how you can download stuff to Paperspace or your own computer as well if you want to. So I basically create this little boolean, always called isKaggle in my notebooks, which is going to be true if it's running on Kaggle and false otherwise, and for any little changes I need to make, I'll say if isKaggle and put those changes there.

[00:22:05]

Downloading data and installing packages

So you can see here: if I'm not on Kaggle and I don't have the data yet, then download it. And Kaggle has a little API which is quite handy for doing stuff like downloading data, uploading notebooks, and submitting to competitions. If we are on Kaggle, then the data's already going to be there for us, which is actually a good reason for beginners to use Kaggle, as you don't have to worry about grabbing the data at all. It's sitting there for you as soon as you open the notebook. Kaggle has a lot of Python packages installed, but not necessarily all the ones you want. At the point I wrote this, they didn't have Hugging Face's datasets package for some reason, so you can always just install stuff. So you might remember the exclamation mark means this is not a Python command, but a shell command, a bash command. But it's quite neat, you can even put bash commands inside Python conditionals, so that's a pretty cool little trick in notebooks.
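A minimal sketch of that pattern. The environment-variable check is one common way to detect a Kaggle kernel, and is my assumption rather than something spelled out in the lesson.

```python
import os

# a common way to detect a Kaggle kernel (an assumption, not from the lesson text)
isKaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '') != ''

if isKaggle:
    # Kaggle images don't always ship every package we need; note the bash
    # command sitting happily inside a Python conditional
    !pip install -q datasets
```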

[00:23:09]

Notebook tricks: Bash commands and variables

Another cool little trick in notebooks is that if you do use a bash command, like ls, but you then want to insert the contents of a Python variable, just chuck it in curly braces. So I've got a Python variable called path, and I can go ls {path}, with path in curly braces, and that will ls the contents of the Python variable path. So there's another little trick for you. So when we ls that, we can see that there's some CSV files. So what I'm going to do is kind of take you through roughly the kind of process I go through when I first look at a competition.
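To make that trick concrete, here's a sketch; the path value is just an assumption about where the competition files were downloaded.

```python
from pathlib import Path

# the path value is an assumption about where the competition files live
path = Path('us-patent-phrase-to-phrase-matching')
!ls {path}   # the braces interpolate the Python variable into the shell command
```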

Understanding a dataset: CSV files and Pandas

So the first thing is, or any dataset indeed, what’s in it. Okay, so it’s got some CSV files. As well as looking at it here, the other thing I would do is I would go to the competition website, and if you go to data, a lot of people skip over this, which is a terrible idea because it actually tells you what the dependent variable means, what the different files are, what the columns are, and so forth.

[00:24:17]

So don't just rely on looking at the data itself, but look at the information that you're given about the data. So for CSV files, CSV files are comma-separated values. So they're just text files with a comma between each field. And we can read them using Pandas, which for some reason is always imported as pd.

Key libraries for data science: NumPy, matplotlib, Pandas, and PyTorch

Pandas is one of, I guess, probably like four key libraries that you have to know to do data science in Python. And specifically, those four libraries are NumPy, matplotlib, Pandas, and PyTorch.

[00:25:18]

So NumPy is what we use for basic kind of numerical programming, matplotlib we use for plotting, Pandas we use for tables of data, and PyTorch we use for deep learning. Those are all covered in a fantastic book by Wes McKinney, the author of Pandas, called Python for Data Analysis, the new edition of which is actually available for free, I believe. So if you're not familiar with these libraries, just read the whole book. It doesn't take too long to get through and it's got lots of cool tips and it's very readable.

[00:26:04]

Importance of fundamental libraries

I do find a lot of people doing this course, often I see people kind of trying to jump ahead and want to be like, oh I want to know how to create a new architecture or build a speech recognition system or whatever, but it then turns out that they don't know how to use these fundamental libraries. So it's always good to be bold and be trying to build things, but do also take the time to make sure you finish reading the fast.ai book and read at least Wes McKinney's book. That would be enough to really give you all the basic knowledge you need, I think. So with Pandas we can read a CSV file and that creates something called a data frame, which is just a table of data, as you see.

Pandas data frame: Describing data

So now that we’ve got a data frame, we can see what we’re working with and when in Jupyter we just put the name of a variable containing a data frame, we get the first 5 rows, the last 5 rows and the size, so we’ve got 36,473 rows.

[00:27:07]

Okay, so another thing I like to use for understanding a data frame is the describe method. If you pass include equals object, that will describe basically all the string fields, the non-numeric fields. So in this case there's 4 of those, and you can see here that for the anchor field we looked at, there's actually only 733 unique values out of 36,000 rows, so there's lots of repetition. This is the most common one, it appears 152 times. And then context, we also see lots of repetition, there's 106 of those contexts. So this is a nice little method, we can see a lot about the data at a glance. And when I first saw this in this competition I thought, well, this is actually not that much language data when you think about it.
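A sketch of those two steps, assuming path points at the competition files as above:

```python
import pandas as pd

df = pd.read_csv(path/'train.csv')   # read the training data into a DataFrame
df.describe(include='object')        # count, unique, top, freq for the string columns
```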

[00:28:06]

Key features of the dataset: Short documents and repetition

Each document is very short, you know, 3 or 4 words really, and lots of it is repeated. So that’s like, as I’m looking through it I’m thinking like, what are some key features of this data set, and that would be something I’d be thinking, wow, we’ve got to do a lot with not very much unique data here. So here’s how we can just go ahead and create a single string like I described, which contains some kind of field separator, plus the context, the target, and the anchor.

Creating a single string with field separators

So we're going to pop that into a field called input. Something slightly weird in Pandas is there's two ways of referring to a column. You can use square brackets and a string to get the input column, or you can just treat it as an attribute. When you're setting it, you should always use the square-bracket form seen here.
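A sketch of that concatenation; the exact separator strings are arbitrary, as discussed in the questions later.

```python
# build the concatenated document out of context, target and anchor
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input.head()   # reading via the attribute form is fine
```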

[00:29:04]

When reading it, you can use either. I tend to use this one because it’s less typing. So you can see now we’ve got these concatenated rows. So head is the first few rows. So we’ve now got some documents to do NLP with. Now the problem is, as you know from the last lesson, neural networks work with numbers.

Neural networks work with numbers

We’re going to take some numbers and we’re going to multiply them by matrices. We’re going to replace the negatives with zeros and add them up, and we’re going to do that a few times. That’s our neural network, with some little wrinkles, but that’s the basic idea. So how on earth do we do that for these strings? So there’s basically two steps we’re going to take.

Tokenization and numericalization

The first step is to split each of these into tokens. Tokens are basically words.

[00:30:02]

We’re going to split it into words. There’s a few problems with splitting things into words. The first is that some languages, like Chinese, don’t have words, or at least certainly not space-separated words, and in fact in Chinese sometimes it’s a bit fuzzy to even say where a word begins and ends, and some words are kind of not even, the pieces are not next to each other.

Tokenization: Splitting into words or subwords

Another reason is that what we’re going to be doing is after we’ve split it into words, or something like words, we’re going to be getting a list of all of the unique words that appear, which is called the vocabulary. And every one of those unique words is going to get a number. As you’ll see later on, the bigger the vocabulary, the more memory is going to get used, the more data we’ll need to train. In general we don’t want a vocabulary to be too big.

[00:31:02]

So instead, nowadays people tend to tokenise into something called subwords, which is pieces of words. So I’ll show you what it looks like. So the process of turning it into smaller units, like words, is called tokenisation, and we call them tokens instead of words. A token is just like the more general concept of whatever we’re splitting it into. So we’re going to get Hugging Face Transformers and Hugging Face Datasets doing our work for us.

Hugging Face Transformers and Datasets

And so what we’re going to do is we’re going to turn our pandas dataframe into a Hugging Face Datasets dataset. It’s a bit confusing. PyTorch has a class called Dataset and Hugging Face has a class called Dataset, and they’re different things. So this is a Hugging Face Datasets dataset.

Turning a Pandas dataframe into a Hugging Face Datasets dataset

So we can turn a dataframe into a dataset just using the from_pandas method.
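A sketch, assuming the DataFrame df with its input column from above:

```python
from datasets import Dataset

# wrap the pandas DataFrame as a Hugging Face Datasets dataset
ds = Dataset.from_pandas(df)
```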

[00:32:01]

And so we’ve now got a dataset. So if we take a look, it just tells us it’s got these features, and remember input is the one we just created with the concatenated strings. And here’s those 36,000 rows. So now we’ve got to do these two things, tokenisation, which is to split each text up into tokens, and then numericalisation, which is to turn each token into its unique ID based on where it is in the vocabulary.

Tokenization and numericalization: Splitting into tokens and turning them into numbers

The vocabulary, remember, being the list of unique tokens. Now particularly in this stage, tokenisation, there’s a lot of little decisions that have to be made. The good news is you don’t have to make them, because whatever pre-trained model you used, the people that pre-trained it made some decisions, and you’re going to have to do exactly the same thing, otherwise you’ll end up with a different vocabulary to them, and that’s going to mess everything up.

[00:33:04]

So that means before you start tokenising, you have to decide on what model to use.

Hugging Face Model Hub: Hundreds of models

Hugging Face Transformers is a lot like timm. It has a library of, I believe, hundreds of models. I guess I shouldn't say Hugging Face Transformers, it's really the Hugging Face Model Hub: 44,000 models, so many more even than timm's image models. And so these models, they vary in a couple of ways. There's a variety of different architectures, just like in timm, but then something which is different to timm is that each of those architectures can be trained on different corpuses for solving different problems.

Pre-trained models: Patent models

So for example, I could type patent and see if there’s any pre-trained patent. There is. So there’s a patent, there’s a whole lot of pre-trained patent models. Isn’t that amazing? So quite often, thanks to the Hugging Face Model Hub, you can start your pre-trained model with something that’s actually pretty similar to what you actually want to do, or at least was trained on the same kind of documents.

[00:34:15]

Having said that, there are some just generally pretty good models that work for a lot of things a lot of the time.

DeBERTa v3: A good starting point for NLP

And DeBERTa v3 is certainly one of those. This is a very new area. NLP has been practically really effective for general users for only a year or two, whereas for computer vision it's been quite a while. So you'll find that a lot of things aren't quite as well bedded down.

[00:35:00]

I don't have a picture to show you of which models are the best or the fastest and the most accurate and whatever. A lot of this stuff is stuff that we're figuring out as a community using competitions like this, in fact. And this is one of the first NLP competitions, actually, in the modern NLP era. So we've been studying these competitions closely, and I can tell you that DeBERTa is actually a really good starting point for a lot of things, so that's why we've picked it. So we pick our model.

Model sizes: Small, medium, large

And just like in timm for images, our models are often going to come in a small, a medium, and a large. And of course we should start with small, because small is going to be faster to train, we're going to be able to do more iterations, and so forth. So at this point, remember, the only reason we picked our model is because we have to make sure we tokenize in the same way.

[00:36:06]

AutoTokenizer: Tokenizing the same way as the pre-trained model

To tell Transformers that we want to tokenize the same way that the people that built the model did, we use something called AutoTokenizer. It’s nothing fancy, it’s basically just a dictionary which says, oh, which model uses which tokenizer. So when we say AutoTokenizer from pre-trained, it will download the vocabulary and the details about how this particular model tokenized a dataset. So at this point, we can now take that tokenizer and pass a string to it.

Tokenizing a string

So if I pass the string, g’day folks, I’m Jeremy from fast.ai, you’ll see it’s kind of putting it into words, kind of not. So if you’ve ever wondered whether g’day is one word or two, it’s actually three tokens according to this tokenizer.

[00:37:01]

And I’m is three tokens. And fast.ai is three tokens. This punctuation is a token, so you kind of get the idea.
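Putting that together, here's a sketch of loading the tokenizer and trying it on that string; the checkpoint name is my assumption for "the small DeBERTa v3 model".

```python
from transformers import AutoTokenizer

# checkpoint name assumed to be the small DeBERTa v3 model on the Model Hub
model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)

tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")
```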

Tokenization: Underscores represent the start of a word

These underscores here, that represents the start of a word. So there's this concept that the start of a word is kind of part of the token. So if you see a capital I in the middle of a word versus the start of a word, that kind of means a different thing. So this is what happens when we tokenize this sentence using the tokenizer that the DeBERTa v3 developers used. So here's a less common, unless you're a big platypus fan like me, less common sentence. A platypus is an Ornithorhynchus anatinus. And so, okay, in this particular vocabulary, platypus got its own word, its own token, but Ornithorhynchus didn't. And I still remember grade one, for some reason our teacher got us all to learn how to spell Ornithorhynchus, so it's one of my favorite words.

[00:38:07]

So you can see here it’s been split into or, ne, tho, rynch, us.

Vocabulary: List of unique tokens

So every one of these tokens you see here is going to be in the vocabulary, right? The list of unique tokens that was created when this particular model, this pre-trained model was first trained. So somewhere in that list we’ll find underscore capital A, and it’ll have a number. And so that’s how we’ll be able to turn these into numbers. So this first process is called tokenization, and then the thing where we take these tokens and turn them into numbers is called numericalization.

Tokenization and numericalization: Turning tokens into numbers

So our data set, remember we put our string into the input field. So here’s a function that takes a document, grabs its input, and tokenizes it.

[00:39:04]

Okay, so we’ll call this our tokenization function. Tokenization can take a minute or two, so we may as well get all of our processes doing it at the same time to save some time.

Parallelizing tokenization with dataset.map

So if you use dataset.map, it will parallelize that process; you just pass in your function. Make sure you pass batched equals true so it can do a bunch at a time. And behind the scenes this is going through something called the tokenizers library, which is a pretty optimized Rust library that uses, you know, SIMD and parallel processing and so forth. So with batched equals true it'll be able to do more stuff at once. So look, it only took six seconds, so pretty fast. So now when we look at a row of our tokenized dataset, it's going to contain exactly the same input as our original dataset, and it's also going to contain a bunch of numbers.
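Something like this minimal sketch, assuming the input column and the tokz tokenizer from above:

```python
def tok_func(x):
    # tokenize the concatenated 'input' field created earlier
    return tokz(x['input'])

# batched=True lets the fast Rust tokenizers library process many rows at once
tok_ds = ds.map(tok_func, batched=True)
```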

[00:40:05]

Tokenized dataset: Input and numbers representing token positions

These numbers are the position in the vocabulary of each of the tokens in the string. So we've now successfully turned a string into a list of numbers. So that is a great first step. So we can see how this works. We can see, for example, that we've got 'of' as a separate word, so that's going to be an underscore-of in the vocabulary. We can grab the vocabulary, look up underscore-of, find that it's 265, and check here, yep, here it is, 265. So it's not rocket science, right, it's just looking stuff up in a dictionary to get the numbers. Okay, so that is the tokenization and numericalization necessary in NLP to turn our documents into numbers to allow us to put it into our model.
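That lookup, as a sketch; this assumes the fast tokenizer exposes its vocabulary via .vocab (get_vocab() is the equivalent method), and 265 is the id reported in the lesson.

```python
row = tok_ds[0]
print(row['input'])           # the original concatenated text
print(row['input_ids'][:10])  # the first few token ids
print(tokz.vocab['▁of'])      # 265 in the lesson's tokenizer
```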

[00:41:09]

Any questions so far, John?

[John Williams]

Question: How to choose keywords and order of fields?

Excuse me, yeah, thanks, Jeremy. So there’s a couple and this seems like a good time to throw them out, and it’s related to how you’ve formatted your input data into these sentences that you’ve just tokenized. So one question was really about how you choose those keywords and the order of the fields that you, so I guess just interested in an explanation, is it more art or science how you choose the keywords?

[Jeremy Howard]

Choosing keywords: Arbitrary and flexible

No, it’s arbitrary. I tried a few things, I tried X, I tried putting them backwards, it doesn’t matter. We just want some way, something that it can learn from. So if I just concatenated it without these headers before each one, it wouldn’t know where abatement of pollution ended and where abatement started.

[00:42:02]

So I did just something that it can learn from. This is the nice thing about neural nets, they’re so flexible. As long as you give it the information somehow, it doesn’t really matter how you give it the information as long as it’s there. I could have used punctuation, I could have put, I don’t know, one semicolon here and two here and three here. It’s not a big deal. At the level where you’re trying to get an extra half a percent to get up the leaderboard on Kaggle competition, you may find tweaking these things makes tiny differences, but in practice, you won’t generally find it matters too much.

[John Williams]

Question: Special handling for long fields?

Right, thank you. And I guess the second part of that, excuse me again, somebody’s asking if one of their fields was particularly long, say it was a thousand characters, is there any special handling required there? Do you need to re-inject those kind of special marker tokens?

[00:43:03]

Does it change if you’ve got much bigger fields that you’re trying to learn and query?

[Jeremy Howard]

Long documents and ULMFiT: No special consideration

Yeah. With ULMFiT, long documents require no special consideration. IMDb in fact has multi-thousand word movie reviews and it works great. To this day, ULMFiT is probably the best approach for reasonably quickly and easily using large documents. Otherwise, if you use transformer-based approaches, large documents are challenging.

Transformers and large documents: Challenging

Specifically, transformers basically have to do the whole document at once, whereas ULMFiT can split it into multiple pieces and read it gradually. And so that means you'll find that people trying to work with large documents using transformers tend to spend a lot of money on GPUs, because they need the big fancy ones with lots of memory.

[00:44:01]

So, yeah, generally speaking, I would say if you're trying to do stuff with documents of over 2,000 words, you might want to look at ULMFiT. Try transformers, see if it works for you, but I'd certainly try both. For under 2,000 words, transformers should be fine unless you've got nothing but a laptop GPU or something with not much memory.

Hugging Face Transformers: Expectations about data

So, Hugging Face Transformers has some, as I record this right now, somewhat obscure and not particularly well-documented expectations about your data that you kind of have to figure out. And one of those is that it expects that your target is a column called labels.

Target column: Labels

So once I figured that out, I just went and got our tokenized data set and renamed our score column to labels and everything started working.
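That rename is a one-liner with Hugging Face Datasets; a sketch, assuming the column names above:

```python
# Transformers expects the target column to be called 'labels'
tok_ds = tok_ds.rename_columns({'score': 'labels'})
```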

[00:45:02]

So, you know, I don't know if at some point they'll make this a bit more flexible, but for now it's probably best to just call your target labels and life will be easy. You might have seen back when we ls'd the path that there was another dataset there called test.csv. And if you look at it, it looks a lot like our training set, the other CSV that we've been working with, but it's missing the score, the labels.

Test set: Separate data for evaluating model generalization

This is called a test set. And so we’re going to talk a little bit about that now because my claim here is that perhaps the most important idea in machine learning is the idea of having separate training, validation, and test data sets.

[00:46:03]

Overfitting: Identifying and controlling

So test and validation sets are all about identifying and controlling for something called overfitting. And we’re going to try and learn about this through example. So this is the same information that’s in that Kaggle notebook, I’ve just put it on some slides here. So I’m going to create a function here called plotPoly and I’m actually going to use the same data that, I don’t know if you remember, we used earlier for trying to fit this quadratic.

Plotting polynomials to illustrate overfitting

We created some x and some y data. This is the data we’re going to use and we’re going to use this to look at overfitting. So the details of this function don’t matter too much. What matters is what we do with it, which is that it allows us to basically pass in the degree of a polynomial.

[00:47:07]

So for those of you that remember, a first degree polynomial is just a line, y equals ax plus b. A second degree polynomial is a quadratic, y equals ax squared plus bx plus c. A third degree polynomial has a cubic term, a fourth degree a quartic, and so forth. And what I've done here is I've plotted what happens if we try to fit a line to our data. It doesn't fit very well. So what happened here is we did a linear regression, and what we're using here is a very cool library called Scikit-learn.

Scikit-learn: Library for classic machine learning methods

Scikit-learn is something that, I think it would be fair to say it’s mainly designed for kind of classic machine learning methods, like linear regression and stuff like that. Very advanced versions of these things, but it’s also great for doing these quick and dirty things.

[00:48:02]

So in this case I went in to do what’s called a polynomial regression, which is fitting a polynomial to data, and it’s just these two lines of code. It’s a super nice library. So in this case a degree one polynomial is just a line, so I fit it and then I show it with the data and there it is. Now that’s what we call underfit, which is to say there’s not enough kind of complexity in this model I fit to match the data that’s there.
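Here's a minimal sketch of that kind of polynomial regression. The data generation and the function name are my assumptions rather than the exact notebook code, but the pipeline-and-fit pair stands in for the two lines being described.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# noisy data generated from a quadratic, standing in for the notebook's x and y
np.random.seed(42)
x = np.linspace(-2, 2, 20)
y = 3*x**2 + 2*x + 1 + np.random.normal(0, 1.5, len(x))

def plot_poly(degree):
    # the key two lines: build a polynomial-regression pipeline and fit it
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x[:, None], y)
    plt.figure()
    plt.scatter(x, y)
    plt.plot(x, model.predict(x[:, None]), color='red')

plot_poly(1)   # underfits: a straight line can't follow the curve
plot_poly(10)  # overfits: chases the noise in individual points
plot_poly(2)   # about right: recovers the quadratic the data came from
```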

Underfitting: Model too simple to match data

So an underfit model is a problem. It's going to be systematically biased. All the stuff up here we're going to be predicting too low, all the stuff down here we're predicting too low, and all the stuff in the middle we're predicting too high. A common misunderstanding is that simpler models are somehow more reliable, but models that are too simple will be systematically incorrect, as you see here.

[00:49:00]

What happens if we fit a 10 degree polynomial? That's not great either. In this case it's not really showing us the actual shape. Remember, this data originally came from a quadratic, so that's what it's meant to match, right? Particularly at the ends here, it's predicting things that are way above what we would expect in real life. And it's trying really hard to get through this point, but clearly this point was just some noise.

Overfitting: Model fits training data too well, but generalizes poorly

So this is what we call overfit. It’s done a good job of fitting to our exact data points, but if we sample some more data points from this distribution, honestly we probably would suspect they’re not going to be very close to this, particularly if they’re a bit beyond the edges. So that’s what overfitting looks like. We don’t want underfitting or overfitting. Now, underfitting is actually pretty easy to recognise, because we can actually look at our training data and see that it’s not very close.

[00:50:02]

Overfitting is a bit harder to recognise, because the training data is actually very close. Now on the other hand, here’s what happens if we fit a quadratic. And here I’ve got both the real line and the fit line, and you can see they’re pretty close. And that’s of course what we actually want. So how do we tell whether we have something more like this, or something more like this? Well what we do is we do something pretty straightforward.

Validation set: Data not used for training, but used for measuring accuracy

We take our original dataset, these points, and we remove a few of them, let’s say 20% of them. We then fit our model using only those points we haven’t removed.

[00:51:02]

And then we measure how good it is by looking at only the points we removed. So in this case, let’s say we had removed, I’m just trying to think, if I had removed this point here, then it might have gone off down over here, and so then when we look at how well it fits, we would say, oh, this one’s miles away. The data that we take away, and don’t let the model see it when it’s training, is called the validation set. So in Fast.ai, we’ve seen splitters before, right? The splitters are the things that separate out the validation set.

Fast.ai: Always uses a validation set

Fast.ai won’t let you train a model without a validation set. Fast.ai always shows you your metrics, so things like accuracy, measured only on the validation set. This is really unusual. Most libraries make it really easy to shoot yourself in the foot by not having a validation set or accidentally not using it correctly, so Fast.ai won’t even let you do that.

[00:52:09]

So you've got to be particularly careful when using other libraries. Hugging Face Transformers is good about this, though; they make sure that they do show you your metrics on a validation set. Now creating a good validation set is not generally as simple as just randomly pulling some rows out of the data that you train your model on.

Creating a good validation set: Not just random removal

The reason why is, imagine that this was the data that you were trying to fit something to, and you randomly removed some, so it looks like this. That looks very easy, doesn’t it? Because you’ve kind of still got all the data you’d want around the points, and in a time series like this, this is dates and sales, in real life you’re probably going to want to predict future dates.

[00:53:07]

So if you created your validation set by randomly removing stuff from the middle, it’s not really a good indication of how you’re going to be using this model. Instead, you should truncate and remove the last couple of weeks. So if this was your validation set, and this was your training set, that’s going to be actually testing whether you can use this to predict the future, rather than using it to predict the past. Kaggle competitions are a fantastic way to test your ability to create a good validation set, because Kaggle competitions only allow you to submit generally a couple of times a day.
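As an aside, here's a minimal sketch of the time-based split just described, with synthetic data and assumed column names standing in for the dates-and-sales example:

```python
import numpy as np
import pandas as pd

# synthetic daily sales data standing in for the example's time series;
# the column names 'date' and 'sales' are assumptions
dates = pd.date_range('2022-01-01', periods=120, freq='D')
df = pd.DataFrame({'date': dates, 'sales': np.random.rand(len(dates)) * 100})

cutoff = df['date'].max() - pd.Timedelta(weeks=2)
trn_df = df[df['date'] <= cutoff]   # train on everything up to the cutoff
val_df = df[df['date'] >  cutoff]   # validate only on the final two weeks
```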

Kaggle competitions: Testing ability to create a good validation set

The data set that you are scored on in the leaderboard during that time is actually only a small subset. In fact, it’s a totally separate subset to the one you’ll be scored on at the end of the competition.

[00:54:01]

Kaggle: Overfitting and the importance of a test set

And so most beginners on Kaggle overfit. And it’s not until you’ve done it that you’ll get that visceral feeling of like, oh my god, I overfit. In the real world, outside of Kaggle, you will often not even know that you overfit. You just destroy value for your organization silently. So it’s a really good idea to do this kind of stuff on Kaggle a few times first in real competitions to really make sure that you are confident you know how to avoid overfitting, how to find a good validation set, and how to interpret it correctly. And you really don’t get that until you screw it up a few times. A good example of this was there was a distracted driver competition on Kaggle. There were these kind of pictures from inside a car. And the idea was that you had to try and predict whether somebody was driving in a distracted way or not.

[00:55:02]

And on Kaggle, they did something pretty smart.

Kaggle competitions: Test sets with unseen data

The test set, so the thing that they scored you on the leaderboard, contained people that didn’t exist at all in the competition data that you trained the model with. So if you wanted to create an effective validation set in this competition, you would have to make sure that you separated the photos so that your validation set contained photos of people that aren’t in the data you’re training your model on. There was another one like that, the Kaggle Fisheries competition, which had boats that didn’t appear. So there were basically pictures of boats and you were meant to try to guess, predict what fish were in the pictures. And it turned out that a lot of people accidentally figured out what the fish were by looking at the boat, because certain boats tended to catch certain kinds of fish. And so by messing up their validation set, they were really overconfident of the accuracy of their model.

[00:56:03]

Cross-validation: Not about building a good validation set

I'll mention in passing, if you've been around Kaggle a bit, you'll see people talk about cross-validation a lot. I'm just going to mention, be very, very careful. Cross-validation is explicitly not about building a good validation set, so you've got to be super, super careful if you ever do that. Another thing I'll mention is that Scikit-learn conveniently offers something called train_test_split, as does Hugging Face Datasets, and in Fast.ai we have something called RandomSplitter. It can almost feel like these libraries are encouraging you to use a randomized validation set because there are these methods that do it for you. But be very, very careful, because very, very often that's not what you want.
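For the mechanics, a random split with Hugging Face Datasets looks something like this sketch; the 25% fraction and seed are just illustrative, and the caution above is about whether a random split is appropriate at all, not about the code.

```python
# a random 25% validation split of the tokenized dataset from earlier;
# convenient, but only appropriate when a random split reflects real usage
dds = tok_ds.train_test_split(0.25, seed=42)
dds['train'], dds['test']
```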

Test set: Data not used for training or validation, used for final evaluation

Okay, so we’ve learnt what a validation set is, so that’s the bit that you pull out of your data that you don’t train with, but you do measure your accuracy with.

[00:57:06]

Validation set: Measuring metrics

So what’s a test set? It’s basically another validation set, but you don’t even use it for tracking your accuracy while you build your model. Why not? Well, imagine you try two new models every day for three months. That’s how long a Kaggle competition goes for. So you would have tried 180 models, and then you look at the accuracy on the validation set for each one. Some of those models, you would have got a good accuracy on the validation set, potentially, because of pure chance, just a coincidence. And then you get all excited and you submit that to Kaggle and you think you’re going to win the competition, and you mess it up. And that’s because you actually overfit using the validation set. So you actually want to know whether you’ve really found a good model or not. So in fact, on Kaggle they have two test sets.

[00:58:02]

They've got the one that gives you feedback on the leaderboard during the competition, and a second test set which you don't get to see until after the competition is finished. So in real life, you've got to be very careful about this, not to try so many models during your model building process that you accidentally find one that's good by coincidence. It's only if you have a test set that you've held out that you'll know that. Now that leads to the obvious question, which is very challenging: what if you spent three months working on a model, it worked well on your validation set, you did a good job of locking that test set away in a safe so you weren't allowed to use it, and at the end of the three months you finally checked it on the test set, and it's terrible? What do you do? Honestly, you have to go back to square one. There really isn't any choice other than starting again. So this is tough. But it's better to know, right? Better to know than to not know. So that's what a test set's for.

[00:59:04]

So you’ve got a validation set, what are you going to do with it?

Metrics: Accuracy, Pearson correlation coefficient, etc.

What you’re going to do with a validation set is you’re going to measure some metrics. So a metric is something like accuracy. It’s a number that tells you how good is your model. Now on Kaggle, this is very easy. What metric should we use? Well they tell us. Go to Overview, click on Evaluation, and find out and it says, oh, we will evaluate on the Pearson correlation coefficient. Therefore this is the metric you care about. So one obvious question is, is this the same as the loss function?

Loss function vs. metric

Is this the thing that we will take the derivative of and find the gradient and use that to improve our parameters during training?

[01:00:01]

And the answer is, maybe, sometimes, but probably not. For example, consider accuracy. Now if we were using accuracy to calculate our derivative and get the gradient, you could have a model that's actually slightly better. It's doing a better job of recognizing dogs and cats, but not so much better that any individual prediction actually flips from the wrong class to the right one. So the accuracy doesn't change at all, so the gradient is zero. You don't want stuff like that. You don't want bumpy functions, because they don't have nice gradients. Often they don't have gradients at all, they're basically zero nearly everywhere. You want a function that's nice and smooth, something like, for instance, the average absolute error, mean absolute error, which we've used before. So that's the difference between your metrics and your loss.
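To make that concrete, here's a minimal sketch, my own illustration rather than the lesson's code, of why a step-like metric makes a poor loss while mean absolute error works well:

```python
import torch

def accuracy(preds, targs):
    # step-like: a slightly better model often flips no individual prediction,
    # so this value barely moves and its gradient is zero almost everywhere
    return (preds.argmax(dim=1) == targs).float().mean()

def mae(preds, targs):
    # smooth: every small improvement nudges the value, giving useful gradients
    return (preds - targs).abs().mean()
```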

[01:01:03]

Now be careful, right, because when you’re training, your model is spending all of its time trying to improve the loss, and most of the time that’s not the same as the thing you actually care about, which is your metric. So you’ve got to keep those two different things in mind. The other thing to keep in mind is that in real life, you can’t go to a website and be told what metric to use. In real life, the model that you choose, there isn’t one number that tells you whether it’s good or bad. And even if there was, you wouldn’t be able to find it out ahead of time.

Metrics and AI: Importance of considering the entire process

In real life, the model you use is a part of a complex process, often involving humans both as users or customers, and as people involved as part of the process. There’s all kinds of things that are changing over time, and there’s lots and lots of outcomes of decisions that are made.

[01:02:07]

One metric is not enough to capture all of that.

Metrics and AI: The problem with metrics

Unfortunately, because it's so convenient to pick one metric and use that to say, I've got a good model, that very often finds its way into industry, into government, where people roll out these things that are good on the one metric that happened to be easy to measure. And again and again, we found people's lives turned upside down because of how badly they get screwed up by models that have been incorrectly measured using a single metric. So my partner Rachel Thomas has written this article, which I recommend you read, called "The problem with metrics is a big problem for AI".

[01:03:03]

Goodhart’s Law: When a measure becomes a target, it ceases to be a good measure

It’s not just an AI thing. There’s actually this thing called Goodhart’s Law that states, when a measure becomes a target, it ceases to be a good measure. The thing is, so when I was a management consultant, you know, 20 years ago, we were always kind of part of these strategic things, trying to find key performance indicators and ways to kind of, you know, set commission rates for salespeople, and we were really doing a lot of this like stuff, which is basically about picking metrics. And you know, we see that happen, go wrong in industry all the time. AI is dramatically worse because AI is so good at optimising metrics. And so that’s why you have to be extra, extra, extra careful about metrics when you are trying to use a model in real life.

Kaggle: Using Pearson correlation coefficient

Anyway, as I said in Kaggle, we don’t have to worry about any of that.

[01:04:00]

We’re just going to use the Pearson correlation coefficient, which is all very well, as long as you know what the hell the Pearson correlation coefficient is. If you don’t, let’s learn about it.

Pearson correlation coefficient: Measuring similarity between variables

So Pearson correlation coefficient is usually abbreviated using the letter R, and it’s the most widely used measure of how similar two variables are. And so if your predictions are very similar to the real values, then the Pearson correlation coefficient will be high, and that’s what you want. R can be between minus one and one. Minus one means you predicted exactly the wrong answer, which in a Kaggle competition would be great because then you can just reverse all of your answers and you’ll be perfect. Plus one means you got everything exactly correct. Generally speaking, in courses or textbooks when they teach you about the Pearson correlation coefficient, at this point they will show you a mathematical function.

[01:05:04]

Understanding the behavior of the Pearson correlation coefficient

I’m not going to do that because that tells you nothing about the Pearson correlation coefficient. What we actually care about is not the mathematical function, but how it behaves. And I find most people even who work in data science have not actually looked at a bunch of data sets to understand how R behaves. So let’s do that right now so that you’re not one of those people. The best way I find to understand how data behaves in real life is to look at real life data.

California Housing dataset: Visualizing correlations

So Scikit-learn comes with a number of data sets, and one of them is called California Housing. It’s a data set where each row is a district: demographic information about different districts and the value of houses in each district. I’m not going to try to plot the whole thing because it’s too big, and this is a very common question I get from people: how do I plot data sets with far too many points?

[01:06:08]

Plotting large datasets: Random sampling

The answer is very simple: use fewer points. So I just randomly grab a thousand points. Whatever you see with a thousand points is going to be the same as what you see with a million points. There’s generally no reason to plot huge amounts of data, just grab a random sample. Now, NumPy has something called corrcoef to get the correlation coefficient between every variable and every other variable.

NumPy corrcoef: Calculating correlation coefficients

And it returns a matrix, so I can look down here and say for example here is the correlation coefficient between variable 1 and variable 1, which of course is exactly perfectly 1.0, because variable 1 is the same as variable 1. Here is the small inverse correlation between variable 1 and variable 2, and medium-sized positive correlation between variable 1 and variable 3, and so forth.

[01:07:01]

This is symmetric about the diagonal, because the correlation between variable 1 and variable 8 is the same as the correlation between variable 8 and variable 1. So this is a correlation coefficient matrix. That’s great when we want a bunch of values all at once, but for the Kaggle competition we don’t want that, we just want a single correlation number. If we just pass in a pair of variables, we still get a matrix, which is not what we want. So we should grab one of these. So when I want a single correlation coefficient, I’ll just return the 0th row, 1st column. That’s what corr is; that’s going to be our single correlation coefficient.
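As a rough sketch of what this exploration could look like in code (the exact notebook code may differ; the column names are scikit-learn’s own):

```python
from sklearn.datasets import fetch_california_housing
import numpy as np

# One row per district; a random sample of 1,000 rows is plenty for exploration
housing = fetch_california_housing(as_frame=True).frame
housing = housing.sample(1000, random_state=52)

# Correlation of every variable with every other variable: a symmetric matrix
np.set_printoptions(precision=2, suppress=True)
print(np.corrcoef(housing, rowvar=False))

# For a single pair of variables we just want one number, so grab row 0, column 1
def corr(x, y):
    return np.corrcoef(x, y)[0][1]

print(corr(housing.MedInc, housing.MedHouseVal))
```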

Visualizing correlations: Scatter plots and alpha transparency

So let’s look at the correlation between two things. For example, median income and median house value, 0.67. Okay, is that high, medium, low?

[01:08:00]

How big is that? What does it look like? So the main thing we need to understand is what these things look like. So what I suggest we do is take a 10 minute break, 9 minute break, we’ll come back at half past, and then we’ll look at some examples of correlation coefficients. Okay, welcome back. So what I’ve done here is I’ve created a little function called show_corr. I pass in a data frame and a couple of columns as strings, grab each of those columns as a series, do a scatter plot, and then show the correlation. So we already mentioned median income and median house value having a correlation of 0.68. So here it is, here’s what 0.68 looks like.

Correlation coefficient: 0.68

So I don’t know if you had some intuition about what you expected, but as you can see there’s still plenty of variation, even at that reasonably high correlation.
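Here’s a minimal sketch of what such a plotting helper could look like, assuming matplotlib plus the corr helper and sampled housing frame from the sketch above (the function in the actual notebook may differ a little):

```python
import matplotlib.pyplot as plt

def show_corr(df, a, b):
    x, y = df[a], df[b]
    # alpha adds transparency, so dense regions show up as darker areas
    plt.scatter(x, y, alpha=0.5, s=4)
    plt.title(f'{a} vs {b}; r: {corr(x, y):.2f}')
    plt.show()

show_corr(housing, 'MedInc', 'MedHouseVal')
```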

[01:09:09]

Also, you can see here that visualising your data is very important if you’re working with this dataset, because you can immediately see all these dots along here, and that’s clearly truncation. It’s not until you look at pictures like this that you pick that kind of thing up. Pictures are great. Oh, a little trick: on the scatter plot I set alpha to 0.5, which adds some transparency. For these kinds of scatter plots that really helps, because it creates darker areas in places where there are lots of dots. So yeah, alpha in scatter plots is nice. Here’s another pair.

Correlation coefficient: 0.43

So this one’s gone down from 0.68 to 0.43: median income versus the number of rooms per house. As you’d expect, more rooms generally goes with more income.

[01:10:03]

But this is a very weird looking thing. Now you’ll find that a lot of these statistical measures, like correlation, rely on the square of the difference. And when you have big outliers like this, the square of the difference goes crazy. And so this is another place where you’d want to look at the data first and say, oh, that’s going to be a bit of an issue. There’s probably more correlation here, but there’s a few examples of some houses with lots and lots of room where people that aren’t very rich live. Maybe these are some kind of shared accommodation or something.

Outliers: Sensitivity of correlation coefficient

So R is very sensitive to outliers. So let’s get rid of the houses with 15 rooms or more. And now you can see it’s gone up from 0.43 to 0.68, even though, if you count them up, we only got rid of seven data points.
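As a sketch, that filtering step might look something like this (again using the scikit-learn column names from above):

```python
# Drop the handful of districts averaging 15+ rooms and look again;
# with those few outliers gone, r jumps from roughly 0.43 to roughly 0.68
subset = housing[housing.AveRooms < 15]
show_corr(subset, 'MedInc', 'AveRooms')
```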

[01:11:04]

So we’ve got to be very careful of outliers, and that means if you’re trying to win a Kaggle competition where the metric is correlation, and you get just a couple of rows really badly wrong, that’s going to be a disaster for your score. So you’ve got to make sure that you do a pretty good job on every row. So that’s what a correlation of 0.68 looks like. Okay, here’s a correlation of 0.34. This is kind of interesting, isn’t it?

Correlation coefficient: 0.34

Because 0.34 sounds like quite a good relationship, but you almost can’t see it. So this is something I strongly suggest: if you’re working with a new metric, draw some pictures of a few different levels of that metric to try to get a feel for what it means. What does 0.6 look like, what does 0.3 look like, and so forth. And here’s an example of a correlation of minus 0.2. There’s a very slight negative slope.

[01:12:07]

Correlation coefficient: -0.2

Okay, so that’s just a general tip, something I like to do when playing with a new metric, and I recommend you do it as well.

Understanding the behavior of a new metric: Visualizing different levels

I think we’ve now got a sense of what the correlation feels like. Now you can go look up the equation on Wikipedia if you’re into that kind of thing. Okay, we need to report the correlation after each epoch because we want to know how our training is going.

Reporting correlation after each epoch

Hugging Face expects you to return a dictionary, because it’s going to use the keys of the dictionary to label each metric. So here’s a little function that gets the correlation and returns it as a dictionary with the label pearson. Okay, so we’ve done metrics, and we’ve done our training/validation split. Oh, we might have actually skipped over the bit where we actually did the split, did I?
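Before we come back to that split, here is roughly what such a metric function could look like. The Trainer passes a (predictions, labels) pair to it; the shape handling below is an assumption in this sketch rather than the notebook’s exact code:

```python
import numpy as np

def corr(x, y):
    return np.corrcoef(x, y)[0][1]

def corr_d(eval_pred):
    preds, labels = eval_pred
    # predictions from a single-output model may carry a trailing unit dimension
    return {'pearson': corr(np.array(preds).flatten(), labels)}
```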

[01:13:02]

Splitting data into training and validation sets

I did. So to actually do the split: in this Kaggle competition, I’ve got another notebook we’ll look at later where we split this properly, but here we’re just going to do a random split, just to keep things simple for now. 25% of the data will be our validation set. So if we call train_test_split on the dataset, it returns a DatasetDict which has a train and a test split. So that looks a lot like a Datasets object in Fast.ai, very similar idea.
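Hypothetically, with tok_ds standing in for the tokenised dataset built earlier (that name is an assumption here), the split might look like this:

```python
# 25% of the rows become the validation set, which Datasets calls "test"
dds = tok_ds.train_test_split(0.25, seed=42)
# dds is a DatasetDict with 'train' and 'test' entries
```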

Training a model with Hugging Face Transformers

So this will be the thing that we’ll be able to train with. So it’s going to train with this dataset and return the metrics on this dataset. This is really a validation set, but HuggingFace datasets calls it test.

[01:14:01]

We’re now ready to train our model.

Trainer: Equivalent of learner in Fast.ai

In Fast.ai, we use something called a learner. The equivalent in HuggingFace transformers is called trainer. So we’ll bring that in.

Mini-batches and batch sizes

Something we’ll learn about quite shortly is the idea of mini-batches and batch sizes. In short, each time we pass some data to our model for training, it sends through a few rows at a time to the GPU so that it can calculate them in parallel. That bunch of rows is called a batch, or a mini-batch, and the number of rows is called the batch size. So here we’re going to set the batch size to 128. Generally speaking, the larger your batch size, the more it can do in parallel at once and the faster it will be, but if you make it too big, you’re going to get an out-of-memory error on your GPU. So it’s a bit of trial and error to find a batch size that works. Epochs we’ve seen before, then we’ve got the learning rate.

[01:15:03]

Learning rate

We’ll talk in the next lesson, unless we get to it this lesson, about a technique to semi-automatically find a good learning rate. We already know what a learning rate is from the last lesson. I played around and found one that seems to train quite quickly without falling apart, so I just tried a few. Hugging Face Transformers doesn’t have something to help you find the learning rate; the integration we’re doing in Fast.ai will let you do that, but if you’re using a framework that doesn’t have it, you can just start with a really low learning rate and then keep doubling it until it falls apart. Hugging Face Transformers uses this thing called TrainingArguments, which is a class where you provide all of the configuration.

Training arguments: Configuration for Hugging Face Transformers

So you have to tell it what your learning rate is.

[01:16:05]

This stuff here is basically the same as what we call fit_one_cycle in Fast.ai. You always want this to be true, because it’s going to be faster, pretty much. And then this stuff here you can probably use exactly the same every time; there’s a lot of boilerplate compared to Fast.ai, as you can see, but most of it you can reuse unchanged. So we now need to create our model.
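Before we do, here is a sketch of what those training arguments might look like; the specific values (batch size, epochs, learning rate) are illustrative rather than the notebook’s exact numbers:

```python
from transformers import TrainingArguments

bs = 128      # as large a batch as fits in GPU memory
epochs = 4
lr = 8e-5     # found by trial and error, as described above

args = TrainingArguments(
    'outputs',
    learning_rate=lr,
    warmup_ratio=0.1, lr_scheduler_type='cosine',   # roughly the fit_one_cycle-style schedule
    fp16=True,                                      # mixed precision: "you always want this to be true"
    evaluation_strategy='epoch',                    # report metrics after each epoch
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs * 2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to='none',                               # boilerplate you can largely reuse every time
)
```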

Creating a model: AutoModelForSequenceClassification

So, the equivalent of the vision_learner function that we’ve used to automatically create a reasonable vision model: in Hugging Face Transformers they’ve got lots of different ones depending on what you’re trying to do. We’re trying to do classification, as we’ve discussed, of sequences. So if we call AutoModelForSequenceClassification, it will create a model that is appropriate for classifying sequences from a pre-trained model.

[01:17:04]

And this is the name of the model that we set up earlier, the DeBERTa v3 model. It has to know, when it adds that random matrix to the end, how many outputs it needs to have. We have one label, which is the score. So that’s going to create our model, and then this is the equivalent of creating a learner.

Creating a trainer: Equivalent of creating a learner

It contains a model and the data. The training data and the test data. Again, there’s a lot more boilerplate here than Fast.ai, but you can kind of see the same basic steps here. We just have to do a little bit more manually, but it’s nothing too crazy. So it’s going to tokenize it for us using that function and then these are the metrics that it will print out each time.
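A sketch of roughly what that model and trainer creation could look like; model_nm and tokz stand in for the DeBERTa v3 checkpoint name and its tokeniser from earlier, and args and dds come from the sketches above (all of those names are placeholders here):

```python
from transformers import AutoModelForSequenceClassification, Trainer

# One output ("label") because we are predicting a single score
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

trainer = Trainer(
    model, args,
    train_dataset=dds['train'], eval_dataset=dds['test'],
    tokenizer=tokz,            # used for padding and batching the tokenised inputs
    compute_metrics=corr_d,    # the little dictionary-returning metric from above
)
```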

Metrics: Printing out correlation coefficient

That’s that little function we created which returns a dictionary. At the moment, I find Hugging Face Transformers very verbose. It spits out lots and lots and lots of text, which you can ignore.

[01:18:00]

We can finally call train, which will spit out much more text again, which you can ignore.
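Kicking off training is then a single call, roughly:

```python
trainer.train()   # prints the loss and the 'pearson' metric after each epoch
```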

Training the model

And as you can see, as it trains, it’s printing out the loss, and here’s our Pearson correlation coefficient. So it’s training, and we’ve got a 0.834 correlation. That’s pretty cool, right? I mean, it doesn’t actually say how long it took... oh, here we are, about five minutes to run. Or maybe that’s five minutes per epoch on Kaggle, which doesn’t have particularly great GPUs, but good for free. And we’ve got something that has a very high level of correlation in assessing how similar the two columns are. And the only reason it could do that is because it used a pre-trained model, right? There’s no way you could take just that tiny amount of information and figure out whether those two columns are very similar. This pre-trained model already knows a lot about language. It already has a good sense of whether two phrases are similar or not.

[01:19:00]

And we’ve just fine-tuned it. You can see, given that after one epoch it was already at 0.8, this was a model that already did something pretty close to what we needed. It didn’t really need that much extra tuning for this particular task. Have we got any questions there, John?

[John Williams]

Question: How to decide when to remove outliers?

Yeah, we do. It’s actually a bit back on the topic before where you were showing us the visual interpretation of the Pearson coefficient, and you were talking about outliers. And we’ve got a question here from Kevin asking, how do you decide when it’s okay to remove outliers? It’s like you pointed out some in that data set, and clearly your model’s going to train a lot better if you clean that up. But I think Kevin’s point here is those kinds of outliers will probably exist in the test set as well. So I think he’s just looking for some practical advice on how you handle that in a more general sense.

[01:20:05]

[Jeremy Howard]

Outliers: Never just remove them

So outliers should never just be removed, like for modelling. So if we take the example of the California housing data set, if I was really working with that data set in real life, I would be saying, oh, that’s interesting. It seems like there’s a separate group of districts with a different kind of behaviour. My guess is that they’re going to be kind of like dorms or something like that, probably low-income housing. And so I would be saying like, oh, clearly from looking at this data set, these two different groups can’t be treated the same way, they have very different behaviours, and I would probably split them into two separate analyses. You know, the word outlier, it kind of exists in a statistical sense, right? There can be things that are well outside our normal distribution and mess up our kind of metrics and things.

[01:21:04]

It doesn’t exist in a real sense. It doesn’t exist in a sense of like, oh, things that we should ignore or throw away.

Outliers: Valuable insights and understanding their source

Some of the most useful insights I’ve had in my life in data projects have come from digging into outliers, so-called outliers, and understanding what they are and where they came from. It’s often in those edge cases that you discover really important things about where processes go wrong, or about kinds of behaviours you didn’t even know existed, or indeed about labelling problems or process problems, which you really want to fix at the source, because otherwise when you go into production you’re going to have more of those so-called outliers. So I’d say never delete outliers without investigating them and having a strategy for understanding where they came from and what you should do about them.

[01:22:08]

Trained model: Similar to a Fast.ai learner

All right, so now that we’ve got a trained model, you’ll see that it actually behaves really a lot like a Fast.ai learner, and hopefully the impression you’ll get from going through this process is largely a sense of familiarity: oh yeah, this looks like stuff I’ve seen before, a bit more wordy and with some slight changes, but it really is very, very similar to the way we’ve done it before. Because now that we’ve got a trained trainer, rather than a learner, we can call predict, and we pass in our dataset from the Kaggle test file.

Predicting with the trained model

And so that’s going to give us our predictions, which we can cast to float, and here they are. So here are the predictions we made of similarity.

[01:23:04]

Now, again, not just for your inputs, but also for your outputs, always look at them.

Always look at your inputs and outputs

Always. And interestingly, I looked at quite a few Kaggle notebooks from other people for this competition, and nearly all of them had the problem we have right now, which is negative predictions and predictions over one.

Negative predictions and predictions over one: Fixing the problem

So, I’ll be showing you how to fix this in a more proper way, hopefully in the next lesson, but for now we can at least clamp these, right? Because we know that none of the scores are going to be bigger than one or smaller than zero, so our correlation coefficient will definitely improve if we at least round the negative ones up to zero and the ones over one down to one. As I say, there are better ways to do this, but it’s certainly better than nothing. So in PyTorch, you might remember from when we looked at ReLU, there’s a thing called clip, and that will clip everything under 0 to 0 and everything over 1 to 1.
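A sketch of the prediction and clipping steps, with eval_ds standing in for the tokenised Kaggle test set prepared earlier (that name is an assumption):

```python
import numpy as np

preds = trainer.predict(eval_ds).predictions.astype(float)
# Clamp anything below 0 up to 0 and anything above 1 down to 1
preds = np.clip(preds, 0, 1)
```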

[01:24:08]

Clipping predictions to 0 and 1

And so now that looks much better. So here are our predictions. So Kaggle generally expects submissions to be in a CSV file, and Hugging Face Datasets looks a lot like Pandas, really.
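As a sketch of what comes next, assuming the submission needs an 'id' and a 'score' column, and reusing the preds and eval_ds names from above (treat the details as illustrative):

```python
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds.squeeze(),   # drop the trailing unit dimension from the predictions
})
submission.to_csv('submission.csv', index=False)
```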

Creating a submission file

We can create our submission file with our two columns, call to_csv, and there we go. That’s basically it. So yeah, it’s kind of nice to see how far deep learning has come since we started this course a few years ago, that nowadays there are multiple libraries around to do the same kinds of things.

[01:25:00]

Deep learning: Progress and opportunities

We can use them in multiple application areas, they all look kind of pretty familiar, they’re reasonably beginner-friendly, and NLP, because it’s kind of like the most recent area that’s really become effective in the last year or two, is probably where the biggest opportunities are for big wins both in research and commercialisation.

NLP: Huge opportunity area

And so if you’re looking to build a start-up, for example, one of the key things that VCs look for, that they’ll ask, is like, well, why now? Why would you build this company now? And of course, with NLP, the answer is really simple. It can often be like, well, until last year this wasn’t possible, or it took ten times more time, or it took ten times more money, or whatever. So I think NLP is a huge opportunity area.

[01:26:06]

So it’s worth thinking about both use and misuse of modern NLP.

Subreddit: Automatically generated conversations between GPT-2 models

And I want to show you a subreddit. Here is a conversation on a subreddit from a couple of years ago. I’ll let you have a quick read of it. So the question I want you to be thinking about is: what subreddit do you think this debate about military spending comes from? And the answer is, it comes from a subreddit that posts automatically generated conversations between GPT-2 models. Now, this is a model from a totally previous generation; they’re much, much better now.

[01:27:00]

So even then, you could see these models were generating context-appropriate, believable prose. I strongly believe that any of our upper tier of competent Fast.ai alumni would fairly easily be able to create a bot which could create context-appropriate prose on Twitter or Facebook groups, or wherever, arguing for one side of an argument.

NLP: Potential for misuse

And you could scale that up such that 99% of Twitter was these bots, and nobody would know. Nobody would know. And that’s very worrying to me, because a lot of the way people see the world is now really coming out of their social media conversations, which, at this point, are controllable.

[01:28:04]

NLP: Controllable social media conversations

It would not be that hard to create something that’s kind of optimized towards moving a point of view amongst a billion people in a very subtle way, very gradually over a long period of time by multiple bots each pretending to argue with each other and one of them getting the upper hand and so forth. Here is the start of an article in The Guardian, which I’ll let you read.

The Guardian article written by GPT-3

This article was quite long. These are just the first few paragraphs. And at the end, it explains that this article was written by GPT-3.

[01:29:01]

It was given the instruction: please write a short op-ed, around 500 words, keep the language simple and concise, focus on why humans have nothing to fear from AI. So GPT-3 produced eight outputs, and the editors at The Guardian say they did about the same level of editing that they would do for a human; in fact, they found it required a bit less editing than a human’s writing. So again, you can create longer pieces of context-appropriate prose designed to argue a particular point of view. What kind of things might this be used for? We won’t know, probably for decades, if ever, but sometimes we get a clue based on older technology.

Net neutrality: Auto-generated comments

Here’s something from back in 2017, in the pre-deep learning NLP days. There were millions of submissions to the FCC about the net neutrality situation in America, very, very heavily biased towards the point of view of getting rid of net neutrality.

[01:30:10]

An analysis by Jeff Kao showed that something like 99% of them, and in particular nearly all of the ones which were pro-removal of net neutrality, were clearly auto-generated, basically by selecting from a menu. If you look at the green, we’ve got "Americans, as opposed to Washington bureaucrats, deserve to enjoy the services they desire"; "Individuals, as opposed to Washington bureaucrats, should be" blah, blah, blah; "People like me, as opposed to so-called experts, should be"; and you get the idea. Now, this is an example of a very, very simple approach to auto-generating huge amounts of text. We don’t know for sure, but it looks like this might have been successful, because despite what seems to be overwhelming support for net neutrality from the public, the FCC got rid of it.

[01:31:09]

And a big part of the basis for that was: oh, we’ve got all these comments from the public, and everybody said they don’t want net neutrality. So imagine a similar thing where you absolutely couldn’t figure it out, because every comment was really compelling and very different. It’s kind of worrying to think about how we deal with that.

Bot classifiers: Difficulty in detecting bot-generated content

I will say when I talk about this stuff, often people say, oh, no worries, we’ll build a model to recognize bot-generated content. But if I put my black hat on, I’m like, nah, that’s not going to work. If you told me to build something that beats the bot classifiers, I’d say no worries, easy. I will take the code or the service or whatever that does the bot classifying and I will include beating that in my loss function and I will fine-tune my model until it beats the bot classifier.

[01:32:08]

Beating bot classifiers: Including beating the classifier in the loss function

When I used to run an email company, we had a similar problem with spam prevention. Spammers could always take a spam prevention algorithm and change their emails until they got past it. So, yes, I’m really excited about the opportunities for students in this course to build, I think, very valuable businesses and really cool research using these pretty new NLP techniques that are now pretty accessible. And I’m also really worried about the things that might go wrong.

NLP: Opportunities and concerns

I do think, though, that the more people that understand these capabilities, the less chance they’ll go wrong. John, was there some questions?

[01:33:00]

[John Williams]

Question: Should num_labels be 5 instead of 1?

Yeah, I mean, it’s a throwback to the workbook that you had before. Yeah, that’s the one. Manikandan is asking: shouldn’t num_labels be 5 (for 0, 0.25, 0.5, 0.75, 1) instead of 1? Isn’t the target categorical, or are we considering this as a regression problem?

[Jeremy Howard]

num_labels: One label for one column

Yeah, that’s a good question. So there’s one label because there’s one column. Even if this was being treated as a categorical problem with five categories, it’s still considered one label.

Regression problem: Treating the target as a continuous variable

In this case, though, we’re actually treating it as a regression problem. It’s just one of the things that’s a bit tricky. I was trying to figure this out just the other day; it’s not documented, as far as I can tell, on the Hugging Face Transformers website, but if you pass num_labels of one to AutoModelForSequenceClassification, it turns it into a regression problem, which is actually why we ended up with predictions that were less than 0 and bigger than 1.

[01:34:09]

So we’ll be learning next time about the use of sigmoid functions to resolve this problem, and that should fix it up for us. Okay, great.

Conclusion: Enjoying NLP and looking forward to the next lesson

Well, thanks, everybody. I hope you enjoyed learning about NLP as much as I enjoyed putting this together. I’m really excited about it and can’t wait for next week’s lesson. See ya.