Intro to NLP with spaCy (5): Detecting programming languages | Episode 5: Rules vs. Machine Learning [Jun 13, 2020]
[Vincent]
Intro and Recap of Previous Video
Hi, welcome to video number five in this tutorial series on spaCy. In the previous video we built an NLP object with our very own custom pipeline component in it: every time we gave it some text, out would come a document, and that document would have entities on it, in particular our own programming-language entity, the one we trained for. So it's not just the entities that spaCy gives you with one of its models; we added an entity of our own, meant for our use case. Programming languages are now proper entities and they get detected, which is cool. But we have two ways of finding these entities.
Comparing Rule-Based and Machine Learning Approaches
On the one hand we have our trusty Matcher object, and on the other we have our machine-learning approach. These are two different approaches,
[00:01:02]
and they presumably both have their pros and cons. Something I would really like to do in this video is properly compare the two. I think that by comparing them we'll understand the problem better, which is nice, but I also like the idea that if I'm going to report numbers for my machine-learning model, I should compare it against at least something of a valid benchmark, and that's what I'm hoping to do with the matcher.
Saving Models to Disk for Comparison
In order to do that, though, I need both of these models on disk, so that I can properly compare them. I'm going to start writing code for all of this, and I hope you're excited, because this phase of a project, where you start comparing the pros and cons of approaches, tends to get quite interesting. This is one of those moments where you get some nice lessons about your problem, and I've always found it to be the fun part of a project, so to say. So let's get started. Okay, I'm back in the notebook.
[00:02:06]
Reviewing Code for Entity Ruler and Matcher Patterns
Let's have a brief look at some of the code. What you'll see here is that I'm importing the EntityRuler from spaCy, as well as the create_patterns function from a common.py file. You might remember those patterns from a previous video: it's a collection of regex-like features that helps us detect certain programming languages, and the reason some of those regexes are in there has to do with the different version numbers of different languages. That's the function we're using to keep our code in the notebook clean, so I'll just comment that out here. Given those patterns, the main thing that's important in this code is that I'm making an EntityRuler object. This is going to be another step in our pipeline, and it represents an entity-detection model based on those matcher rules.
[00:03:04]
As you can see in this line, we take every pattern from that create_patterns function, together with our programming-language entity label, and give it to the ruler as a pattern; that's how this ruler pipeline step is built up. You can also see that we start with an empty English nlp model, and when we pass some text through it, golang is indeed detected as a programming language. So we've made a model using the matcher rules that is able to detect entities on our behalf.
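The cell being described probably looks roughly like this; a minimal sketch assuming the spaCy v2 API used in this series and the create_patterns() helper from common.py (the example sentence is made up):

```python
import spacy
from spacy.pipeline import EntityRuler
from common import create_patterns  # helper from the video; returns [{"label": ..., "pattern": [...]}, ...]

nlp = spacy.blank("en")                  # start from an empty English pipeline
ruler = EntityRuler(nlp)
ruler.add_patterns(create_patterns())    # one rule per programming-language pattern
nlp.add_pipe(ruler)

doc = nlp("i wrote a little api in golang last week")
print([(ent.text, ent.label_) for ent in doc.ents])
```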
Preparing Data for Scoring and Using spaCy’s docs_to_json
So that means we have a matcher-based nlp object and we also have an ML-based nlp object for detecting entities, and if both are saved on disk we can easily load them and automate the evaluation a bit. To do that, spaCy offers some helpful functions.
[00:04:02]
In this case you can save just the rules to disk, which are the rules from this EntityRuler over here, or you can save the entire nlp model to disk. In the first case you save a file, in the second case you save a folder, but either way the model is saved to disk and is something we can use later. I'll just run those; everything is saved to disk, that's good. The next thing we need to do is prepare our data for scoring.
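Something along these lines, with hypothetical paths:

```python
# Persist both flavours so they can be evaluated later.
ruler.to_disk("video5/patterns.jsonl")    # just the rules: a single JSONL file
nlp.to_disk("video5/rule-based-model")    # the whole pipeline: a folder on disk
```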
Using spaCy’s CLI for Evaluation
I've got an example here of a single document and what we would like it to look like. You'll notice that this JSON blob eventually has a raw key containing the raw text, and that the text is split up into different tokens, with the word that was captured for each one.
[00:05:02]
You can also see that a lot of them have an ner value of O, which means no programming language, in fact no entity at all, is marked there, while some of them have the U-, B-, or L- tags for our programming-language label. The main idea is that if you want to compare models, your data has to be in a certain format, and this is a very general format that spaCy prefers. The reason for that is that you can use this format for lots of things: if your text has a category you can put that in here as well, you might add extra information like parts of speech, and if you want to train on all of that you can totally do so with this data format in mind. But for my intents and purposes I mainly care about the named-entity-recognition parts. If you want to get your data into this format, I highly recommend you check out the spaCy command-line tools; there's a command called spacy convert that can definitely help you, and otherwise there's also a very helpful function called docs_to_json in the spacy.gold
[00:06:05]
submodule that you can use. I won't go too in depth on that here, but definitely check out the documentation if you're preparing this for your own problem. Now, because I've done this beforehand, this entire data set is already on disk.
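A hedged sketch of getting annotated Docs into that JSON training format (spaCy v2.x; the texts and file path are made up for illustration, and nlp is the rule-based pipeline from above):

```python
import srsly
from spacy.gold import docs_to_json

texts = ["i want to learn golang", "is css3 a programming language?"]
docs = [nlp(t) for t in texts]

# docs_to_json produces the structure shown earlier: a "raw" key plus
# per-token "ner" tags such as "O", "U-...", "B-...", "L-...".
srsly.write_json("training-data/example.json", [docs_to_json(docs)])
```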
Discussing Evaluation Results and Highlighting Differences in F1 Score
here that’s already on disk and what i’ve done is i’ve got this training data that i’ve used to train my original machine learning model and i’ve also got my validation data set and those two are separate and this file is saved on disk and so is this one and what i can now quite conveniently do is just from the command line run the evaluation scripts and that i figured that’d be nice to demo because i didn’t know that before doing this actually but spaCy has a really convenient command line interface and one of the things that we’ve used it for is to download language models but you can also run your evaluation scripts and you can even train models
[00:07:04]
as well, just using the command-line interface. So what I'm going to do now is use the CLI to start with some evaluation.
Demonstrating spaCy’s CLI for Model Evaluation
I'm in the terminal right now, and if I type python -m spacy you'll see lots of commands at our disposal. You probably recognize download, which we used in one of the earlier videos, but let's now use the evaluate command to evaluate the model we just saved to disk. To do that, you type python -m spacy evaluate, then the model path, which for me is inside a folder called video 5 where our rule-based model lives, the one we just saved, and then, from my training-data folder, the validation data set. Again, that's the data I've prepared up front, using the format I just showed you.
[00:08:03]
So let's just run this. Okay, we can see a bunch of things that it's just done. We see some statistics, like how many words there actually are and how many words per second it was able to handle. One thing to keep in mind is that there are also scores for things like part-of-speech tagging, and those are zero right now, but the reason they're zero is that we didn't add any labels for that to train on, so it's something we can safely ignore. The same goes for the text-category scores, where we also didn't add any labels, so again we'll ignore those. The thing that's mainly of interest, to me at least, is the precision, recall, and F-score for named entity recognition, because that's the score that tells us how well we're detecting programming languages. I'll keep that in mind and make a note of it, but what I'm also going to do now is run the same command for the other model.
[00:09:08]
That other model was also saved using the to_disk command on the nlp object; it's the best model I made in the previous video. So what I'm going to do is give it my validation data as well, and again I must stress that this is a data set the model did not train on. Don't cheat here: you have to make sure it's a separate data set, but I've made sure that's the case. I can now run this (after fixing a typo), and here we get the other scores as well. I'm going to summarize these and put them in the notebook.
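The two evaluation runs, roughly as typed in the terminal (folder and file names here are assumptions; use whatever paths you saved to earlier):

```bash
python -m spacy evaluate video5/rule-based-model training-data/valid.json
python -m spacy evaluate video5/ml-model training-data/valid.json
```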
Analyzing Evaluation Metrics and Comparing Performance
Okay, so what I've done is take the statistics that I figured were the most interesting and copy them into the notebook, so I can discuss them briefly.
[00:10:02]
In particular, one thing I immediately noticed, and I do think it's relevant to point out, is that if you look at the number of words per second the rule-based system can handle and compare it against the machine-learning approach, the rule-based approach is a fair bit quicker; it's about a factor-three difference. That's definitely not to say the machine-learning approach is slow, but I do believe it's a valid observation to make. The second thing I immediately notice is that, looking at the F1 score, the machine-learning model seems to perform a bit better than the rule-based one, a fair bit even. If you look at the recall, the rule-based model seems slightly better, but where the machine-learning model compensates is in precision: it has something like a nine percent hike there, and that's a pretty big difference.
[00:11:04]
I imagine that's also where the difference in the F1 score comes from, and that is actually quite interesting.
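As a quick reference for these numbers: the F-score that spaCy reports is the harmonic mean of precision and recall,

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

so a sizeable precision gap can outweigh a small recall deficit.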
Exploring Examples Where Models Disagree
So if I look at this and wonder what the next step is: I could be tuning hyperparameters here, which I guess is fine, but at the moment I'm a little more curious about understanding this difference, because a nine percent difference is pretty big. I have a gut feeling that if I understand it, I might learn something that will also be really good for the model, so I just want to explore that before I consider doing anything with hyperparameter tuning.
Manually Examining Model Disagreements with Generated Text
So what I'm going to do, and I think it's an interesting way of exploring the difference, is take the matcher model and the machine-learning model and look at examples where they disagree.
[00:12:05]
If I can look at where they disagree, and also get a feeling for when I agree with the matcher more than with the machine-learning model, then I might get an impression of where the gaps in knowledge are, so to say. So I'm just going to do that little exercise: look through, find some examples where the two differ, and see if there's anything we can learn from that. I'm back in the notebook.
Loading Saved Models and Using displaCy for Visualization
What I'm going to do now is try to poke both models and see if I can get them to disagree. Because both models are saved on disk, I can easily load them in, like so. I'm going to generate some text and then use displaCy, the visualization tool inside spaCy, to show me what one model thinks and what the other one thinks, and that gives me a nice opportunity to start comparing them.
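A minimal sketch of that poking setup: load both saved pipelines and render their entities side by side (the paths are assumptions, and the sentence is roughly one of the statements used in the video):

```python
import spacy
from spacy import displacy

rule_nlp = spacy.load("video5/rule-based-model")
ml_nlp = spacy.load("video5/ml-model")

text = "i am totally going to do me some css3, python-3, sql and html here"
for name, model in [("rule-based", rule_nlp), ("statistical", ml_nlp)]:
    print(name)
    displacy.render(model(text), style="ent")  # renders inline in a notebook
```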
[00:13:02]
Analyzing Specific Examples of Model Disagreements
I've done this poking and prepared a few statements, and there were some interesting lessons. In this first example you'll see that the statistical model does not detect go, but the rule-based one does, and I can imagine that go is in general one of the harder ones to detect; this might partially explain why we see a slightly higher recall for the rule-based approach. I figured that was interesting. Then I had another example I thought was also really interesting: here I say "I'm totally going to do me some css3, python-3, sql and html here". You can see that html is picked up perfectly fine in both scenarios, and so is sql, but python-3 is handled incorrectly by both methods. I think what's happening is that I did not have a regex rule to catch that,
[00:14:05]
so that can be a reason why the rule-based system doesn't pick it up, and it probably also doesn't occur in the data, which is why the machine-learning model doesn't pick it up either. I found this phenomenon quite interesting, because it got me thinking: it's really easy for me to add a rule for something that might happen in the future, and even if it's not in the data, the rule-based system has a way of dealing with it, whereas the machine-learning system cannot predict a phenomenon that isn't in the data. In that sense I was pleasantly reminded that the rule-based system also comes with an insurance policy of sorts, something I hadn't fully realized before doing this exercise, so I felt that was an interesting lesson. But I also see something else here that's quite interesting, because it's something I like seeing in a model,
[00:15:03]
and that's what happens here with css. Because the rule-based side does not attach the 3 to css, that means there's no rule for it, no rule able to add the number three onto the css; that's probably a rule I forgot about. But it is interesting that the machine-learning model has been able to generalize here somewhat, and that is something I really like to see: even though there were no examples of css followed by a 3, it was able to recognize that css on its own is probably an entity, and that if it's followed by a number it can merge the two together, which in this case I would argue it has done correctly. So I felt this was an interesting example as well. But let's move on to another one. In this example I'm saying "I'm old school, I work with common lisp and objective-c3",
[00:16:03]
and there’s two interesting mistakes happening here so first off common is not attached to lisp over here in the statistical model and this is the same phenomenon as before i believe this is one of those situations where i was able to add a rule up front but there might not be enough training data for the machine learning model to pick this rule up so that’s the interesting observation there and also something that’s happening here is it’s incorrectly recognizing that objective c version 3 is the one thing that’s supposed to be detecting so that’s interesting to observe this is something that’s going wrong in both scenarios and there’s actually something worrying about this example now that i think about it
Identifying Potential Issues with Training Data Labeling
If I think about how I started labeling: I was using a matcher to generate lots of labels, and then checking by hand in Excel whether they were correct. That also means that if my rules are
[00:17:03]
incorrect, there is a risk that the machine-learning model is going to learn those incorrect rules. That is something to think about.
Transitioning to Analyzing Disagreements in Actual Training Data
So here is what I'm going to do now. I think I'm on to something, and I like some of the lessons I'm learning, but currently I'm generating my own text and checking manually where the mistakes are made. For initial poking that's fine, but what I'd like to do now is repeat this with the actual training data and see where the models disagree there, because that way I might also get an impression of whether there's anything funky in my training data. So I'll get the code together to do that now.
Writing Code to Filter and Analyze Disagreements in Training Data
what i’ve done now is i wrote a cell over here but what it’s going to do in this case is it’s going to filter rows where the rule-based and the statistical-based model disagree so
[00:18:04]
The data frame I end up with will contain only those disagreements. Below it I have a cell that gives me a new random example from that data frame every time I run it, so I can just sample away, keep poking, and get an impression of which mistakes are happening in which models; that might help me paint a picture of why. So I'm going to run that cell a whole bunch of times, and at some point I might also change the unequal sign to an equals sign, so I can see when the models tend to agree. The example I have right now is already quite nice, because I can see that the rule-based system is simply missing an html5 example, and the xhtml2 example is also not detected, while the statistical model is able to pick those up.
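A sketch of those two cells, assuming a pandas DataFrame `df` with a "text" column of training sentences and the two pipelines loaded earlier:

```python
def lang_ents(nlp, text):
    """The set of entity strings a pipeline detects in a piece of text."""
    return {ent.text.lower() for ent in nlp(text).ents}

df["rule_ents"] = df["text"].apply(lambda t: lang_ents(rule_nlp, t))
df["ml_ents"] = df["text"].apply(lambda t: lang_ents(ml_nlp, t))

# keep only the rows where the two approaches disagree
# (flip != to == to sample the cases where they agree instead)
disagree = df[df["rule_ents"] != df["ml_ents"]]

# re-run this cell to inspect a fresh random example
disagree.sample(1)[["text", "rule_ents", "ml_ents"]]
```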
[00:19:00]
so what i’m going to do now is i’m just going to run this a whole bunch of times and then i’ll just summarize my findings and i’ll draw them out for the video so i’ve learned some
Summarizing Findings from Analyzing Training Data Disagreements
I've learned some interesting things that are worth mentioning. I noticed one example where there was c#, then immediately a comma, javascript, immediately another comma, and then php, and both models got different parts wrong in examples like this. That's because of the commas: I can imagine the tokenizer is having a little trouble splitting this up appropriately. This is an issue that's independent of both models; it comes from the fact that some of the text on Stack Overflow is really unlike the English language. The issue has nothing to do with the models, it's just that this is not really how English text is supposed to be written, so I can imagine that's why the tokenizer struggles a bit.
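A quick way to see what's being described here is to inspect how the default tokenizer splits such comma-glued text (the exact string is an assumption):

```python
import spacy

nlp = spacy.blank("en")
print([t.text for t in nlp("i use c#,javascript,php at work")])
```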
[00:20:01]
There were also examples where someone wrote c# and then immediately a seven. These are some issues I saw that both models had a bit of trouble with. What I did notice is that the matcher lowercases everything immediately, which means capitalization is less of an issue for it, while for the statistical model whether or not something is capitalized may actually make a bit of a difference. This was nice to have in the back of my mind, because it felt like the matcher was slightly more robust against this, so that was also good to understand. But in general I noticed something that was a bit curious. I had a look at when the models agreed and when they disagreed, and I noticed that when they agreed, it was usually on languages like Python, JavaScript, and Java.
[00:21:05]
In most of those scenarios, both models would simply agree with each other, and they'd be right. I noticed a little more disagreement for a language like Go, as well as a language like SQL, and part of what's happening with SQL can be attributed to the capitalization thing I noticed. But when I was properly investigating this, I was reminded that the languages we agree on are rather frequent programming languages; I think they're the most popular languages on the Stack Overflow platform, and I can also imagine that the Go example was really infrequent. This got me thinking, because it might be the case that I'm really, really good at predicting these languages, but if those are the languages that are both the most common and therefore perhaps also the easiest to predict,
[00:22:03]
then I might have a model that slightly overfits on the easy cases and isn't necessarily able to handle the harder ones. So the thought I had before got me a little bit worried, and I figured
Investigating Potential Imbalance in Training Data
an extra investigation was in order. What I've done is take the first 40,000 rows from my training data, the questions.csv file, and make two separate matchers: one with all the Python rules and the other with all the golang rules. I figured this would be good because then I can check how many examples of each I actually have. The Python matcher was able to find 820 examples, whereas the golang matcher only found 65. Then I started wondering: out of these examples, how many were actually captured by our statistical model? Python was captured 818 times, but the Go programming language
[00:23:07]
was only captured about 31 times. That suggests we have something like a 99% overlap between the two approaches for Python, and not even 50% for Go. It's fair to mention that there are two things happening here: I think Go is in general a much harder language to detect, but the fact that it's also underrepresented isn't helping. This gap makes me wonder whether the precision score I started out with is a number I should blindly trust, because that number is an average over the entire group, and if I'm judging the model on something relatively easy, because it has so many of these Python examples, then I don't know if I'm doing a good job at benchmarking and reporting good numbers. So I'm really happy I did this exercise.
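A hedged reconstruction of that coverage check: count how often a language's rules fire in the first 40,000 questions, and how often the statistical model also tags a programming-language entity there. The column name, the "PROGLANG" label, and the simplified patterns below are all assumptions; the real, much stricter rules live in common.py.

```python
import spacy
import pandas as pd
from spacy.matcher import Matcher

ml_nlp = spacy.load("video5/ml-model")
titles = pd.read_csv("questions.csv", nrows=40_000)["Title"].dropna()

python_matcher = Matcher(ml_nlp.vocab)
python_matcher.add("PYTHON", None, [{"LOWER": "python"}])      # crude stand-in rule
go_matcher = Matcher(ml_nlp.vocab)
go_matcher.add("GO", None, [{"LOWER": {"IN": ["go", "golang"]}}])

def coverage(matcher):
    found = captured = 0
    for doc in ml_nlp.pipe(titles):
        if matcher(doc):
            found += 1
            # did the statistical model also tag a programming language here?
            captured += any(ent.label_ == "PROGLANG" for ent in doc.ents)
    return found, captured

print("python:", coverage(python_matcher))
print("go:    ", coverage(go_matcher))
```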
[00:24:01]
But it does leave me rethinking what I should do in terms of data quality, so let me describe the new plan I have in mind for this.
Proposing a New Plan for Data Quality Improvement
i’ve got in my mind for this so let’s discuss how we ended up with this data set in the first place i have my csv file and what i then did is i took that csv file and i came up with some rules we made this matcher so we could detect many different programming languages i use this matcher to create a subset of the data that i would actually manually label so i had this data set over here that was the excel file which i used to make my golden labels so to say those were the labels that i would say yes those are the ones that i trust so that’s the data set that i convinced myself was the one i would use and that eventually led to a statistical model
[00:25:00]
Outlining Steps for Continuous Model Improvement
and here’s what i think i should now do so for starters i think what i need to do is i need to make sure that there’s a minimum amount per language in this data set so let’s say in the training data set i must have a hundred examples for go and for maybe some more of the obscure languages just to make sure that i’ve got that area covered because i wasn’t aware of the imbalance but there really seems to be something of an imbalance here so i figured it’s a good idea if i were to take active effort to make sure that i’m covering a lot of ground after having done that i like to think that my statistical model will be slightly better but i see an opportunity here for continuous improvement because what i can do is i can have a look at when my statistical model as well as my matcher model when they perhaps disagree and i think it’s the instances where the matcher and the statistical model where they disagree
[00:26:00]
Those are excellent candidates to start labeling first. By making sure I label those disagreed examples first, I should be able to continuously improve my statistical model, but I should also see examples that will hopefully inspire me to make a better matcher. Once I start doing this, I'm in a nice little loop that just leads to continuous improvement, and I can even keep track of things like: suppose I retrain after another 200 labels, how much improvement is there? That should let me guesstimate how long I should keep labeling. I think this is a better approach, and something I should get set up, rather than just tuning a couple of hyperparameters, mainly because at the moment I want to be careful about suggesting I have a really good model if the data set I'm judging it on contains really easy examples, and I have evidence that this might be happening here.
[00:27:01]
so what i’m going to do now and i’ll do that off camera because i think it’s going to be boring
Implementing Data Labeling and Model Retraining
I'm going to do a first iteration of this, and I hope that in doing so I'll end up with slightly better labeled data. Then I'm going to judge both the matcher and the statistical model one more time, to see if I've made an improvement. The way I'm going to label this is a bit manual: I'm going to generate one data set that has lots of examples per language and make sure that's labeled, and then I'm going to do the same thing in batches of, I think, 200 examples where the models disagree, and label all of that. I'll be using Prodigy for this, since at this point that will save me some effort, but you should also be able to do this in Excel or whatever other tool you like.
[00:28:00]
then we’ll do an evaluation of both models and we’ll that’ll be a nice way to wrap up this video
Discussing Results of Retraining and Improved Recall
so i’ve done all the labeling and what i’ve also done is i’ve ran the same commands from the command line as i showed you before and i’ve also trained a new model from the command line and here’s some of the results we see in the rule-based approach that definitely the numbers are worse but we also see that here the examples that are in there are now more hard so to say because there’s more examples of go and i’ve also added examples where the machine learning model as well as the statistical model were confused so these are inherently trickier examples in that sense but there is one thing that looking at these results make me optimistic and that is this notice that the recall in the statistical model is now higher than the recall in the
[00:29:01]
Before, this was the other way around, and to me that's a nice signal that we are, in the end, generalizing. I've been able to come up with rules that cover a lot of ground, but the statistical approach is going beyond that, especially when you consider that the precision is pretty much the same. It seems our statistical approach is offering us something we didn't have before, so I would argue this is progress, and I'm happy that I did this exercise.
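For completeness, the retrain-and-reevaluate loop behind this wrap-up probably looked roughly like this (spaCy v2 CLI; the language, paths, and NER-only pipeline flag are assumptions based on what the video describes):

```bash
python -m spacy train en video5/retrained-model training-data/train.json training-data/valid.json --pipeline ner
python -m spacy evaluate video5/retrained-model training-data/valid.json
python -m spacy evaluate video5/rule-based-model training-data/valid.json
```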
Looking Ahead to Future Considerations for Evaluation Data Sets
The main theme I'll start thinking about for the next video is what a really good evaluation data set might be. Currently I'm optimizing for something general, which I think is a good starting point, but at some point we are going to have to move this towards an application.
Concluding Remarks and Outro
That means we might need to find a way to collect a data set that's representative of that application, one we can use as a test set,
[00:30:02]
because of course it will depend on the application, but odds are that Stack Overflow questions are not exactly the same as the data we'll get in an application. That's something I will start thinking about, but I'm very happy that I now have a flow where I can make improvements to my labels. So thanks for watching, and maybe see you in the next video.