Exploring spaCy document objects
Import libraries
(0:00 - 1:42)
All right, so the way you get started is to first install these three libraries: spaCy, pandas and itables. We know what spaCy is, and you probably know pandas, which is a data frame library. itables is a Python library that lets you visualize any data frame as a table using the jQuery DataTables library, and that is very powerful because DataTables has been around for a long time and you get pretty much all of its features out of the box.
And the other thing that we are going to install is the en_core_web_md model, the medium-sized English spaCy model. This is provided by the Explosion team themselves, and it offers a good tradeoff between the size of the model and its power. So, let us go ahead and run this first.
You can see that it takes a few seconds to download the model from its GitHub release page, and then it installs it along with the other libraries. That is the first step. I am going to clear the outputs, and then we will import the libraries that we just installed. Let us go ahead and run that; it takes a few seconds.
So now we have both installed and imported the libraries that we need, and we will be using all of this in the next set of lessons.
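Here is a minimal sketch of that setup, assuming a Jupyter notebook; the exact install commands and cell layout in the course notebook may differ:

# Install the three libraries and the medium English model (run once in a notebook cell).
# !pip install spacy pandas itables
# !python -m spacy download en_core_web_md

import spacy
import pandas as pd
from itables import init_notebook_mode, show

# Render pandas DataFrames as interactive DataTables tables inside the notebook.
init_notebook_mode(all_interactive=True)

# Load the medium-sized English pipeline once and reuse the nlp object in later lessons.
nlp = spacy.load("en_core_web_md")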
Splitting text into sentences
(0:00 - 3:19)
The first thing you do when you have a paragraph of text is split it into multiple sentences, because that makes it easier to read as well as to analyze what is going on. What we have here is a VAERS write-up, which is just a description of what happened, and I am going to show you the code people usually use when they want to split text into sentences. You can see that this text string contains the same text you have in the write-up. The first step is to create the NLP object by loading the en_core_web_md model. Then you declare the text variable that holds all this information, and you construct the document object by calling nlp on the text. Once you have the document object, you have access to everything inside a spaCy Doc, and one of those things is .sents, the list of sentences, which you can just iterate through and print the text of each one. So let us go ahead and do that, and this is what you see: roughly ten sentences, maybe more than fifteen in fact. Just reading this list of sentences is already pretty good, but you can go a step further and render the whole thing in a table.
So that is what I am going to do now. Let me show you the output first, just so you can see what it looks like. You have the list of sentences here, it looks a little nicer, and you have text as the column name, but you also get built-in pagination and even a search box. You can search for specific words inside these sentences, and believe it or not, all of this comes out of the box when you use the itables library. Under the hood it uses the jQuery DataTables library, which provides all these features; itables essentially makes DataTables usable within a Jupyter notebook, and once you do that you get all the power of DataTables inside the notebook. And it is considerable power: once the schema, that is, the columns you put in the table, starts getting more complicated, you will understand exactly what I mean when I say that the underlying jQuery DataTables library is extremely powerful, and just by using it you will be able to get a lot of insight into what is going on inside spaCy document objects.
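A minimal sketch of the sentence-splitting code described above, reusing the nlp, pd and show objects from the setup sketch; the text variable here is a short placeholder standing in for the actual VAERS write-up:

# Placeholder text; the lesson uses the full VAERS symptom write-up instead.
text = ("Vaccination site pain; vaccination site erythema. "
        "Patient received the first dose on 16-April-2021. "
        "Symptoms resolved after two days.")

doc = nlp(text)

# doc.sents yields one Span per sentence.
for sent in doc.sents:
    print(sent.text)

# The same sentences rendered as an interactive table via itables.
df = pd.DataFrame({"text": [sent.text for sent in doc.sents]})
show(df)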
For now all we have is a list of sentences, which is not particularly interesting, but as we go on to the next set of lessons you will see how much it helps to be able to render all of this in an interactive tabular format.
Splitting text into words
(0:00 - 0:50)
A second very common use case for spaCy is splitting text into individual words, and we use spaCy tokens to do this. It is very important to note that even though spaCy tokens most often correspond to individual words, sometimes they do not. For example, sometimes an individual word is split into multiple spaCy tokens, and there are other cases where a single token spans what we might consider to be multiple words. So what I am going to do now is run the code that we have here, and you can see that it displays all the tokens in this table; there are 404 tokens in the text that we have.
(0:50 - 1:33)
This is the same text that we used in the previous lesson, and there are 404 tokens. You can use the pagination to go through this list. To give a quick example, you can see that 16-April-2021 is split into three different tokens, although you would probably think of it as a single word; it is a date, so you might think of it as one word, or sometimes people do not. I am just pointing out that there is not a one-to-one correspondence between spaCy tokens and English words.
(1:33 - 1:55)
And again, the punctuation, for example, is a token all by itself, but it is not a word as we usually understand words in English. So we have all these closing brackets, hyphens and even the periods; those are not English words, but they are spaCy tokens.
(1:55 - 2:45)
So with this, what you see is that you just iterate through the document: you say for token in doc, and that alone gives you access to each token object. In this particular case I extract the text by calling the text property on the token object, append it to the data list that I am building, convert that into a data frame using pandas, and then display the data frame; in itables this is the command that lets you do that. And you see that we have this nicely printed list of tokens that we can look at in this interactive tabular format.
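As a rough sketch of that loop (same doc, pd and show objects as before):

data = []
for token in doc:                       # iterating a Doc yields Token objects
    data.append({"text": token.text})   # .text is the verbatim token text

df = pd.DataFrame(data)
show(df)                                # interactive table with pagination and search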
Part of speech tagging
(0:00 - 2:11)
And we can use spaCy tokens to identify these part of speech tags. You can just look at the token.pos_ attribute and it will give you the part of speech tag. I will point out that the trailing underscore indicates that it is a string attribute; if it does not have an underscore, it usually means it is not a string attribute.
And we are going to use the same technique that we have been using before. You have this list called data, you iterate through each token in the document, and at each step, that is, for each token, you append an object to this data list. In this case the first column is the text, where I take the token's text, and the second column is POS, which is the part of speech tag.
And then I convert it into a data frame and then I display it using itables. So, let us go ahead and see what this looks like. All right.
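Something along these lines, assuming the same doc, pd and show objects as in the earlier sketches:

data = []
for token in doc:
    # pos_ (with the trailing underscore) is the string form of the coarse part of speech tag.
    data.append({"text": token.text, "pos": token.pos_})

show(pd.DataFrame(data))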
So now you have something a little more complex in terms of tabular information, because you have two columns. Now let us say that you want to sort: when you click on this sort button or icon here, it sorts the part of speech column, that is, all these values, in ascending order. Click on it one more time and it does the same in descending order.
And this is a way for you to look at all the text. Because of the way we constructed it, I do not think there is a way to go back to the original ordering, so maybe I can just run it again, yes. So this is where we started. Now you can see that you get the part of speech tag for all the tokens, including the punctuation.
(2:11 - 2:42)
And notice that the punctuation has its own part of speech tag, called PUNCT, and you can see that you have this for all the different punctuation marks. So this is how you get part of speech tags in spaCy, and I have shown you a way to visualize the part of speech tag for each token that is actually pretty easy to read in an interactive table format.
Stop words and punctuation
(0:00 - 1:26)
When you are processing text using spaCy, a very common task is to ignore stop words, which are very common words in English, and to ignore punctuation marks as well. A good example: let us say you are rendering a word cloud based on a given text. If you were to include the stop words, which are words like "a" and "the", these are so common that they will probably appear in every single sentence, sometimes multiple times in the same sentence. What happens is that the stop words start dominating your word cloud and end up being the biggest, the ones rendered with the largest font.
So you do not want that, because it is usually not adding any insight. In the same way, punctuation appears quite often: every single line will have a period at the end, and your particular text might be littered with a lot of punctuation like commas and semicolons. That is in fact true for the VAERS dataset; the symptom write-up that you see usually has a lot of semicolons, as you have already seen, and it also has a lot of commas and many hyphens. So you do not want to include these when you are trying to render a word cloud of the text.
(1:26 - 2:38)
So what you want to do is check whether a given token is punctuation or a stop word, and spaCy makes it really easy to do that because it includes these as attributes: you can check is_stop to see if a token is a stop word, and is_punct to see if it is punctuation. So let us go ahead and run this, and you can see the result. For the stop words, let us see what counts as a stop word: there you see "via" is marked as a stop word; I do not know if I would call that a stop word, but apparently spaCy thinks it is. In the same way, you can see that the punctuation marks are all marked as true under the is_punct column.
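A sketch of that table, again reusing the earlier doc, pd and show objects:

data = []
for token in doc:
    data.append({
        "text": token.text,
        "is_stop": token.is_stop,    # True for very common words like "a" and "the"
        "is_punct": token.is_punct,  # True for ".", ",", ";", "-" and so on
    })

show(pd.DataFrame(data))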
So this is one more reason why I like the DataTables rendering of this information: you can just scroll through all of it, even get to the end of it, and see at a glance whether a given token is a stop word or punctuation and so on.
Visualizing spaCy text spans
(0:00 - 0:37)
What we are going to look at is the text span. Before we do that, I want to mention token.i, which gives the word-level index of the token in the text. Now this is the kind of thing that is very easy to demonstrate using the itables visualization, so I am just going to run this. You can see that you have the text and then the index, which just increments as integers: for every single token you see, token.i increments by 1.
(0:37 - 1:04)
So these are the token indices, that is, the i values for each token. Once you see this, we can come to the concept of a text span. In spaCy, a text span is a list of tokens, and it has a start and an end token which are specified using the value of i.
(1:05 - 2:05)
So here you can look at the loop that I am using. First I initialize data, this is the data list, let us go ahead and move it over here, and I assign the start value to 0, because I will not be changing the start value. Then I create a loop over the numbers 1 through 5, and I create a span: you can see that if you take doc, open a square bracket, and put the starting index, a colon, and then the ending index, you get a text span in spaCy. Then I append to the list the start point, the end point, and the text that lies between the two.
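A sketch of that loop (same doc, pd and show objects as before):

data = []
start = 0                      # fixed start index for every span in this demo
for end in range(1, 6):        # the end index varies from 1 to 5
    span = doc[start:end]      # doc[start:end] is a Span covering tokens start .. end-1
    data.append({"start": start, "end": end, "text": span.text})

show(pd.DataFrame(data))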
(2:05 - 2:38)
Now I convert it into a data frame as before and display it in itables; let us see what this looks like. The start value is 0 for all the values in the range, and let us just change the range to 6 so that you get 5 rows here. So the start value is 0 for all of them, the end value is 1, 2, 3, 4, 5, and doc[0:1] is the first token, which is "vaccination".
(2:38 - 3:13)
doc[0:2] spans the first two tokens, so that is "vaccination site"; in the same way, the third span is "vaccination site pain", the fourth adds the semicolon, and the fifth adds the word "vaccination" that comes right after it. So it just displays all the text contained in that span; that is what span.text gives you.
(3:13 - 3:35)
So this is how you use a span in spaCy. Now there is also a way to get the character-level position of a token if you want it; it is called token.idx, and it is different from i, which is the word-level index. If you want the character-level index, you use .idx, so let us go ahead and run this.
(3:36 - 4:21)
It is the same idea: I have the start, the end and the text, then I have the next token, which is the token that comes right after the span, and then the index position of that token. You can see that it goes from 0, and the next one is 12, then 17, 21 and 23. So this is different from the token's .i; this is not the word-level index but rather the character position where that particular token starts within the string, the text, the document that we are considering. That is what idx represents here.
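A sketch of that variant, adding the token that follows each span and its character offset:

data = []
start = 0
for end in range(1, 6):
    span = doc[start:end]
    next_token = doc[end]          # the token that comes right after the span
    data.append({
        "start": start,
        "end": end,
        "text": span.text,
        "next_token": next_token.text,
        "idx": next_token.idx,     # character offset of that token within the original text
    })

show(pd.DataFrame(data))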
(4:21 - 5:14)
And this also shows you one more thing about what we are getting from the visualization approach that I have been showing you. When you construct a data frame with a set of columns and then use itables to display it, you can just keep adding columns one after the other, and I do not have to do any extra work to make sure it is rendered in a tabular format with all the features you see, which come out of the box with the itables library. That is the big advantage of doing it this way, and that is how I was able to do this demo.
(5:14 - 5:37)
I think this demo is easy to visualize, and that is the whole purpose: you can see the start point, the end point, what text is inside, what the next token is, and the index position of the next token. Rendering the whole thing in this tabular format just makes it very easy to take in all this information at a glance.
Visualizing spaCy dependency parse tree
(0:00 - 1:04)
Alright, so next we are going to look at the dependency parse tree. displaCy is a built-in spaCy visualization tool that lets you visualize the dependency parse tree of a spaCy document, and the way it works is that every sentence has a dependency tree, that is, you get one tree per sentence. While the dependency parse tree displays a lot of information, we are mainly interested in the structure and the relative positions of the words in the tree. Now, you can do a lot of advanced things if you have a very strong understanding of English grammar; I will be very honest, English is only my second language, so I do not have a very strong understanding of English grammar. I usually use the dependency parse tree just for the relative positions, and you will see that doing that alone will allow you to write some rules which can be very powerful.
(1:04 - 2:14)
So if you look at the code here, I import displacy, then we construct the document object, I get the first sentence, which I can do by calling next on the doc.sents generator, and then I call displacy.render. Let us go ahead and run this. What you see is that displaCy renders this information as an SVG image, where the words are listed in the order in which they appear. So you have "vaccination site pain" and so on; you can read the sentence just by looking at this image, it is in order, but the tree itself is constructed in such a way that you have arrows going from the head token to the child token.
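A sketch of that cell, assuming the nlp object and text variable from the earlier setup:

from spacy import displacy

doc = nlp(text)
first_sentence = next(doc.sents)   # doc.sents is a generator; next() gives the first sentence Span

# Render the dependency parse of the first sentence as an inline SVG.
displacy.render(first_sentence, style="dep", jupyter=True)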
(2:14 - 4:18)
So every token has a head. In this case, let us consider the token "was": it has a head, which is the token "received", a verb. "case" also has "received" as its head, the word "this" has the token "case" as its head, and so forth. If you look at the entire list of tokens in a given sentence, you will find that all of them have their respective heads, and this is how you construct the tree data structure. If you scroll down, that is what I mention here: every token has a head attribute, and every sentence has something called the root token. The root token is the one which has no head; the head of the root token is the token itself. So if you were to look at this particular sentence, let us see if "received" has a head: "entered" is the head of "received", and it looks like "entered" has no head of its own, so the token "entered" is the root of this particular sentence; as I mentioned, the head of the root token is the token itself. Let us go ahead and run this, and this is what it lists: for every token it displays the head, and you can see that "entered" has "entered" as its head, that is, "entered" is the root token. For a given token you can also access its sentence using token.sent, and then you can access the sentence's properties, such as its text. So let us run this, and here I have the text, the head, and then the sentence corresponding to each token. The other interesting thing is that every token has something called a subtree, which is the portion of the tree that corresponds to just that token.
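A sketch of the head table described above, reusing the earlier objects:

data = []
for token in first_sentence:
    data.append({
        "text": token.text,
        "head": token.head.text,      # the root token is its own head
        "sentence": token.sent.text,  # the sentence (a Span) this token belongs to
    })

show(pd.DataFrame(data))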
(4:19 - 6:23)
So let us just run this and look: for "vaccination", the subtree is also just "vaccination", and it helps to consult this figure here. "vaccination" has no children, so its subtree is just "vaccination". On the other hand, if you look at "site", it has one child, "vaccination", so "vaccination" and "site" together form the subtree of the token "site". For "pain" it is "vaccination site pain" together, and actually more than that: if you scroll down, for "pain" it is "vaccination site pain" plus "vaccination site erythema", so let us see if that is confirmed, yes. You can see that for "pain" you have these as children, but there is also an arrow going over here to a token which has its own set of children, so this is a good example: when you look at the "pain" token, all of these tokens are descendants, and that forms the entire subtree. It does not have to be just one level down, it can be multiple levels down, so if you start with "pain" as your current root, it has all these tokens as descendants, and that is the whole subtree, which is what you see when you look at this table. And how do I construct it? I take token.subtree, which gives a sequence of tokens, loop through it, get the text of each token, use a join to build the phrase, and as usual append it to the data list, construct a data frame from that list, and show it as a table using itables.
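A sketch of the subtree table, with the same assumptions as above:

data = []
for token in first_sentence:
    # token.subtree yields the token and all of its syntactic descendants, in document order.
    phrase = " ".join(t.text for t in token.subtree)
    data.append({"text": token.text, "subtree": phrase})

show(pd.DataFrame(data))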
(6:23 - 7:04)
So to summarize: you have the dependency tree, and displaCy lets you visualize it. Every token has a head attribute, every sentence has a single root token, and you identify the root by looking at its head, which should be the token itself. You can access the sentence from a token and access the sentence's properties from there, and every token has a subtree which includes the token itself; you can get the full text of a subtree by doing something like this. So that is what I was able to demonstrate in this video.
Named Entity Recognition
(0:00 - 0:33)
Another common task people do with spaCy is called Named Entity Recognition. And I like this definition best: named entity recognition is a component of natural language processing that identifies predefined categories of objects in a body of text. If you think about it, this can apply to a lot of different categories of objects, and you will see an example of this right away.
(0:33 - 0:52)
So what I am going to do is use the displaCy visualization tool, which can visualize both dependencies and entities. You get the parse tree by using style equals "dep", and you get the entities by using style equals "ent". Let us go ahead and run this.
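Roughly like this, assuming the doc object from the earlier sketches:

# style="dep" draws the parse tree; style="ent" highlights named entities inline in the text.
displacy.render(doc, style="ent", jupyter=True)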
(0:52 - 1:26)
And you can see that this renders in a different way: it renders as a paragraph of text, and every time spaCy identifies an entity, it highlights it in a distinctive way, putting a rounded rectangle around the word with the entity type as a subscript. So here it has identified "myalgia" as a PERSON, which is wrong, by the way; that is just how the entity recognition came out.
(1:26 - 1:54)
In this particular case it is not working correctly, but I am just using it as an example of how the entity visualization works in spaCy. So let us clear this, and let us do the usual thing we do to see the list of entities. In this case, you get the list of entities by calling doc.ents.
(1:54 - 2:42)
So I am going to iterate through each entity. Every entity has a label, which you access using .label_, and of course there is text inside the entity, so I use ent.text to display the entity text. One important thing to remember is that an entity is not a single token: in fact, if I hover over it, you can see it says it is a span, so it can be one or more tokens. Then you construct the data frame and so on, and here you have the output of running this code. You can see all these entity labels: it says "myalgia" is PERSON, 16 is CARDINAL, Moderna is ORG, and so on.
(2:42 - 3:29)
Now, you may be wondering: this is great, but what do these labels mean? What you can do is call spacy.explain on the entity label, and it gives you a definition of what it means. So let us go ahead and do that, and you can see that PERSON just means people, CARDINAL means numerals, ORG is organizations, GPE stands for countries, cities and states, basically locations, and it does this for every entity it identifies. In fact, I also include the phrase which contains the entity, so that I can highlight where it appears in the context of that particular sentence.
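A sketch of the entity table described above, with the same assumptions as the earlier sketches:

data = []
for ent in doc.ents:                   # doc.ents holds the recognized entities
    data.append({
        "text": ent.text,              # an entity is a Span and can cover several tokens
        "label": ent.label_,
        "explanation": spacy.explain(ent.label_),  # human-readable meaning of the label
        "phrase": ent.sent.text,       # the sentence containing the entity
    })

show(pd.DataFrame(data))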
(3:30 - 3:50)
So this is how you do named entity recognition out of the box: if you want to use the built-in model, you just call doc.ents and it gives you the list of entities; you can iterate through each of them, and you have to remember that an entity is not a single token but a span of tokens.
Token is_ attributes
(0:00 - 1:13)
In a previous lesson, I mentioned how you can use is_stop and is_punct to identify whether a token is a stop word or punctuation. As it happens, these is_ attributes of tokens let you test a token for a lot of different things, and this is the list of possible things you can test for. You can check whether the token text consists of alphabetic characters using is_alpha; is_ascii lets you check if the token consists only of ASCII characters; is_digit checks if it is just digits; is_lower checks whether the token is entirely lower case; is_punct checks whether it is punctuation; is_space lets you check if a token is just whitespace; is_stop checks whether it is a stop word; is_title is interesting because it checks whether the token text is in title case, where the first letter is upper case and the remaining letters are lower case, which can be very useful for testing whether a token is the start of a sentence, since that is usually in title case.
(1:13 - 1:44)
And is_upper, of course, checks whether the token text is entirely upper case. Let us run it with the usual system, and here you have all these attributes listed for every single token; let us take a look at the list of columns we have: is_alpha, is_ascii and so on. Now, in this particular visualization I have also added a search builder, and again this comes out of the box with the DataTables jQuery library.
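A sketch of that table; the search builder seen in the video is a DataTables extension that itables can enable through its options, whose exact names depend on the itables version, so this sketch just builds the plain table:

data = []
for token in doc:
    data.append({
        "text": token.text,
        "is_alpha": token.is_alpha,
        "is_ascii": token.is_ascii,
        "is_digit": token.is_digit,
        "is_lower": token.is_lower,
        "is_punct": token.is_punct,
        "is_space": token.is_space,
        "is_stop": token.is_stop,
        "is_title": token.is_title,
        "is_upper": token.is_upper,
    })

show(pd.DataFrame(data))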
(1:44 - 2:32)
What it allows you to do is build up a search. You can take any column name and use it as part of this kind of search interface. For example, if I want to see everything here that is a title: let us set is_title contains true, and you can see these are the tokens for which is_title is true; they all have an uppercase first character and the rest are lower case. Now let us check is_space.
(2:32 - 3:43)
So is_space contains true, and you can see that these are all empty, which is exactly what you would expect. How about checking for stop words? These are the stop words, which makes sense. We can also check whether any word is completely uppercase, and you do see that there are a few words like that, all in uppercase. And how about is_digit, do we have any numbers? You can see these are all tokens where is_digit is true, because they all correspond to numbers. I will just point out that sometimes the token parsing is poor, which is why you see this 19 which is part of COVID-19; you probably do not want it to be an individual token by itself, you want COVID-19 to be a single word and a single token. I am just pointing that out because once the split is done and the token is extracted, that token will still have is_digit as true.
Token like_attributes
(0:00 - 1:04)
Just like the is_ attributes, a token in spaCy also has like_ attributes, and these are the three that spaCy provides: like_email, like_num and like_url. The reason it is called like and not is, of course, is that this is a bit more of a fuzzy match, and it works even in cases where it is not a perfectly formatted email, number or URL. For the case of numbers it is even more interesting, because it also catches words that represent numbers: if you write the number 7 as the word "seven", for example, like_num will still be true. So let us go ahead and run this in the usual way; you can see that you have like_email, like_num and like_url, and let us add a search on these.
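A sketch of that table, same assumptions as before:

data = []
for token in doc:
    data.append({
        "text": token.text,
        "like_email": token.like_email,  # fuzzy match for email-like tokens
        "like_num": token.like_num,      # True for "7" and also for words like "seven"
        "like_url": token.like_url,      # fuzzy match for URL-like tokens
    })

show(pd.DataFrame(data))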
(1:05 - 2:08)
So what I am going to check is whether there is anything here where like_email equals true, and you can see that there is no token in this text which looks like an email. How about like_url, is there anything that looks like a URL? No, there is not, but for the case of numbers I am sure some will come back true, and there you see all these numbers showing up. In terms of like_num, it is different from is_digit: you can see that "second" is text, all alphabetic characters, but it still shows up as like_num because it represents a numerical value. In the same way, "first" also shows up with like_num equals true, so you can see where this differs from is_digit.
(2:08 - 2:24)
So you can use these like_ attributes of a token to check whether it is like a number, like a URL, or like an email. As far as I am aware there are only these three; I do not think there are any other built-in like_ attributes for a token.
More token attributes
(0:00 - 1:25)
dep_ is the token's dependency label, and you can get an explanation of it by calling spacy.explain on token.dep_. Then lemma_ is the lemma form of the token, lower_ is the lower case form, morph is the token's morphological analysis, norm_ is the normalized form of the token text, and orth_ is the exact verbatim text of the token. If you were to run this, and I have already run it, you can see here that the dep_explain column provides an explanation of what the dependency label actually means; you can see, for example, that this one is supposed to be an appositional modifier and that one an adjectival modifier.
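A sketch of that table, with the same assumptions as the earlier sketches:

data = []
for token in doc:
    data.append({
        "text": token.text,
        "dep": token.dep_,                         # dependency label
        "dep_explain": spacy.explain(token.dep_),  # human-readable explanation of the label
        "lemma": token.lemma_,
        "lower": token.lower_,
        "morph": str(token.morph),                 # morphological analysis
        "norm": token.norm_,
        "orth": token.orth_,                       # exact verbatim text of the token
    })

show(pd.DataFrame(data))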
I am going to be honest, I do not know exactly what a lot of these mean; I have an idea, but I do not use them for generating spaCy rules unless there is no other choice. Still, it is good to know all of this, and it is of course fine to use it in a read-only way, to understand how things work, but you probably want to be quite cautious when you use it in your actual rules, because the rules of the English language are not always very consistent.
(1:26 - 2:35)
So this is just a list of all this information: you can see the lemma, the lower case form and the normalized form of the text; orth_ gives you, as it says, the exact verbatim text of the token, and morph is the morphological analysis, where you can see things like the number being singular and the type of the pronoun. So this is a good way to see what each token represents, but I will also add that I would be somewhat cautious about using these attributes as part of your spaCy rules. When you want to extract information, it is probably best to stick to the simplest signals, like the text of a token and its relative position in the dependency tree. If you are going to use the dependency label, or the morphological analysis and so on, you may want to do a lot of testing to confirm that the results are what you actually expect.
Remaining token attributes
(0:00 - 1:21)
Let us take a look at the remaining attributes of a token. You have pos_, which is the part of speech tag, which we have already seen; then you have sent_start, which indicates whether the token is the start of a sentence; you have the shape, and I will show you what the shape is when we visualize this; and then tag_ is the token's fine-grained part of speech. So let us go ahead and run this, and the first thing I want you to notice is that I have this tag_explain column, which calls spacy.explain on token.tag_. Let us go in reverse order: the tag can be one of these values, and NN means noun, singular or mass, and this one is punctuation. There are cases where you will use the tag, and other cases where you will use the coarse part of speech, depending on your requirements. The shape is kind of interesting: you can see that if it is a word, you have all these x marks, and if it is, say, in title case, the first character is uppercase, so the shape will also have the first X in uppercase. If you look at a date, you can see the shape resembles a date, where d represents a digit.
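A sketch of that table; note it uses the boolean is_sent_start attribute, whereas the table in the video shows the older numeric sent_start values:

data = []
for token in doc:
    data.append({
        "text": token.text,
        "pos": token.pos_,                     # coarse part of speech
        "is_sent_start": token.is_sent_start,  # True at the start of a sentence
        "shape": token.shape_,                 # e.g. "Xxxxx" for title case, "dd" for digits
        "tag": token.tag_,                     # fine-grained part of speech
        "tag_explain": spacy.explain(token.tag_),
    })

show(pd.DataFrame(data))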
(1:21 - 2:08)
So you can look at the shape and you might be able to infer that it is actually a date; using the shape can sometimes be helpful. Then you have the sentence start; let us use the search builder for this. If sent_start equals 1, these are the sentence-starting tokens, and you can see that they usually have the first letter in uppercase. And then the part of speech tag is something we have already seen before, that is the spaCy part of speech tag. So these are the remaining attributes you have on a token, and you will sometimes be able to use them to create your rules for identifying or extracting information.
Visualizing spaCy subtree
(0:00 - 2:44)
One of the advantages of using itables is that it supports HTML content inside table cells. This means you can visualize the subtree that we have already seen before by taking the span of the subtree, that is, all of its tokens, and adding some HTML markup around that span. So this is how you are going to do it.
For each token in the document, you get its subtree, token.subtree, and I construct a list of each subtree token's i, the word-level position, which gives a sequence of consecutive numbers. Then I define subtree_left_i as the first element in that list and subtree_right_i as the last element. sent is the sentence corresponding to the token; we need that so we can do the highlighting. Then I build a formatted sentence, and this is how it works.
I iterate through each token in the sentence, not the whole document, just the sentence, and extract its text. If I find that the token's i, its position, and remember this is relative to the document and not to the beginning of the sentence, equals subtree_left_i, then I prepend a bold tag and an underline tag; if it is equal to the rightmost element of the subtree, I close those two tags in sequence. Then the formatted sentence is built by just adding the token text to it.
So if the token does not match either of these conditions, it just appends the text; if it does match, it adds these tags, either prepending or appending them. Then I create the data list as before, convert it into a data frame, and show the information using itables. So let us run this, and this is what it looks like. For the text "vaccination", this is the subtree; if the text is "site", whatever is underlined is its subtree; for the token "pain", the whole highlighted part forms its subtree, and you can keep going like that: "entered" has a very large subtree, as you can see, and I think that is the entire sentence.
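A sketch of that loop; if your itables version escapes HTML in cells by default, look for its option to allow raw HTML:

data = []
for token in doc:
    subtree_i = [t.i for t in token.subtree]  # document-level indices of the subtree tokens
    subtree_left_i = subtree_i[0]
    subtree_right_i = subtree_i[-1]
    sent = token.sent

    formatted_sentence = ""
    for t in sent:
        piece = t.text
        if t.i == subtree_left_i:
            piece = "<b><u>" + piece           # open the highlight at the left edge of the subtree
        if t.i == subtree_right_i:
            piece = piece + "</u></b>"         # close it at the right edge
        formatted_sentence += piece + " "

    data.append({"text": token.text, "subtree": formatted_sentence})

show(pd.DataFrame(data))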
(2:44 - 2:59)
In other words, because itables supports this kind of HTML markup, you can use it to highlight specific information and make it easier to visualize. That is what I have shown with this subtree example.
Visualizing spaCy token heads
(0:00 - 0:56)
Now that you have seen that we can use HTML inside the itables library, one more thing you can do is add font colors to visually distinguish specific information. So in this example I am going to use the same technique to show a token and its head, and visualize how they look within a given sentence. I use the same approach as the previous lesson: iterate through each token in the document, get the sentence corresponding to the token, and construct a formatted sentence. The way I do that is by looking at each token inside that sentence, this is the inner token, token1, and getting its text; then I check whether its i, the position of the token, matches the outer token's i, which would mean it is the same token.
(0:56 - 2:34)
In that case I wrap it in a red font. If the inner token's i matches the outer token's head, that is, the inner token is the head of the outer token, and the i values of the head and the token are not the same, which only happens for the root of the sentence, where I want to skip the highlighting, then I wrap that token in a blue font. Then I construct the entire sentence, append it to the data list, convert it into a data frame, and show it using the itables library. So this is what it looks like: you can see that the token "vaccination" has the head "site", and in this case the head is right next to it. There are cases where you will notice, for example with "pain" and "received", that "pain" appears first but the head is quite far away from the token; "received" is the head of "pain", but you can see they are not contiguous. And there are also situations where the head comes before the token: here "pain" is the head of "erythema", but it appears before the "erythema" token in the text of the sentence. So you can keep going and keep looking at the different tokens and their heads, just to see their relative positions.
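A sketch of that loop, with the same caveat about HTML rendering in itables as in the previous lesson:

data = []
for token in doc:
    sent = token.sent
    formatted_sentence = ""
    for token1 in sent:
        piece = token1.text
        if token1.i == token.i:
            # the token itself, in red
            piece = '<font color="red">' + piece + "</font>"
        elif token1.i == token.head.i and token.head.i != token.i:
            # the token's head, in blue (skipped for the sentence root, whose head is itself)
            piece = '<font color="blue">' + piece + "</font>"
        formatted_sentence += piece + " "

    data.append({"text": token.text, "head": token.head.text, "sentence": formatted_sentence})

show(pd.DataFrame(data))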
(2:34 - 2:51)
So what we have done here is use the fact that you can add HTML to itables tables to add font colors, so that we can better distinguish the specific information we are interested in.