Improving xAI Grok accuracy for structured information extraction

Measuring the baseline accuracy

import instructor
from pydantic import BaseModel
import os
from openai import OpenAI
from dotenv import load_dotenv
import json
import pandas as pd
import time

load_dotenv()
XAI_API_KEY = os.getenv("XAI_API_KEY")

class PatientInfo_v0(BaseModel):
    days_to_symptom_onset: int

client = instructor.from_openai(OpenAI(api_key=XAI_API_KEY,
                                       base_url="https://api.x.ai/v1", ))

df: pd.DataFrame = pd.read_csv('csv/japan_100.csv')

file_name = 'xai_numdays_map_v0.json'
with open(f'json/xai/{file_name}', 'r') as f:
    xai_numdays_map = json.load(f)

num_rows = len(df)

for index, row in df.head(num_rows).iterrows():
    try:
        start_time = time.time()
        vaers_id = str(row['VAERS_ID'])
        if vaers_id in xai_numdays_map.keys():
            print(f'Skipped {vaers_id}')
            continue
        symptom_text = row['SYMPTOM_TEXT']
        print(f'Index = {index} Processing {vaers_id} Length = {len(symptom_text)}')
        # Extract structured data from natural language
        pinfo = client.chat.completions.create(
            model="grok-beta",
            response_model=PatientInfo_v0,
            messages=[{"role": "user", "content": f'{symptom_text}'}],
        )

        end_time = time.time()
        duration = end_time - start_time
        json_output = pinfo.model_dump(mode='json')
        json_output['duration'] = duration
        xai_numdays_map[vaers_id] = json_output
        if index % 10 == 0:
            with open(f'json/xai/{file_name}', 'w+') as f:
                json.dump(xai_numdays_map, f, indent=2)
    except Exception as e:
        print(e)

with open(f'json/xai/{file_name}', 'w+') as f:
    json.dump(xai_numdays_map, f, indent=2)
(0:00 - 0:16)
All right, as we get started, you have to install all these libraries as usual. I have already done that, so you will notice that it says all the requirements are already satisfied. Now, let us clear the outputs.

(0:17 - 2:03)
What I want to mention here is that in this set of lessons, I am not going to run any code that calls the APIs. I have already run this code and stored the results, obviously to save time, because I do not want to be making those calls while recording the video. Instead, you have this xai folder containing all the corresponding JSON files; just open the folder and you will see the list of files.

You can take a look at them for reference; all these files were produced by calling the xAI API. Now, what exactly are we doing here? In this lesson, we are going to take each report from the japan_100.csv file and try to figure out the number of days to first symptom onset.

That value is represented by the NUMDAYS field, which is already present in the CSV file. And what I am going to do is call the xAI grok-beta model.

That is the only model available through the xAI API right now. xAI has provided $25 of free credit for everyone, basically to have people test their LLM. What I have seen so far is that the results are okay, but xAI certainly has a long way to go, at least when it comes to this task of extracting structured data.

(2:03 - 4:47)
So, again, I am going to get the API key from the environment variable and load it into this variable. Then, if you look at the class that we are going to use, this is what I call the baseline: use the simplest, shortest field name.

I am just going to call it days_to_symptom_onset, and it is an int. Then you have all this code which, just as before, iterates through the CSV file and gets all the results from the API. One important thing that I forgot to mention is that in this lesson I am using the instructor Python library.

I am doing that because xAI does not provide native support for this kind of call; you cannot just specify the response model as a Python class. I am able to do that because I am using the instructor library. Because of that, there is one additional step where you say instructor.from_openai.

Why does it say from_openai? That is because the xAI API supports the OpenAI Python client; they made it completely compatible, so I can call it as if I am using the OpenAI client here.

So it says instructor.from_openai, you pass in the API key, and the base URL obviously has to be the one corresponding to xAI, not OpenAI. Next, we have the file name the results are stored in: xai_numdays_map_v0.json. This is the first iteration, and I have already run this code.

What happens when I run this is that any record already in the JSON file is simply skipped. So when I click run, you see that it skips all of them, because the file has already been saved. Now you can use that saved file to calculate the accuracy of whatever came back from the xAI API.

Here I am going to run this code and show you what it does. As before, you have the VAERS ID, which is a clickable link; the second column is NUMDAYS from the CSV file; and the third column is the number of days calculated by the large language model, which you get by looking up the VAERS ID in xai_numdays_map and reading the days_to_symptom_onset value. If these two values do not match, I assign a variable called mismatch to be True.

(4:47 - 5:34)
What we are looking for here is two things. First, if mismatch is True, that means the two numbers do not match: what you already have in the CSV file is different from what the LLM reported. We have 42 entries like that, which is actually a lot.

On the other hand, if you filter for mismatch equals False, these are the cases where the LLM agrees with whatever is in the CSV file, and you can see that out of the 100 entries we started with, only 58 have agreement between what is in the CSV file and what the LLM returned. In other words, this has an accuracy of 58 percent.

So, the baseline accuracy that we are starting with is only 58 percent.
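For reference, here is a minimal sketch of what that accuracy check might look like. The exact comparison code is not shown in this lesson, so treat the NUMDAYS column name and the helper structure below as assumptions based on the VAERS CSV format and the JSON files saved above; the actual notebook displays the table with itables and clickable VAERS IDs.

import json
import pandas as pd

# Hypothetical sketch of the mismatch check, not the exact notebook code.
df = pd.read_csv('csv/japan_100.csv')
with open('json/xai/xai_numdays_map_v0.json') as f:
    xai_numdays_map = json.load(f)

rows = []
for _, row in df.iterrows():
    vaers_id = str(row['VAERS_ID'])
    llm_value = xai_numdays_map.get(vaers_id, {}).get('days_to_symptom_onset')
    csv_value = row['NUMDAYS']  # column name assumed from the VAERS export
    rows.append({
        'vaers_id': vaers_id,
        'csv_numdays': csv_value,
        'llm_numdays': llm_value,
        'mismatch': llm_value != csv_value,
    })

results = pd.DataFrame(rows)
matches = int((~results['mismatch']).sum())
print(f'Agreement: {matches}/{len(results)}')  # baseline run: 58/100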

Use a descriptive field name

import instructor
from pydantic import BaseModel
import os
from openai import OpenAI
from dotenv import load_dotenv
import json
import pandas as pd
import time

load_dotenv()
XAI_API_KEY = os.getenv("XAI_API_KEY")

class PatientInfo_v1(BaseModel):
    days_to_earliest_symptom_onset: int

client = instructor.from_openai(OpenAI(api_key=XAI_API_KEY,
                                       base_url="https://api.x.ai/v1", ))

df: pd.DataFrame = pd.read_csv('csv/japan_100.csv')

file_name = 'xai_numdays_map_v1.json'
with open(f'json/xai/{file_name}', 'r') as f:
    xai_numdays_map = json.load(f)

num_rows = len(df)

for index, row in df.head(num_rows).iterrows():
    try:
        start_time = time.time()
        vaers_id = str(row['VAERS_ID'])
        if vaers_id in xai_numdays_map.keys():
            print(f'Skipped {vaers_id}')
            continue
        symptom_text = row['SYMPTOM_TEXT']
        print(f'Index = {index} Processing {vaers_id} Length = {len(symptom_text)}')
        # Extract structured data from natural language
        pinfo = client.chat.completions.create(
            model="grok-beta",
            response_model=PatientInfo_v1,
            messages=[{"role": "user", "content": f'{symptom_text}'}],
        )

        end_time = time.time()
        duration = end_time - start_time
        json_output = pinfo.model_dump(mode='json')
        json_output['duration'] = duration
        xai_numdays_map[vaers_id] = json_output
        if index % 10 == 0:
            with open(f'json/xai/{file_name}', 'w+') as f:
                json.dump(xai_numdays_map, f, indent=2)
    except Exception as e:
        print(e)

with open(f'json/xai/{file_name}', 'w+') as f:
    json.dump(xai_numdays_map, f, indent=2)
(0:00 - 2:34)
Alright, now that you have some idea of how we are running this code and the steps we are taking, let us look at this PatientInfo Python class. When I just said days_to_symptom_onset, it is not very clear to the LLM that what I mean is the earliest symptom onset.

If there are multiple symptoms, I want the earliest one, because that is how it is supposed to be recorded and how you see it in the VAERS CSV files. So I changed the field from days_to_symptom_onset to days_to_earliest_symptom_onset, and that is the only change I have made.

I also changed the class name to PatientInfo_v1, which is what I am using as the response model, and the file name has been changed to xai_numdays_map_v1.json. Once again, when I run it, it will skip everything because I have already run this code once and saved the results into the xai folder. If I open this v1 file, this is what it looks like: the VAERS ID, the days_to_earliest_symptom_onset field, and the duration, which I always store as a way to see how fast this is running.

Once I have all these values, we can do the accuracy check one more time. It is very similar to the check from the previous lesson. I am going to run it and filter for mismatch equals False.

These are the cases where it does match, that is, the LLM agrees with what is in the CSV file, and you can see that the count went from 58 to 59. It is a very tiny improvement: just one more result agreed with the CSV file.

Strictly speaking, that framing is not exactly right: the 58 matching entries are not necessarily an exact subset of the 59, so you may have to dig a little deeper into that. But for now we will take it as it is: the accuracy increased from 58 percent to 59 percent just by giving a more descriptive name to the field you are trying to extract.

Add Field description

(0:00 - 2:40)
One option, of course, is to use a longer field name to specify exactly what you are looking for, but that is not always feasible because it can make the field name too long and pretty cumbersome. A better option is to use field descriptions to give a hint to the LLM. Because we are using Pydantic, you can import Field from Pydantic and use it inside your class definition. You can see here it says days_to_symptom_onset: int = Field(...) with a human-readable description, which, by the way, instructor also sends as part of the prompt.
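As a rough sketch, the v2 class probably looks something like this; the exact description string is an assumption pieced together from what is said in this lesson.

from pydantic import BaseModel, Field

class PatientInfo_v2(BaseModel):
    # instructor sends the Field description to the LLM as part of the prompt.
    days_to_symptom_onset: int = Field(
        description="If there are multiple symptoms, choose the earliest one."
    )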

"If there are multiple symptoms, choose the earliest one": that is the only additional instruction I have provided. Other than that the code is the same; this is v2 of the class, and the file name is now xai_numdays_map_v2.json. If I run it, you might remember that I have already stored the results in the JSON file.

These are all skipped, but we can calculate the accuracy based on the values that were stored. What we are looking for, once again, is mismatch equals False, and you can see that the accuracy has jumped from 58 percent to 64 percent.

We started with 58 and now it is at 64, which I will say is a pretty good improvement, considering the only thing we really did was add the description parameter to the class definition. In fact, the field name has gone back to days_to_symptom_onset, which is a little more concise than days_to_earliest_symptom_onset. So we can keep the original baseline name, use Field to give a proper description, and that alone improves the accuracy of the results: from about 58 percent to 64 percent this time.

This is an example of how you can add extra information to your class definition and get better results from the instructor client library and, in general, from the large language model you are using.

Add an Explanation field

(0:00 - 0:18)
All right, now we will look at a somewhat more advanced technique. When you ask the LLM to extract some kind of structured information, and this days-to-symptom-onset value is a really good example, it has to look at two pieces of information.

(0:18 - 2:36)
First it has to figure out the date of the last vaccination, then it has to figure out the date of the earliest symptom onset, and then it has to subtract the two dates; only then can it produce an int value representing the number of days to symptom onset, the difference between the vaccination date and the symptom onset date. That is already a multi-step reasoning process when the LLM is trying to do the calculation. So, based on this, we can ask the LLM to provide an explanation for why it came up with a specific answer.
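To make that implicit arithmetic concrete, here is a tiny sketch of the calculation the LLM has to perform internally; the dates are hypothetical, borrowed from an example discussed later in this series.

from datetime import date

# Step 1: date of the last vaccination; Step 2: date of earliest symptom onset;
# Step 3: the difference in days is the value we ask the LLM to return.
vaccination_date = date(2021, 7, 31)
symptom_onset_date = date(2021, 8, 13)
days_to_symptom_onset = (symptom_onset_date - vaccination_date).days
print(days_to_symptom_onset)  # 13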

What I have done in v3 of this class is keep days_to_symptom_onset as an int, same as before, and add a second field with the same name plus an _explanation suffix, which is a string. In its field description I say "Explain why you chose the value for the corresponding field", and I use "corresponding" because I might reuse the same description for different members of this class without having to change it.

In other words, the description is not variable-specific, it is more general. Once I do that, I am asking the large language model to also provide an explanation for why it chose a particular value. All the other code is just as before; the file is xai_numdays_map_v3.json this time.
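A sketch of what this v3 class might look like; the field names follow the naming described above, and the exact description strings are assumptions.

from pydantic import BaseModel, Field

class PatientInfo_v3(BaseModel):
    days_to_symptom_onset: int = Field(
        description="If there are multiple symptoms, choose the earliest one."
    )
    # The deliberately generic wording lets the same description be reused
    # for the explanation of any other field in the class.
    days_to_symptom_onset_explanation: str = Field(
        description="Explain why you chose the value for the corresponding field."
    )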

Let us run this; you can see that it skipped all of them because the file was already saved. Now, if I run the accuracy test and filter for mismatch equals False, you can see that there is an increase, but a very small one: it went from 64 percent to 65 percent when I ask for an explanation. At least it has not gone down, and there is a small improvement in the answers.

(2:36 - 2:47)
So it went from 64 to 65 percent when I ask the LLM for an explanation of why it chose that particular value for the number of days.

Use an Explanation class with matching sentence

(0:00 - 1:38)
In the previous video, we saw that you can ask the LLM to provide an explanation as a string, and it returns a value justifying why it chose the specific information for the variable we are interested in. But you can also turn that explanation into a class with two parts. The first part is the explanation itself, which we already saw in the last video.

What you can add to it is asking the LLM to find the matching sentence, if there is any. In this case the Explanation class has a field called matching_sentence_if_any, which is a string, and in its field description I say "Please output the sentence which supports the explanation". That means that when this information is extracted, along with the explanation the LLM is also going to find the matching sentence. As before, I have already run this code and got all the results back.
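Here is a sketch of how the Explanation class and the v4 patient info class might be structured, based on the fields described above; the exact names and description strings are assumptions.

from pydantic import BaseModel, Field

class Explanation(BaseModel):
    explanation: str = Field(
        description="Explain why you chose the value for the corresponding field."
    )
    matching_sentence_if_any: str = Field(
        description="Please output the sentence which supports the explanation."
    )

class PatientInfo_v4(BaseModel):
    days_to_symptom_onset: int = Field(
        description="If there are multiple symptoms, choose the earliest one."
    )
    days_to_symptom_onset_explanation: Explanation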

This is going to be v4 of the PatientInfo class, and the map file is going to be xai_numdays_map_v4.json. We can see that it skipped everything, but then we can go and look at the accuracy. Filtering for mismatch equals False is what we are interested in.

(1:39 - 2:12)
Notice that you now have 67 entries. In the previous example, with just the explanation field, the accuracy was 65 percent; now that the matching sentence has been added, it is 67 percent.

(2:12 - 3:46)
You can see that the accuracy is slowly increasing. If you think about it, what we are really doing is breaking everything into a step-by-step process: we are asking the LLM to think through and reason about everything it did, and then also show us its work. It is as if we are telling the LLM, "It is wonderful that you got the answer, but why don't you also show us your work?"

That is what we are doing when we ask for the explanation and, within the explanation, the matching sentence and so on. The advantage of adding all this information to the class definition is that what you get back is easier to understand, and it also benefits us in another way: we can see exactly how the LLM came to the conclusion of extracting that particular information. That is especially helpful when you have a very large block of text, like these VAERS reports; there is quite a lot of text to read, it is very dense, and it is not split into nicely formatted paragraphs.

So it is all the more helpful if the LLM just picks out the matching sentence and displays it to us, so we understand why it was extracting that information. That is what we did when we asked it for the matching sentence.

Add a matching phrase

(0:00 - 1:34)
In the previous lesson, we asked the LLM to also provide the matching sentence. What I am going to do in this lesson is ask for the matching phrase in addition to the matching sentence. You can see that I have two fields here: matching_sentence_if_any, a string whose description is "Please output the sentence which supports the explanation", and matching_phrase_if_any, also a string, where I say "Please output the specific phrase which supports the explanation". This is v5 of the class, the map file is called xai_numdays_map_v5.json, and as before the results are already saved into the appropriate folder.
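The v5 Explanation class presumably looks something like the following sketch; the field names and descriptions are taken from the narration, so treat them as approximations.

from pydantic import BaseModel, Field

class Explanation(BaseModel):
    explanation: str = Field(
        description="Explain why you chose the value for the corresponding field."
    )
    matching_sentence_if_any: str = Field(
        description="Please output the sentence which supports the explanation."
    )
    matching_phrase_if_any: str = Field(
        description="Please output the specific phrase which supports the explanation."
    )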

Now let us calculate the accuracy after adding this additional information and see what happens. When I filter for mismatch equals False, you can see that the accuracy has gone down to 61 percent; it was 67 before. This means it needs a bit more analysis, and it can happen for certain kinds of information, so I have to add the obvious caveat: do not use this tip if it does not help your particular use case or the kind of data you are extracting.

This is also a good example of why, after all this, there is still a fair amount of trial and error involved in the additional steps you can take to increase accuracy.

Prompt Engineering: Update the System instruction

(0:00 - 0:44)
Alright, we have tried a few things, many of them have worked, and you saw that one trick did not really help. But so far everything we have done would only qualify as tweaking certain aspects of how we are getting the information; none of it would actually qualify as prompt engineering. When you say prompt engineering, what you are usually doing is changing the text you send in the prompt itself, at least that is how I define it. Maybe behind the scenes some prompt engineering is happening no matter what we do, but we have not done any explicit prompt engineering so far.

(0:44 - 1:10)
That is one more trick we have up our sleeve that we can try. What we are going to do next is update the system instruction based on the explanations we got: we can look for patterns in the errors, see if there is anything common between them, and update the system instruction accordingly.

(1:11 - 1:37)
If you look at the explanations, what we see is that mild symptoms are usually ignored by the LLM; if there is a mild symptom, it acts as if there was no problem at all and concentrates only on the more severe symptoms the patient experienced. So what I am going to do in the next step is version 6 of the class.

(1:38 - 2:11)
Along with everything I have already done (by the way, I have removed the matching phrase because it was giving bad answers, so for now the Explanation class just has the explanation string and the matching sentence, plus the PatientInfo class we already had), I am also going to add a system prompt. You can see that in the messages I add a message with role "system", and in the content I say: for days to symptom onset, use the earliest date and also include mild symptoms.

(2:11 - 2:24)
There was one more thing I noticed: sometimes the month is provided but the day is not mentioned. Let us say the report states that the patient had some problem in June 2021, with no explicit mention of the day of the month.

(2:24 - 2:52)
In that case I ask the LLM to just use the first day of the month, so it would be 1st June 2021, because that is what they usually do in the VAERS CSV files. That way, when we do the comparison, if the LLM got it right the mismatch will be False; its answer will match whatever is in the CSV file.
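The call inside the loop would then look roughly like this; the system prompt wording below is a paraphrase of the narration rather than the exact text used, and it assumes a PatientInfo_v6 class plus the client and symptom_text variables from the earlier script.

pinfo = client.chat.completions.create(
    model="grok-beta",
    response_model=PatientInfo_v6,
    messages=[
        {
            "role": "system",
            "content": (
                "For days to symptom onset, use the earliest date and also "
                "include mild symptoms. If the month is given but the day of "
                "the month is not mentioned, use the first day of that month."
            ),
        },
        {"role": "user", "content": symptom_text},
    ],
)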

(2:53 - 3:07)
That is what I have changed; otherwise everything is the same. We can run this as before, and it will skip everything because the file has already been saved. Now let us take a look at the accuracy.

(3:07 - 3:42)
We filter for mismatch equals False, and once you scroll down you can see a really good jump: it has gone from 67 to 78, which is actually pretty significant. Remember that we started around 58, less than 60 percent, and now it is almost 80 percent. So by adding the system instruction, which is a perfectly fine way of doing prompt engineering, we were actually able to get more accurate results.

(3:42 - 4:19)
This is to say that prompt engineering still works, and these techniques are not mutually exclusive; you can do both. You can engineer the prompt, improve the system instruction, change the specific field names you are requesting, and add more information such as the matching sentence. You can do a bunch of things to improve the accuracy you get out of the box; these are all ways to tweak your system so that you get more accurate results.

(4:19 - 4:53)
For now, what I will say is that with all these changes you can get noticeably better results compared to where we started: we went from 58 percent to something close to 80 percent, which is a pretty significant improvement, although some people will say that 80 percent is not that great considering this is not an especially complex task. That is what we have, and in the next lesson I am going to tell you something very interesting that you have to watch out for.

Remember to verify if the gold standard dataset is accurate

(0:00 - 1:03)
If you think about it, what we have done until now is simply assume that the information in the CSV file is correct; based on that assumption we did the comparison, and whenever mismatch was True we declared that the LLM got the answer wrong, which decreased the accuracy. But it is possible that the data set you think of as the gold standard has errors. In this particular case, especially with the kind of complex information in the VAERS system, it is quite possible that the people who manually coded this information got it wrong. I am going to show you a very specific kind of mistake they make, which is actually kind of interesting and a bit funny, because it is a common pattern that seems to have gone unnoticed for a long time.

(1:03 - 2:10)
What am I talking about? Let us go back to the PatientInfo class, which I now have as v7. What I want to do now is ask the LLM to also get the date of the last vaccination, and to provide an explanation for it as before; the Explanation class consists of both the explanation string and the matching sentence, if any. So now we have two fields: we are expecting the days to symptom onset as well as the date of the last vaccination. Let us run this as before; I have already run it, so it skips everything because the file has been saved. Now, when we do the comparison in the itables library, I am going to display the vaccination date from the CSV file and also the vaccination date as extracted by the large language model.
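A rough sketch of what this v7 class might look like; the field names and the string type for the vaccination date are assumptions, since the actual class definition is not shown here.

from pydantic import BaseModel, Field

class Explanation(BaseModel):
    explanation: str = Field(
        description="Explain why you chose the value for the corresponding field."
    )
    matching_sentence_if_any: str = Field(
        description="Please output the sentence which supports the explanation."
    )

class PatientInfo_v7(BaseModel):
    days_to_symptom_onset: int
    days_to_symptom_onset_explanation: Explanation
    date_of_last_vaccination: str  # type assumed; could also be a date
    date_of_last_vaccination_explanation: Explanation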

(2:10 - 8:09)
You might be wondering why we need to do that; I am going to show you in just a second. Let us run this, and this time I am going to filter for mismatch equals True; these are the cases where there is a mismatch.

Notice the cases where there is a big difference between the two values; well, maybe that is not a good heuristic, so let us just look at this particular example. In the CSV file the number of days is 41, but the LLM calculated it as 13, which, as you might imagine, is a pretty big difference. If you look at the CSV file, the vaccination date is 7/3, that is, July 3rd, 2021, but the LLM got it as 31st July 2021.

How could this be the case? Why is it getting the wrong information for the vaccination date? You can see I also asked it for an explanation, and this is where the explanation is really helpful, because the matching sentence says: on 31st July 2021, the patient received the second dose of this vaccine. Now look at it from the perspective of 31st July 2021: the explanation is that the patient received the second dose on 31st July, and 13th August is when the symptom appeared.

The difference between 13th August and 31st July is 13 days, which means the LLM got it right, provided it got the vaccination date correct. So now I am going to click into this and open the VAERS ID history to see what was actually going on. If you look at the information provided in the report, you will find the relevant sentences.

Here it says that on 3rd July this patient received the first dose of the vaccine. Whoever was encoding this information into the CSV file looked at that sentence and stopped there. But you can see there is another sentence right after it which says that on 31st July 2021, the patient received the second dose of this vaccine.

We always need to use the most recent vaccination date, because that is how it is done with these drug adverse event reports: we want to see how many days elapse between the most recent intervention, in this case the second dose of the vaccine, and symptom onset, because if they are very close to each other there is reason to be concerned. Now let us go back to this case. If you count from 3rd July to 13th August, the CSV value is correct; but if you count from the correct vaccination date, the most recent one, which is 31st July 2021, then the LLM's value is the right one.

So this is an example where the LLM actually got the final answer correct, but our code marked it as wrong. These are the kinds of things you still need to be looking out for, and you will see more examples of that. There are also cases where the LLM gets the vaccination date right and the date of symptom onset right, but just does a bad calculation; I have seen that happen. A typical one is where you see a 0 in the CSV file and a 1 from the LLM, because it does not handle the case where the symptom starts on the same day as the vaccination.

Anyway, there are things like that which the LLM is also getting wrong. This is why I keep saying that xAI's large language model, grok-beta, is not as advanced as the other large language models at this moment. I think it is improving very fast, but it is still not as reliable for this particular task of structured information extraction as the other large language models that have been released.

So that is what you get, and I am just looking for other examples of the same pattern. I think this is one more: 9/19/2021 in the CSV versus 2021-10-17 from the LLM.

You can see that on 17th October the patient received the second dose, and if you click through, you will probably find that they got the first dose on September 19th. Let us see: yes, this is the date of the first dose, this is the date of the second dose, and we need to use the second dose for the calculation. So if you look at the days to symptom onset for this example, it should be just one day.

The symptom appeared the next day, but the CSV file gives you the wrong information. Anyway, I hope this makes it clear that you also need a gold standard data set that does not have any errors. Unfortunately, in this case the VAERS CSV file we are using already has some errors.

In fact, you can even use LLMs to cross-verify whether the data in the VAERS CSV files is correct to begin with. Anyway, that is what we have, and that is the last lesson in this series. What I have shown you here is that if the source data you are comparing against has mistakes, it can have a negative impact on your accuracy calculation, because sometimes the LLM is getting it right and it is what you think of as the gold standard that is actually wrong.

(8:10 - 8:26)
So you also have to take that into account when you consider the accuracy. Given all that, my guess is that the actual accuracy at the step we are at, with everything we have done, is probably more than 78 percent; I would estimate it is closer to 85 percent.