Improving Gemini Flash accuracy for structured information extraction
About the RECOVD field
(0:00 - 1:10)
Alright, so in this chapter we are going to look at improving the accuracy of the Google Gemini LLM, focusing on the field called "recovered". RECOVD is the column name in VAERS, and it indicates whether the patient had fully recovered from all their symptoms at the time of the report. To calculate this, we need to look at each individual symptom and then use the worst-case outcome among them. Among no, unknown and yes, the worst case is no, unknown sits in the middle, and yes is the best case. So we look at the outcome for each symptom and choose the worst of these three. For example, if all the symptoms have recovered but just one of them is unknown, the recovered field should still be unknown, which is marked as U. Recovering is considered a no.
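To make this aggregation rule concrete, here is a minimal sketch of the worst-case logic applied to a list of per-symptom outcomes. The function name and the plain strings used here are just an illustration, not code from the project:

# Illustrative sketch of the worst-case aggregation rule described above.
def aggregate_recovered(symptom_outcomes: list[str]) -> str:
    # Recovering counts as not recovered
    outcomes = ["No" if o == "Recovering" else o for o in symptom_outcomes]
    if "No" in outcomes:
        return "N"        # any unrecovered symptom makes the whole report No
    if "Unknown" in outcomes:
        return "U"        # otherwise a single unknown symptom makes it Unknown
    return "Y"            # only when every symptom has recovered is it Yes

# Example: all symptoms recovered except one unknown -> "U"
print(aggregate_recovered(["Yes", "Yes", "Unknown"]))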
So the first thing we will do is set up these libraries in preparation for the rest of the material in this chapter.
Calculating the baseline accuracy
import os
import json
import time
from enum import Enum

import pandas as pd
import google.generativeai as genai
from dotenv import load_dotenv
import typing_extensions as typing

# Load the Gemini API key from .env and configure the client
load_dotenv()
api_key = os.getenv('GOOGLE_GEMINI_API_KEY')
genai.configure(api_key=api_key)

class RecoveryStatus(Enum):
    Yes = "Yes"
    No = "No"
    Recovering = "No"  # same value as No, so Recovering is treated as "not recovered"
    Unknown = "Unknown"

class PatientInfo_v0(typing.TypedDict):
    recovered_from_all_symptoms: RecoveryStatus

df: pd.DataFrame = pd.read_csv('csv/japan_100.csv')

# Results are cached in this JSON file so that re-runs skip already-processed reports
file_name = 'gemini_recovered_map_v0.json'
with open(f'json/gemini/{file_name}', 'r') as f:
    gemini_age_map = json.load(f)

num_rows = 100
for index, row in df.head(num_rows).iterrows():
    symptom_text = row['SYMPTOM_TEXT']
    vaers_id = str(row['VAERS_ID'])
    if vaers_id in gemini_age_map.keys():
        print(f'Skipped {vaers_id}')
        continue
    print(f'Processing {index} vaers_id = {vaers_id}')

    start_time = time.time()
    model = genai.GenerativeModel(
        model_name="gemini-1.5-flash-latest",
    )
    result = model.generate_content(
        f'''
        Writeup:
        {symptom_text}
        ''',
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json", response_schema=PatientInfo_v0
        ),
    )
    end_time = time.time()
    duration = end_time - start_time

    # Parse the structured JSON response, record the call duration, and save after every report
    response = json.loads(result.text)
    response['duration'] = duration
    gemini_age_map[vaers_id] = response

    with open(f'json/gemini/{file_name}', 'w+') as f:
        json.dump(gemini_age_map, f, indent=2)
(0:00 - 1:47)
Alright, so we will start by calculating the baseline value for the accuracy. As before, I have already run this code and saved a copy of the results of these API calls in the project itself, so I will just go over the code to give you an overview of what is happening.
So you have this class called RecoveryStatus, which is actually an enum, and it has these four values: yes, no, recovering and unknown (with recovering mapped to no). The PatientInfo_v0 schema has just a single field called recovered_from_all_symptoms. Then I do the usual stuff and iterate through the japan_100 CSV file.
The file I will be saving the results into is called gemini_recovered_map_v0.json, and I am using the gemini-1.5-flash-latest model. If you run this, you will see that it just skips all these VAERS IDs because the information has already been saved.
Then we will do the accuracy calculation as before. I just want to point out that the LLM's value for recovered comes from looking up the VAERS ID in the saved Gemini map and reading its recovered_from_all_symptoms field. If it does not match the CSV value, the mismatch column is marked as true.
Also, the status in the VAERS file is just Y, N and U, while we use the words yes, no and unknown inside the enum, so I have a helper that maps these values to the corresponding words. That is the get_recovered_string method I have used here.
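The comparison code itself is not listed in this section, but a minimal sketch of what it could look like is shown below, reusing df, num_rows and gemini_age_map from the listing above. The helper name get_recovered_string follows the transcript; its body and the column names csv_recovered, llm_recovered and mismatch are my own illustration:

# Illustrative sketch of the accuracy comparison (not the course's exact code).
def get_recovered_string(recovd) -> str:
    # Map the single-letter VAERS codes to the words used in the enum
    return {'Y': 'Yes', 'N': 'No', 'U': 'Unknown'}.get(recovd, 'Unknown')

rows = []
for _, row in df.head(num_rows).iterrows():
    vaers_id = str(row['VAERS_ID'])
    csv_recovered = get_recovered_string(row['RECOVD'])
    llm_recovered = gemini_age_map[vaers_id]['recovered_from_all_symptoms']
    rows.append({
        'vaers_id': vaers_id,
        'csv_recovered': csv_recovered,
        'llm_recovered': llm_recovered,
        'mismatch': csv_recovered != llm_recovered,
    })

compare_df = pd.DataFrame(rows)
# Accuracy = fraction of rows where the LLM agrees with the CSV file
matches = compare_df[compare_df['mismatch'] == False]
print(f'{len(matches)} matches out of {len(compare_df)}')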
(1:47 - 2:21)
So, now once I run this, you will see that there is a bunch of mismatches. If mismatch equals false, that means the LLM got the same value as what the CSV file is reporting, and you can see that there are 63 entries like that.
So there are 37 rows where there was a mismatch, that is, the LLM calculated a value which differs from what is in the CSV file, while 63 times the values match. This means we start with a baseline accuracy of 63 percent.
Itemize composite field into individual values
import os
import json
import time
from enum import Enum

import pandas as pd
import google.generativeai as genai
from dotenv import load_dotenv
import typing_extensions as typing

# Load the Gemini API key from .env and configure the client
load_dotenv()
api_key = os.getenv('GOOGLE_GEMINI_API_KEY')
genai.configure(api_key=api_key)

class RecoveryStatus(Enum):
    Yes = "Yes"
    No = "No"
    Recovering = "No"  # same value as No, so Recovering is treated as "not recovered"
    Unknown = "Unknown"

# Each symptom is itemized with its own recovery status
class Symptom(typing.TypedDict):
    symptom_value: str
    symptom_recovery_status: RecoveryStatus

class PatientInfo_v1(typing.TypedDict):
    recovered_from_all_symptoms: RecoveryStatus
    symptom_list: list[Symptom]

df: pd.DataFrame = pd.read_csv('csv/japan_100.csv')

# Results are cached in this JSON file so that re-runs skip already-processed reports
file_name = 'gemini_recovered_map_v1.json'
with open(f'json/gemini/{file_name}', 'r') as f:
    gemini_age_map = json.load(f)

num_rows = 100
for index, row in df.head(num_rows).iterrows():
    symptom_text = row['SYMPTOM_TEXT']
    vaers_id = str(row['VAERS_ID'])
    if vaers_id in gemini_age_map.keys():
        print(f'Skipped {vaers_id}')
        continue
    print(f'Processing {index} vaers_id = {vaers_id}')

    model = genai.GenerativeModel(
        model_name="gemini-1.5-flash-latest",
    )
    start_time = time.time()
    result = model.generate_content(
        f'''
        Writeup:
        {symptom_text}
        ''',
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json", response_schema=PatientInfo_v1
        ),
    )
    end_time = time.time()
    duration = end_time - start_time

    # Parse the structured JSON response, record the call duration, and save after every report
    response = json.loads(result.text)
    response['duration'] = duration
    gemini_age_map[vaers_id] = response

    with open(f'json/gemini/{file_name}', 'w+') as f:
        json.dump(gemini_age_map, f, indent=2)
(0:00 - 0:22)
As we are trying to increase the accuracy, one of the first things we can obviously try is to itemize a composite field like recovered into individual pieces of information. In other words, the report already has a list of symptoms, so we can ask Gemini to provide the list of symptoms and see if that helps to improve the calculation.
(0:22 - 0:48)
So what I have done here is create a class called Symptom, which has symptom_value, a string, and symptom_recovery_status, which is based on the RecoveryStatus class. In PatientInfo_v1, recovered_from_all_symptoms is as before, but I have also added the symptom_list field, which is just a list of Symptom entries. That is the class I use in the response schema, and otherwise I do the same things as in the previous lesson.
(0:48 - 1:20)
So if I run it, you will find that it skips all the VAERS IDs because the JSON file has already been created. Now if you run the accuracy calculation, doing the same filter on mismatch equals false, the percentage is now only 62. It was 63 before, so it has come down slightly, although the difference is so small that it does not really change the overall picture.
(1:20 - 2:01)
What it does tell us is that itemizing by itself is not much help for the actual accuracy of the calculation. But splitting the report into individual symptoms is still useful in a different way, because you can now look at the individual symptoms for each of these VAERS reports. In other words, it allows you to infer that the LLM is at least able to extract the list of symptoms from the text itself.
(2:01 - 2:41)
Remember that the text is just a lot of dense information, one item after another. Take the example of VAERS ID 1960792, the first one in the list: you can see it is not matching, but if you click into it and open the actual VAERS report where the outcome is described, and search for the word outcome, you can see that this is very dense text. So the LLM has to do a good job of parsing out each one of these symptoms.
(2:42 - 3:00)
So what I will say is that at least we now know the LLM is looking at this list of symptoms, and I am hoping that it is actually extracting each one of them, so that we can build on this for the composite value in the next lesson.
Get explanations for the itemized values
(0:00 - 0:17)
Merely itemizing the information was not very helpful for increasing the accuracy. What we can try next is to see if we can get an explanation based on the itemized information. That is what I am doing in this lesson, and notice that I have added a class called Explanation.
(0:18 - 1:24)
It has matching_sentence_if_any, a string, and then the explanation itself, which is another string. In the Symptom class, symptom_value is a string, symptom_recovery_status is of type RecoveryStatus, and I have added symptom_recovery_status_explanation of type Explanation. That goes into PatientInfo_v2; the rest of the code is the same as before. If you run this, you can see that the information has already been downloaded to the local machine, and when I run the comparison, you see an additional field called explanation which I pulled from the JSON file. If you filter for mismatch equals false, you find that the accuracy is still at 62 percent, but we now have some insight into how to improve it, because we can look at the explanation as well as the information from the itemized list itself.
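The code listing for this lesson is not reproduced here; based on the class and field names mentioned above, the v2 schema could look roughly like this (a sketch, reusing the RecoveryStatus enum from the earlier listings):

# Sketch of the v2 response schema described above (field names follow the transcript).
class Explanation(typing.TypedDict):
    matching_sentence_if_any: str  # sentence from the writeup that supports the chosen value
    explanation: str               # the model's reasoning for that value

class Symptom(typing.TypedDict):
    symptom_value: str
    symptom_recovery_status: RecoveryStatus
    symptom_recovery_status_explanation: Explanation

class PatientInfo_v2(typing.TypedDict):
    recovered_from_all_symptoms: RecoveryStatus
    symptom_list: list[Symptom]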
(1:24 - 2:03)
So if you change the filter to mismatch equals true, these are the places where the values differ. Notice one case where the CSV value is no but the LLM says yes, and the matching sentence it quotes says that treatment continued to be needed. What this really tells us is that Gemini was not able to understand what that sentence means. Here is one more case where the CSV recovered is no and the LLM recovered is yes, yet the explanation says the patient's outcome is described as recovering; if the outcome is described as recovering, then recovered cannot be yes.
(2:03 - 2:22)
So this is wrong. By looking at the explanation, we get some insight into how Gemini is doing its internal inferences, and we can use that to do better in the next step.
Explicitly specify all values inside Enum if possible
(0:00 - 2:18)
When we were looking at the comparison table in the previous lesson, you might have noticed that in one of the cases where the mismatch was true, the LLM said the patient was recovering but still marked it as yes, although we know that recovering must also be a no. So one of the first things that comes to mind is: why not explicitly put all this information inside the enum itself? What I did is expand the RecoveryStatus enum with this additional information.
So yes stays yes, recovered is yes, and recovered with treatment is yes, but recovering is no, resolving (one more word I found) is also no, and recovering with treatment is also no, because the patient is still recovering. Otherwise this code is pretty much the same; we mark it as v3, but the only change is the improved enum class. So let us run this, and you can see that the file has already been saved.
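The expanded enum is not listed in this section; based on the values mentioned above, it could look roughly like this (the exact member names are my guess):

# Sketch of the expanded RecoveryStatus enum (member names are illustrative).
from enum import Enum

class RecoveryStatus(Enum):
    Yes = "Yes"
    Recovered = "Yes"                 # alias of Yes
    RecoveredWithTreatment = "Yes"    # alias of Yes
    No = "No"
    Recovering = "No"                 # still recovering, so it counts as No
    Resolving = "No"                  # alias of No
    RecoveringWithTreatment = "No"    # alias of No
    Unknown = "Unknown"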
Now if you run the comparison and filter for mismatch equals false, the value is still 63, so this does not make any big difference. But what I want to do now is look at the cases where the mismatch is true and read the explanations. Take one case: the CSV file says recovered is no, the LLM says yes, and the matching sentence says the patient's symptoms resolved and that on an unspecified date the outcome of the event was recovering. Even though the outcome was recovering, the explanation claims the patient's symptoms resolved, which hints that the LLM is not able to take the information in the matching sentence and calculate the correct value from it. You will find a few more examples like that if you go through the list after applying this filter.
So, let us take a look at how we can fix that in the next lesson.
Prompt Engineering: Improve the system instruction to specify complex logic
(0:00 - 0:37)
So we made a whole bunch of changes in the last three lessons, and so far the results have not been very satisfactory, but at each step we learnt something we can use to improve the system as a whole. First we expanded the enum to hold more information, then we itemized the list of symptoms, and then we asked for an explanation for each item. Based on all of this, we can now improve the system instruction. Earlier there was no system instruction at all; now this is the instruction I am providing.
(0:38 - 1:05)
I say you are a biomedical expert, and then I specifically state that the worst outcome is no, the next worst is unknown, and the best outcome is yes. I ask it to always choose the worst outcome for recovered_from_all_symptoms, and to apply the following rules in order, choosing the first one that applies, again going from worst to best. If the outcome of any of the symptoms is no, then recovered_from_all_symptoms is no, that is, not recovered.
(1:05 - 1:42)
If the outcome of any of the symptoms is recovering or resolving, then recovered_from_all_symptoms is no. If the outcome of any of the symptoms is unknown, then recovered_from_all_symptoms is unknown. And finally, only if the outcome of all the symptoms is recovered or recovered with treatment should recovered_from_all_symptoms be marked as yes. Let us go ahead and run this; it has already been saved down to the local machine, so now let us calculate the accuracy and filter for mismatch equals false.
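The exact wording of the instruction is not reproduced in this section; the sketch below paraphrases the rules described above and shows how a system instruction can be passed to the model (the instruction text is my paraphrase, not the course's exact prompt):

# Sketch: wiring a detailed system instruction into the model (wording is a paraphrase).
system_instruction = '''
You are a biomedical expert. Always choose the worst outcome for recovered_from_all_symptoms.
The worst outcome is No, the next worst is Unknown, and the best outcome is Yes.
Use the following rules in order and choose the first one that applies:
1. If the outcome of any symptom is No, recovered_from_all_symptoms is No.
2. If the outcome of any symptom is Recovering or Resolving, recovered_from_all_symptoms is No.
3. If the outcome of any symptom is Unknown, recovered_from_all_symptoms is Unknown.
4. Only if the outcome of all symptoms is Recovered or Recovered with treatment, mark it as Yes.
'''

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash-latest",
    system_instruction=system_instruction,  # the key addition in this lesson
)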
(1:42 - 2:04)
What you find here is a pretty big jump: it has now gone from about 63 percent to 82 percent. So 82 percent of the time the LLM is agreeing with whatever is in the CSV file. That is a significant boost in accuracy, and we were able to achieve it because of a very good system instruction.
(2:05 - 2:43)
After trying multiple steps, the one which really clicked everything into place was providing a system instruction which is very detailed and gives the LLM a step-by-step recipe to follow. The system instruction could only have been this detailed because we went through the last few steps. Without them we could have made some small improvements, but I do not think we would have had enough insight into what is going on to give such a clear and detailed system instruction.
(2:43 - 3:24)
This is why, when you are doing all these things, you also have to think about breaking the problem down into sub-problems, into smaller pieces. It is an iterative process, much like what our own brain applies when we try to do this kind of reasoning and inference. Think about how to translate that into steps which the machine, this large language model, can follow; doing it in that kind of step-by-step order will usually improve the accuracy, as you might have noticed.
Test “less powerful” LLM models
(0:00 - 2:09)
Throughout these lessons I have been using the Gemini Flash model. The interesting thing is that I initially started creating this chapter's lessons based on Gemini Pro. Why? Because Google claims that Gemini Pro is its most powerful model. Then I found that Gemini Pro is actually much less accurate than Gemini Flash for this task, and the funny part is that Flash is not only faster but also much cheaper.
The lesson is that sometimes it is a good idea to also check the less powerful model and see its results. Do it on a smallish data set and do not put too much time or effort into it, but it is always a good idea to check multiple models. What probably happened here is that Gemini Pro had the best results on a specific benchmark data set which may not have had anything to do with the task you have at hand.
For this example of extracting information from the biomedical data set called VAERS, we find that Gemini Flash does a much better job and Gemini Pro was not very useful. What I did is take the same code and simply change the model name to gemini-1.5-pro-latest. I run this, and of course the results have already been downloaded, and then I look at the accuracy. Filtering for mismatch equals false, you find that it is only 72 percent.
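For reference, the only change from the earlier listing is the model name; the call would look roughly like this (keeping the same system instruction is my assumption):

# Same setup as before, with only the model name swapped to the Pro variant
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro-latest",
    system_instruction=system_instruction,  # assumed to be the same instruction as the previous lesson
)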
For Gemini Flash it was 82 percent, so the accuracy is much lower than what we got with Flash. In fact, if you filter for mismatch equals true, you can see a case here where the LLM says recovered is no, and the explanation is: since the outcome of Syncope is recovered, Hypoaesthesia is recovered, and Pallor is recovered, all of them are recovered, the final outcome is no.
(2:10 - 2:27)
That does not make any sense. If everything is recovered, then the LLM's recovered value should have been yes, just like what you have in the CSV file. So it made a blunder here, and this is the kind of thing I found quite often with Gemini Pro.
(2:27 - 2:51)
Gemini Flash also makes these kinds of mistakes, but far less frequently; the errors are fewer with Flash, and I found that Gemini Pro was making more errors for this particular data set. This is why it makes sense to test the so-called less powerful models, because they might end up being more effective for your particular use case.