10 tips to improve the accuracy of LLM structured outputs

Here are some tips you can use to improve the accuracy of structured outputs from LLMs.

Note: you can use the Python instructor library to get structured outputs from nearly any Large Language Model.
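
As a minimal sketch, the setup looks roughly like this (here I assume an OpenAI-compatible client – instructor also supports other providers, and you can point base_url at whichever endpoint serves your model):

# Minimal sketch: instructor patches an OpenAI-compatible client so that
# chat.completions.create() can return a pydantic model via response_model.
import instructor
from openai import OpenAI

# base_url / api_key depend on which provider you are calling
client = instructor.from_openai(OpenAI())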

Use a more descriptive field name

You can improve the accuracy of structured information extraction by providing a more descriptive field name. For example, I was able to increase the accuracy of extracting the number of days to symptom onset from a VAERS writeup by changing the field name from “days_to_symptom_onset” to the more descriptive “days_to_earliest_symptom_onset”.

from pydantic import BaseModel, Field  # Field is used in the later snippets


class PatientInfo_v1(BaseModel):
    days_to_earliest_symptom_onset: int
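
As a rough sketch, an extraction call with this schema might look like the following (the client comes from the instructor setup above, and symptom_text stands in for the raw VAERS writeup text):

# Hypothetical extraction call – returns a validated PatientInfo_v1 instance
pinfo = client.chat.completions.create(
    model="grok-beta",  # or whichever model your client points at
    response_model=PatientInfo_v1,
    messages=[{"role": "user", "content": symptom_text}],
)
print(pinfo.days_to_earliest_symptom_onset)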

Use Field description

The instructor library is built on top of pydantic, which allows you to define a field description.

This can also help improve the accuracy of the LLM’s structured outputs.

class PatientInfo_v2(BaseModel):
    days_to_symptom_onset: int = Field(description="If there are multiple symptoms, choose the earliest one")

Add an _explanation field

You can add an _explanation field based on other field names, which also improves the accuracy of the LLM (probably because it is forced to provide a reason for its answer).

class PatientInfo_v3(BaseModel):
    days_to_symptom_onset: int = Field(description="If there are multiple symptoms, choose the earliest one")
    days_to_symptom_onset_explanation: str = Field(description="Explain why you chose the value for the corresponding field")

Get the matching sentence using an Explanation class

You can also define an Explanation class instead of a simple string data type, and ask the LLM to find the matching sentence from the input prompt. In addition to forcing the LLM to reason some more about the answer, this is also a handy debugging tool.

class Explanation(BaseModel):
    matching_sentence_if_any: str = Field(description='Please output the sentence which supports the explanation')
    explanation: str

class PatientInfo_v4(BaseModel):
    days_to_symptom_onset: int = Field(description="If there are multiple symptoms, choose the earliest one")
    days_to_symptom_onset_explanation: Explanation
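
For example, when a value looks wrong you can print the supporting sentence next to it and check it against the original writeup – a small sketch, assuming pinfo was extracted with PatientInfo_v4 using the same kind of call as before:

# Hypothetical debugging snippet: show the extracted value together with the
# sentence the LLM says it is based on.
print(pinfo.days_to_symptom_onset)
print(pinfo.days_to_symptom_onset_explanation.matching_sentence_if_any)
print(pinfo.days_to_symptom_onset_explanation.explanation)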

Get the matching phrase inside an Explanation class

If you have very long sentences (which are quite common in VAERS writeups), it also helps to ask the LLM for the specific matching phrase. Sometimes this does not make a difference to the accuracy per se, but it can still help you debug the response more quickly.

class Explanation(BaseModel):
    matching_sentence_if_any: str = Field(description='Please output the sentence which supports the explanation')
    matching_phrase_if_any: str = Field(description='Please output the specific phrase which supports the explanation')
    explanation: str

class PatientInfo_v5(BaseModel):
    days_to_symptom_onset: int = Field(description="If there are multiple symptoms, choose the earliest one")
    days_to_symptom_onset_explanation: Explanation

Prompt Engineering – update system instruction based on Explanations

Prompt engineering is of course quite well known.

But if you use my tips you will notice that getting a few explanations first makes it much easier to update the system instruction, because you can see how the LLM reasons about the information extraction. Unfortunately this is not guaranteed to fix the mistakes, but you will usually see a pretty good reduction in the number of errors.

pinfo = client.chat.completions.create(
    model="grok-beta",
    response_model=PatientInfo_v6,  # the latest revision of the schema (definition not shown)
    messages=[
        {
            "role": "system",
            "content": """
            For days_to_symptom_onset use the earliest date. Also include mild symptoms.
            If the month is provided but date is not mentioned, use the first date of the month
            as the date of symptom onset.
            """,
        },
        {"role": "user", "content": symptom_text},
    ],
)

Itemize composite fields into individual pieces of information

Sometimes you need to get a composite field like the RECOVD status in the VAERS report.

This is usually populated with the worst outcome among all the symptoms listed – Y for Yes, N for No and U for Unknown.

So getting this list of symptoms first can be helpful when debugging the value of the field.

Note: the rest of the examples are based on Gemini's API client, which does not use pydantic.

import typing
from enum import Enum


class RecoveryStatus(Enum):
    Yes = "Yes"
    No = "No"
    Recovering = "No"  # shares a value with No, so it acts as an alias for No
    Unknown = "Unknown"

class Symptom(typing.TypedDict):
    symptom_value: str
    symptom_recovery_status: RecoveryStatus

class PatientInfo_v1(typing.TypedDict):
    recovered_from_all_symptoms: RecoveryStatus
    symptom_list: list[Symptom]
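
As a sketch of how this schema might be passed to Gemini using the google-generativeai SDK (the model name is just an example, symptom_text is again the raw VAERS writeup, and I assume the API key has already been configured):

# Rough sketch: the TypedDict is passed as the response schema so that Gemini
# returns JSON matching PatientInfo_v1.
import google.generativeai as genai

# assumes genai.configure(api_key=...) has already been called
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    symptom_text,
    generation_config={
        "response_mime_type": "application/json",
        "response_schema": PatientInfo_v1,
    },
)
print(response.text)  # a JSON string matching the schema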

Get explanation based on itemized information

It follows that once you ask for an itemized list for these fields, you should also consider adding an explanation for each item (but remember that this can increase the number of output tokens quite substantially).

class RecoveryStatus(Enum):
    Yes = "Yes"
    No = "No"
    Recovering = "No"
    Unknown = "Unknown"

class Explanation(typing.TypedDict):
    matching_sentence_if_any: str
    explanation: str

class Symptom(typing.TypedDict):
    symptom_value: str
    symptom_recovery_status: RecoveryStatus
    symptom_recovery_status_explanation: Explanation

class PatientInfo_v2(typing.TypedDict):
    recovered_from_all_symptoms: RecoveryStatus
    recovered_from_all_symptoms_explanation: str
    symptom_list: list[Symptom]
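
Since the Gemini client returns the result as JSON text, a quick way to inspect the per-symptom explanations is something like this sketch (assuming response comes from a generate_content call like the one shown earlier):

# Hypothetical debugging loop: print each symptom next to the sentence that
# supports its recovery status.
import json

data = json.loads(response.text)
for symptom in data["symptom_list"]:
    print(symptom["symptom_value"], "->", symptom["symptom_recovery_status"])
    print("    ", symptom["symptom_recovery_status_explanation"]["matching_sentence_if_any"])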

Explicitly specify all choices inside Enums

You will usually use Enums to represent categorical data.

When you do this, consider adding all the different labels for the categorical variable – including words and phrases which seem very similar and obvious to you. What is obvious to you will usually not be obvious to the LLM.

class RecoveryStatus(Enum):
    Yes = "Yes"
    Recovered = "Yes"
    Recovered_With_Treatment = "Yes"
    No = "No"
    Recovering = "No"
    Resolving = "No"
    Recovering_With_Treatment = "No"
    Unknown = "Unknown"

class Explanation(typing.TypedDict):
    matching_sentence_if_any: str
    explanation: str

class Symptom(typing.TypedDict):
    symptom_value: str
    symptom_recovery_status: RecoveryStatus
    symptom_recovery_status_explanation: Explanation

class PatientInfo_v3(typing.TypedDict):
    recovered_from_all_symptoms: RecoveryStatus
    recovered_from_all_symptoms_explanation: Explanation
    symptom_list: list[Symptom]

Test “less powerful” models

One of the most interesting things I noticed while creating my course was that Gemini Flash is both faster and more accurate than Gemini Pro even though it is about 20X cheaper!

So I recommend testing the “less powerful” models on the task at hand – sometimes you might be surprised by the results.

Prefer Enums to string values

Sometimes you might choose to use string values instead of Enums so you can more faithfully capture the information in the input prompt.

I think it is usually better to prefer Enums to string values (as long as you are actually dealing with a categorical variable) because it will provide a lot of downstream benefits.

Here is an example of the full list of outcome values which Gemini extracted (as you can see, this also includes some clearly wrong inferences, such as the date values):

recovered
improved
unspecified
Unknown
Fatal
Recovering
Recovered with sequelae
recovering
not recovered
resolving
03Feb2022 (323 days after vaccination)
resolved
14Feb2022 (334 days after vaccination)
recovered with sequelae
ongoing
unknown
Not recovered
Recovered

Converting all of these to Enums instead helps you avoid complex string matching, as well as having to handle all kinds of corner cases that you never anticipated.

This means the LLM’s answer may sometimes not map cleanly to one of the Enum values – if this happens too often, it is a sign that you need a better understanding of your input data, not a sign that you should switch to using a string.
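
To make the downstream benefit concrete, here is a small sketch (the patients list is hypothetical – think of it as the parsed JSON results for a batch of VAERS writeups): with the Enum, aggregation is an exact comparison, and a value that does not map raises an error you can investigate rather than silently slipping through a string match.

# Hypothetical downstream aggregation – no need to normalise "recovered",
# "Recovered", "resolved", "recovered with sequelae", and so on.
recovered = [
    p for p in patients
    if RecoveryStatus(p["recovered_from_all_symptoms"]) is RecoveryStatus.Yes
]
print(len(recovered))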

You can check out my Udemy course where I go over each of these tips in detail.
