Advanced structured output - use case: accident research

Hello,
I have carried out a somewhat larger evaluation. I tried 22 accident reports with 4 prompt engineering methods and 2 function calling methods. In each case the schema was generated via Pydantic. Each accident report was analysed 5 times per method with the same experimental conditions (seed, model, temp, etc.).
Basically, I found that function calling delivered significantly better results. For the implementation, I followed the tips and documentation from Jason Liu, the author of the “Instructor” package. You can also learn a lot from his free WandDB course.

One interesting observation during my tests was that function calling generally had high precision. Meaning: entities that were extracted were usually correct. However, the recall was only around 40-50%. In other words, only 40-50% of what could theoretically have been found was actually found by the model.
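To make the precision/recall distinction concrete, here is a minimal scoring sketch (the entity names and counts are made up for illustration): precision compares extractions against themselves, recall against a hand-labelled gold set.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Score extracted entities against a hand-labelled gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical numbers: 4 of 5 extractions are correct (high precision),
# but only 4 of 10 gold entities were found (low recall).
gold = {f"entity_{i}" for i in range(10)}
predicted = {f"entity_{i}" for i in range(4)} | {"spurious"}
p, r = precision_recall(predicted, gold)  # p = 0.8, r = 0.4
```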

My assumption was that the schema is simply too large (my schema had 55 keys and over 170 possible values). Therefore, in my second approach, I did not create one function for the whole schema, but split it into 5 “subfunctions”. For example, one function only queried key values related to the weather, another only queried pilot-specific entities, and so on.
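The split might look like this sketch; the field names here are hypothetical stand-ins, not my actual schema. Each topic becomes its own small Pydantic model, and each model is registered as its own function/tool:

```python
from typing import Optional
from pydantic import BaseModel, Field

# Hypothetical sub-schemas: instead of one model with 55 keys,
# each sub-model covers one topic and becomes its own function call.
class WeatherInfo(BaseModel):
    wind_speed_kt: Optional[float] = Field(
        default=None, description="Wind speed in knots at the time of the accident"
    )
    visibility_km: Optional[float] = Field(
        default=None, description="Visibility in kilometres"
    )

class PilotInfo(BaseModel):
    total_flight_hours: Optional[float] = Field(
        default=None, description="Pilot's total flight experience in hours"
    )

SUB_SCHEMAS = [WeatherInfo, PilotInfo]  # ... plus the three other topic models
```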

At first I wanted to do this only with the tools parameter of OpenAI, but in my tests it didn’t work well when I gave GPT-3.5 or GPT-4 multiple functions within one call. Therefore I used Python’s “asyncio” module. This allowed me to execute 5 API calls at the same time (I think this also works with batches, but for me asyncio was the easier solution).
So I executed a separate API call for each of the 5 functions and merged the JSON strings that came back.
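A minimal sketch of that fan-out/merge pattern, assuming one extractor per sub-schema (the extractors below are stubs standing in for the real OpenAI function-calling requests):

```python
import asyncio

# Hypothetical stubs: in the real pipeline each of these would be an
# async OpenAI chat-completion call with one sub-schema as the function.
async def extract_weather(report: str) -> dict:
    return {"wind_speed_kt": 15, "visibility_km": 10}

async def extract_pilot(report: str) -> dict:
    return {"pilot_hours_total": 350}

async def extract_all(report: str) -> dict:
    # Fire the sub-schema calls concurrently, then merge the partial dicts.
    partials = await asyncio.gather(
        extract_weather(report),
        extract_pilot(report),
        # ... the three other sub-schema extractors would go here
    )
    merged: dict = {}
    for part in partials:
        merged.update(part)
    return merged

result = asyncio.run(extract_all("Some accident report text"))
```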
The recall, especially for GPT-4, increased with this approach to over 60% with a precision of 92%. This was the best result of my evaluation.

If I were to improve this further now, I would probably only do function calling. Pydantic and Instructor are great in this context! I defined my Pydantic class in this style:

from typing import Literal, Optional
from pydantic import BaseModel, Field

class extract_accident_info_literals(BaseModel):
    report_as: Optional[
        Literal[
            "pilot", "flight_school_flight_instructor", "witness",
            "authority", "passenger", "other", "unknown",
        ]
    ] = Field(
        default=None,
        description="Who is reporting the incident?, e.g. I was doing .... = report_as: pilot",
    )

    country: Optional[str] = Field(
        default=None,
        description="Only country code, e.g. Chile = CL",
    )

    ....

So I used Literals and Pydantic’s Field values. I think the description in particular is incredibly important for the LLM.
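One way to see why the descriptions matter: with Pydantic v2 they are carried straight into the JSON schema that function calling sends to the model. A small sketch (field names simplified from the class above):

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class AccidentInfo(BaseModel):
    report_as: Optional[Literal["pilot", "witness", "other", "unknown"]] = Field(
        default=None,
        description="Who is reporting the incident?",
    )
    country: Optional[str] = Field(
        default=None,
        description="Only country code, e.g. Chile = CL",
    )

# The Field descriptions end up in the generated JSON schema, so the
# model sees them as part of the function definition.
schema = AccidentInfo.model_json_schema()
desc = schema["properties"]["country"]["description"]
```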

Hope this helps :wink:
