Advanced structured output - use case: accident research

Hello,
I have carried out a somewhat larger evaluation. I tried 22 accident reports with 4 prompt engineering methods and 2 function calling methods. In each case the schema was generated via Pydantic. Each accident report was analysed 5 times per method with the same experimental conditions (seed, model, temp, etc.).
Basically, I found that function calling delivered significantly better results. For the implementation, I followed the tips and documentation from Jason Liu, the author of the “Instructor” package. You can also learn a lot from his free WandDB course.

One interesting observation during my tests was that function calling generally had high precision. Meaning: entities that were extracted were usually correct. However, the recall was only around 40-50%. In other words, only 40-50% of what could theoretically have been found was actually found by the model.
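To make the precision/recall distinction concrete, here is a minimal scoring sketch (the entity names and counts are made up for illustration): precision compares extractions against themselves, recall against a hand-labelled gold set.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Score extracted entities against a hand-labelled gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical numbers: 4 of 5 extractions are correct (high precision),
# but only 4 of 10 gold entities were found (low recall).
gold = {f"entity_{i}" for i in range(10)}
predicted = {f"entity_{i}" for i in range(4)} | {"spurious"}
p, r = precision_recall(predicted, gold)  # p = 0.8, r = 0.4
```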

My assumption was that the schema is simply too large (my schema had 55 keys and over 170 possible values). Therefore, in my second approach, I did not create one function for the whole schema, but split it into 5 “subfunctions”. For example, one function only queried key values related to the weather, another only queried pilot-specific entities, and so on.
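The split might look like this sketch; the field names here are hypothetical stand-ins, not my actual schema. Each topic becomes its own small Pydantic model, and each model is registered as its own function/tool:

```python
from typing import Optional
from pydantic import BaseModel, Field

# Hypothetical sub-schemas: instead of one model with 55 keys,
# each sub-model covers one topic and becomes its own function call.
class WeatherInfo(BaseModel):
    wind_speed_kt: Optional[float] = Field(
        default=None, description="Wind speed in knots at the time of the accident"
    )
    visibility_km: Optional[float] = Field(
        default=None, description="Visibility in kilometres"
    )

class PilotInfo(BaseModel):
    total_flight_hours: Optional[float] = Field(
        default=None, description="Pilot's total flight experience in hours"
    )

SUB_SCHEMAS = [WeatherInfo, PilotInfo]  # ... plus the three other topic models
```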

At first I wanted to do this only with the tools parameter of OpenAI, but in my tests it didn’t work well when I gave GPT-3.5 or GPT-4 multiple functions within one call. Therefore I used Python’s “asyncio” module. This allowed me to execute 5 API calls at the same time (I think this also works with batches, but for me asyncio was the easier solution).
So I executed a separate API call for each of the 5 functions and merged the JSON strings that came back.
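A minimal sketch of that fan-out/merge pattern, assuming one extractor per sub-schema (the extractors below are stubs standing in for the real OpenAI function-calling requests):

```python
import asyncio

# Hypothetical stubs: in the real pipeline each of these would be an
# async OpenAI chat-completion call with one sub-schema as the function.
async def extract_weather(report: str) -> dict:
    return {"wind_speed_kt": 15, "visibility_km": 10}

async def extract_pilot(report: str) -> dict:
    return {"pilot_hours_total": 350}

async def extract_all(report: str) -> dict:
    # Fire the sub-schema calls concurrently, then merge the partial dicts.
    partials = await asyncio.gather(
        extract_weather(report),
        extract_pilot(report),
        # ... the three other sub-schema extractors would go here
    )
    merged: dict = {}
    for part in partials:
        merged.update(part)
    return merged

result = asyncio.run(extract_all("Some accident report text"))
```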
The recall, especially for GPT-4, increased with this approach to over 60% with a precision of 92%. This was the best result of my evaluation.

If I were to improve this further now, I would probably only do function calling. Pydantic and Instructor are great in this context! I defined my Pydantic class in this style:

from typing import Literal, Optional
from pydantic import BaseModel, Field

class extract_accident_info_literals(BaseModel):
    report_as: Optional[
        Literal[
            "pilot", "flight_school_flight_instructor", "witness",
            "authority", "passenger", "other", "unknown",
        ]
    ] = Field(
        default=None,
        description="Who is reporting the incident?, e.g. I was doing .... = report_as: pilot",
    )

    country: Optional[str] = Field(
        default=None,
        description="Only country code, e.g. Chile = CL",
    )

    ....

So I used Literals and Pydantic’s Field values. I think the description in particular is incredibly important for the LLM.
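One way to see why the descriptions matter: with Pydantic v2 they are carried straight into the JSON schema that function calling sends to the model. A small sketch (field names simplified from the class above):

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class AccidentInfo(BaseModel):
    report_as: Optional[Literal["pilot", "witness", "other", "unknown"]] = Field(
        default=None,
        description="Who is reporting the incident?",
    )
    country: Optional[str] = Field(
        default=None,
        description="Only country code, e.g. Chile = CL",
    )

# The Field descriptions end up in the generated JSON schema, so the
# model sees them as part of the function definition.
schema = AccidentInfo.model_json_schema()
desc = schema["properties"]["country"]["description"]
```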

Hope this helps :wink:
