Advanced Structured Output - Use case: accident research

Hey everyone!

I already made an initial post to my problem here. I am looking for improvements for my ChatGPT based application. The application receives an accident report and a taxonomy as input.
The peculiarity compared to pure text classification / information extraction is that the taxonomy makes restrictions on the values that ChatGPT may return.
Simplified example:

```python
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": f"{instruction}{taxonomy}"},
        {"role": "user", "content": f"{report}"},
    ],
)

instruction = ("You are a paraglider safety expert. "
               "You want to classify accident reports. "
               "Respond only in JSON format. Only output attributes that are known. "
               "Use only one attribute per key. "
               "To classify you may only use the attributes provided in this taxonomy:\n")
```

The taxonomy has the following shape:

```
  "report_as": [...],
  "flight_type": [...],
  "age": "number"
```

This is only an excerpt. The taxonomy I am using holds 48 elements (1740 tokens).

For data protection reasons I cannot post a real accident report here in the forum, but you can imagine it as a natural language text between 50-1200 words.
My Approach:
This is obviously a difficult task for an LLM. It has to extract information, compare it with the taxonomy, form valid JSON, and do all this for a relatively large schema.
My first approach was to integrate everything into a single prompt as shown above and use JSON mode and the large context window of gpt-4-turbo-preview and gpt-3.5-turbo-1106.
This led to convincing initial results: the format is correct in almost all cases and the model hallucinates very little. However, two issues remain:

  1. Determinism
    The model output is not uniform: if I run 5-10 repetitions per accident report, the results sometimes differ by up to 4 elements found or not found.
    I have already read a lot about this in the forum (e.g. 1, 2, 3) and think I will have to accept it, despite using a seed, checking the system fingerprint, and setting temperature close to or equal to 0.
  2. Recall
    Unfortunately, the model finds too few elements. Especially something like “report_as” is often only given indirectly: for example, a report written in the first-person perspective makes it clear that the pilot is also the author of the report, but this is often not clear to the model. This is what I would like to improve.
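To put a number on the determinism issue, a small helper like the following (my own sketch, not part of the original pipeline) can compare repeated runs of the same report and list the taxonomy keys that appear in some runs but not all:

```python
def diff_keys(outputs):
    """Return the set of keys that are not present in every run.

    `outputs` is a list of dicts parsed from the model's JSON responses,
    one per repetition of the same accident report.
    """
    all_keys = set().union(*(o.keys() for o in outputs))
    common_keys = set(outputs[0]).intersection(*(o.keys() for o in outputs[1:]))
    return all_keys - common_keys

# Hypothetical example: three repetitions of the same report
runs = [
    {"report_as": "pilot", "age": 34},
    {"report_as": "pilot", "age": 34, "flight_type": "cross_country"},
    {"report_as": "pilot"},
]
unstable = diff_keys(runs)  # {"age", "flight_type"}
```

Tracking this set across temperature/seed settings makes it easier to see whether a change actually reduces the run-to-run variance.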

What I have tried so far:

  1. langchain
  • chain = create_tagging_chain(schema, llm)

  • chain = create_extraction_chain(schema, llm)

I tried both methods with different schema representations: I represented the taxonomy as a JSON Schema with and without annotations, and as a Pydantic object.
Furthermore, I tried different strategies from this YouTube tutorial; you can test them for yourself in this Colab.

  2. Function Calling

Following this article, I tried using one general information_extraction function (which essentially is the whole taxonomy), and I tried running multiple functions with different subsections of the taxonomy (one function for weather_related attributes, one for pilot_attributes, etc.).
Side note: I am aware that functions are deprecated and have been replaced by tools; I adapted this when experimenting with function calls.
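For readers unfamiliar with the setup: a taxonomy section can be mapped mechanically onto a tool definition, so the allowed values become JSON Schema enums the model is constrained to. The helper below is my own sketch (the function name and the assumption that the taxonomy maps keys to either an enum list or a type name are mine); the `tools` payload shape is the standard Chat Completions one:

```python
def taxonomy_to_tool(name, taxonomy_section):
    """Build a Chat Completions `tools` entry from one taxonomy section.

    Assumes each taxonomy key maps either to a list of allowed values
    (-> JSON Schema enum) or to a type name such as "number".
    """
    properties = {}
    for key, spec in taxonomy_section.items():
        if isinstance(spec, list):
            properties[key] = {"type": "string", "enum": spec}
        else:
            properties[key] = {"type": spec}
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Extract {name} attributes from an accident report.",
            "parameters": {"type": "object", "properties": properties},
        },
    }

# Hypothetical sub-section of the taxonomy
tool = taxonomy_to_tool(
    "pilot_attributes",
    {"report_as": ["pilot", "flight_school_flight_instructor"], "age": "number"},
)
```

The resulting dict can be passed as one entry of the `tools` list in `client.chat.completions.create(...)`.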

  3. Hyperparameter tuning

I experimented with different temperature and top_p values as suggested here.

  4. Prompt engineering

Obviously I also experimented with different formulations and chain of thought. Few-shot is hardly an option because the reports are very different and my API calls are already very large :wink:

  5. Multi-Prompting

Currently I am trying to split my taxonomy into different sections and write a specialized prompt for each. Similar to multiple function calling; in one API call the instruction is

```python
instruction = ("You are a renowned paragliding safety expert. "
               "You must search an accident report for information about the harness. "
               ......)

```

As the taxonomy, I only provide the elements regarding the harness.
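The multi-prompting idea can be sketched as follows; both helper names are hypothetical, and the per-section instruction just follows the harness pattern above. After the specialized calls return, the partial JSON results are merged into one record:

```python
def build_section_prompt(section_name, section_taxonomy):
    """Hypothetical per-section instruction, mirroring the harness example."""
    return (
        "You are a renowned paragliding safety expert. "
        f"You must search an accident report for information about the {section_name}. "
        "Respond only in JSON format. "
        f"Use only the attributes provided in this taxonomy:\n{section_taxonomy}"
    )

def merge_sections(partial_results):
    """Merge the JSON outputs of the per-section calls into one record."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged

# Hypothetical partial results from two specialized calls
merged = merge_sections([
    {"harness_type": "pod"},
    {"report_as": "pilot", "age": 34},
])
```

One caveat of this design: if two sections ever share a key, the later call silently wins, so disjoint sections are important.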

Interestingly, all previous experiments have led to similar or worse results.
I am surprised by this and wonder whether there is a more promising method for my problem?
My main finding while experimenting is that these frameworks and methods scale poorly. The published examples are usually much more rudimentary and open-ended than what I am asking of the model here, and when I apply these methods to my problem they simply don't work as well, presumably due to the size of the taxonomy and the complexity of the texts.

Long story short:
I am looking for a method to improve my initial prompt or the way I frame the task for the LLM. I have given an overview of what I have tried and hope one of the readers here has an idea for me.


Hi @LeFlob - to me this looks like a good candidate for a fine-tuned gpt-3.5-turbo model.

Given you’ve already achieved some promising results just by integrating everything into a prompt, fine-tuning should allow you to address the issues that have surfaced.

Your training examples would consist of your existing system prompt incl. the taxonomy, the existing user message (i.e. the report) and then your desired output in JSON format.

You could give it a try with maybe just 20-30 examples to see if it could work. Make sure to include those cases where you’ve previously experienced issues. If it works, you can subsequently expand your training data set for even more refined results.
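To make the training-example structure concrete: the fine-tuning file is JSONL in the chat format, one `{"messages": [...]}` object per line. The helper name and the placeholder strings below are mine; the three roles match the description above (system prompt incl. taxonomy, report as user message, hand-labelled classification as assistant message):

```python
import json

def make_training_example(system_prompt, report, desired_json):
    """One line of the fine-tuning JSONL file."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": report},
            {"role": "assistant", "content": json.dumps(desired_json)},
        ]
    }

# Hypothetical example (placeholder strings, not a real report)
example = make_training_example(
    "You are a paraglider safety expert. ...",   # instruction + taxonomy
    "During a cross-country flight I ...",       # accident report
    {"report_as": "pilot", "flight_type": "cross_country"},
)

# Each training example is written as one JSON object per line:
line = json.dumps(example)
```

Writing 20-30 such lines to a `.jsonl` file gives you the upload for a first fine-tuning job.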


Interesting! I rejected the idea of fine-tuning at the beginning, as the effort involved seemed disproportionately high to me. But I would like to try it out. Do you happen to have a guide on how to do it? I'm not really familiar with it.


Sure. There’s a couple of resources available:

As said, you don’t want to just rush into creating a huge dataset. I’ve found that you can often test the hypothesis of whether a task is suitable for fine-tuning with as few as 20-30 examples (the minimum is 10).


Super interesting use-case, good luck!
Would love to followup and hear how this ends up going!

Here’s my 2 cents:

I’d try adding a short description for each expected parameter, along with the schema, in a similar way to how function calling parameters are described. something like:

```
"report_as": enum of ["pilot", "flight_school_flight_instructor", ...]
    The role of the person reporting.

"age": number
    The age of ...
```

(sorry if my descriptions are inaccurate)

On top of that, you can later identify parameters that are frequently hallucinated and address them specifically in their descriptions.
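Finding those hallucinated parameters can be automated by validating each model output against the taxonomy. This is my own sketch (hypothetical helper name; it assumes the taxonomy maps keys to enum lists or type names):

```python
def find_invalid_values(output, taxonomy):
    """Flag output entries that the taxonomy does not allow."""
    problems = {}
    for key, value in output.items():
        spec = taxonomy.get(key)
        if spec is None:
            problems[key] = "unknown key"
        elif isinstance(spec, list) and value not in spec:
            problems[key] = f"value {value!r} not in enum"
    return problems

# Hypothetical taxonomy excerpt and model output
taxonomy = {"report_as": ["pilot", "flight_school_flight_instructor"], "age": "number"}
issues = find_invalid_values({"report_as": "witness", "gender": "m"}, taxonomy)
# flags "report_as" (not in the enum) and "gender" (unknown key)
```

Running this over a batch of responses quickly shows which parameters need a sharper description.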

I’d also try function (tool) calls again to see if they happen to work better.

Last, I’d suggest trying out Promptotype, a platform built for the specific use case of structured prompts. It provides many related features, including batch-testing a set of queries against their expected responses (full disclosure: I’m the creator of it :))

Good luck!


I'll try my best to keep you guys updated. I spent the last 3 hours building a training data set, and my first fine-tuning job has now started.
Will test it tomorrow.

I also checked out your site; let's see if I can improve my use case!

Thanks for your reply