Advanced structured output - Use case: accident research

Hey everyone!

I already made an initial post about my problem here. I am looking for improvements for my ChatGPT-based application. The application receives an accident report and a taxonomy as input.
The peculiarity compared to pure text classification / information extraction is that the taxonomy restricts the values that ChatGPT may return.
Simplified example:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": f"{instruction}{taxonomy}"  # instruction and taxonomy are shown below
        },
        {
            "role": "user",
            "content": f"{report}"  # the accident report text
        }
    ],
    seed=42,
    temperature=0,
    max_tokens=4095,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
instruction = ("You are a paraglider safety expert. "
                   "You want to classify accident reports. "
                   "Respond only in JSON format. Only Output attributes that are known."
                   "Use only one attribute per key. "
                   "To classify you may only use the attributes provided in this taxonomy: \n")
{
  "report_as": [
    "pilot",
    "flight_school_flight_instructor",
    "other",
    "authority",
    "witness",
    "passenger",
    "unknown"
  ],
  "flight_type": [
    "cross_country_flight",
    "local_flight",
    "training_flight",
    "assisted_flying_flight_travel_training",
    "competition_flight",
    "passenger_flight",
    "safety_training_flight",
    "unknown"
  ],
  "age": "number"
...

This is only an excerpt; the full taxonomy I am using holds 48 elements (1,740 tokens).

For data protection reasons I cannot post a real accident report here in the forum, but you can imagine it as a natural-language text of between 50 and 1,200 words.
My Approach:
This is obviously a difficult task for an LLM. It has to extract information, compare it with the taxonomy, form valid JSON, and do all of this for a relatively large schema.
My first approach was to integrate everything into a single prompt as shown above and use JSON mode and the large input context of gpt-4-turbo-preview and gpt-3.5-turbo-1106.
This led to convincing initial results. The format is correct in almost all cases and the model hallucinates very little.
Problems:

  1. Determinism
    The model output is not uniform: if, for example, I run 5-10 repetitions per accident report, the results sometimes differ by up to 4 elements being found or not found.
    I have already read a lot about this in the forum (e.g. 1, 2, 3) and think that I will have to accept this despite a fixed seed, the system fingerprint, and a temperature close to or equal to 0.
  2. Recall
    Unfortunately, the model finds too few elements. Something like “report_as” in particular is often only given indirectly: for example, a report written in the first-person perspective makes it clear that the pilot is also the author of the report, but this is often not clear to the model. I would like to improve this.

What I have tried so far:

  1. LangChain
  • chain = create_tagging_chain(schema, llm)

  • chain = create_extraction_chain(schema, llm)

I tried both methods with different schema representations: I represented the taxonomy as a JSON Schema with and without annotations, and as a Pydantic object.
Furthermore, I tried different strategies from this YouTube tutorial; you can test it for yourself in this Colab.

  2. Function calling

Following this article, I tried using one general information_extraction function (which is essentially the whole taxonomy), and I tried running multiple functions with different subsections of the taxonomy (one function for weather_related attributes, one for pilot_attributes, etc.); a sketch of this per-subsection, tools-based variant follows after this list.
Side note: I am aware that functions are deprecated and have been replaced by tools. I adapted this when experimenting with function calls.

  3. Hyperparameter tuning

I experimented with different temperatures and top_p values, as suggested here.

  4. Prompt engineering

Obviously, I also experimented with different formulations and chain-of-thought. Few-shot is hardly an option because the reports are very different and my API calls are already very large :wink:

  5. Multi-prompting

Currently I am trying to split my taxonomy into different sections and write a specialized prompt for each of them. Similar to multiple function calling, the instruction in one API call is, for example:

instruction = ("You are a renowned paragliding safety expert. "
               "You must search an accident report for information about the harness. "
               "...")

As the taxonomy, I only provide the elements regarding the harness.
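
For reference, a minimal sketch of the per-subsection, tools-based variant mentioned under "Function calling" above (the tool name, properties, enum values, and system message are illustrative placeholders, not my real taxonomy):

import json

from openai import OpenAI

client = OpenAI()

# Illustrative tool for one taxonomy subsection (placeholder schema).
harness_tool = {
    "type": "function",
    "function": {
        "name": "extract_harness_info",
        "description": "Extract harness-related attributes from a paragliding accident report.",
        "parameters": {
            "type": "object",
            "properties": {
                "harness_type": {
                    "type": "string",
                    "enum": ["seat_harness", "pod_harness", "reversible_harness", "unknown"],
                }
            },
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": instruction},  # the harness-specific instruction above
        {"role": "user", "content": report},
    ],
    tools=[harness_tool],
    # Force the model to call this tool so it always returns structured arguments.
    tool_choice={"type": "function", "function": {"name": "extract_harness_info"}},
)

harness_info = json.loads(response.choices[0].message.tool_calls[0].function.arguments)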

Results
Interestingly, all previous experiments have led to similar or worse results.
I am surprised by this and wonder whether there is a more promising method for my problem.
My main finding while experimenting is that the different frameworks and methods scale poorly: their examples are often much more rudimentary and less constrained than what I am asking of the model here. When I then apply these methods to my problem, I have the feeling that they simply don't work as well due to the size of the taxonomy and the complexity of the texts.

Long story short:
I am looking for a method to improve my initial prompt or the way I pose the task to the LLM. I have given an overview of what I have tried and hope one of the readers here might have an idea for me.


Hi @LeFlob - to me this looks like it might be a good candidate for a fine-tuned gpt-3.5-turbo model.

Given that you've already achieved some promising results just by integrating everything into a prompt, fine-tuning should allow you to address the issues that have surfaced.

Your training examples would consist of your existing system prompt incl. the taxonomy, the existing user message (i.e. the report), and then your desired output in JSON format.
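
For illustration, one training example in the chat fine-tuning format could look roughly like this (the assistant content is a made-up placeholder; your real system prompt would contain the full taxonomy):

import json

training_example = {
    "messages": [
        {"role": "system", "content": f"{instruction}{taxonomy}"},
        {"role": "user", "content": report},
        # Your manually verified classification for this report (placeholder values).
        {"role": "assistant", "content": json.dumps({
            "report_as": "pilot",
            "flight_type": "cross_country_flight",
            "age": 34
        })},
    ]
}

# The fine-tuning API expects one JSON object per line (JSONL).
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(training_example) + "\n")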

You could give it a try with maybe just 20-30 examples to see if it could work. Make sure to include those cases where you’ve previously experienced issues. If it works, you can subsequently expand your training data set for even more refined results.


Interesting! I rejected the idea of fine-tuning at the beginning, as the effort involved seemed disproportionately high to me. But I would like to try it out. Do you happen to have a guide on how to do it? I'm not really familiar with it.


Sure. There are a couple of resources available:

https://platform.openai.com/docs/guides/fine-tuning

https://platform.openai.com/docs/api-reference/fine-tuning

As said, you don't want to just rush into creating a huge dataset. I've found that you can often test the hypothesis of whether a task is suitable for fine-tuning with as few as 20-30 examples (the minimum is 10 examples).
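
If you want to kick off a job programmatically, a minimal sketch with the Python SDK looks roughly like this (the file name and base model are placeholders):

from openai import OpenAI

client = OpenAI()

# Upload the JSONL training file.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job on the uploaded file.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-1106",
)

# Poll the job; once finished, job.fine_tuned_model holds the new model name.
print(client.fine_tuning.jobs.retrieve(job.id).status)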


I'll try my best to keep you guys updated. I spent the last 3 hours putting together a training data set, and my first fine-tuning job has just started.
Will test that tomorrow.

I also checked out your site; let's see if I can improve my use case!

Thanks for your reply.


Hello,
I'm very interested in your use case. Did you find out what worked best for forcing the LLM to answer in the given format? I'm currently using LangChain's with_structured_output with a Pydantic v2 schema.
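
Roughly what my setup looks like (a minimal sketch assuming the langchain-openai package; "AccidentInfo" is a stand-in name for my actual Pydantic v2 schema):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
structured_llm = llm.with_structured_output(AccidentInfo)
result = structured_llm.invoke(report)  # returns an AccidentInfo instance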

Hello,
I have carried out a somewhat larger evaluation. I tried 22 accident reports with 4 prompt engineering methods and 2 function calling methods. In each case the schema was generated via Pydantic. Each accident report was analysed 5 times per method with the same experimental conditions (seed, model, temp, etc.).
Basically, I found that function calling delivered significantly better results. For the implementation, I followed the tips and documentation from Jason Liu, the author of the “Instructor” package. You can also learn a lot from his free Weights & Biases (W&B) course.

One interesting observation during my tests was that function calling generally had high precision, meaning that the entities which were extracted were usually correct. However, the recall was only around 40-50%; in other words, the model found only about half of what could theoretically have been found.

My assumption was that the schema is simply too large (my schema had 55 keys and over 170 values). Therefore, in my second approach, I did not create one function for the whole schema, but split the schema into 5 “subfunctions”. For example, one function only queried key values related to the weather, another only queried pilot-specific entities, and so on.

At first I wanted to do this only with the tools parameter of the OpenAI API, but in my tests it didn't work well when I gave gpt-3.5 or gpt-4 multiple functions within one call. Therefore I used the “asyncio” library, which allowed me to execute the 5 API calls concurrently (I think this also works with batching, but for me asyncio was the easier solution).
So I executed a separate API call for each of the 5 functions and merged the JSON strings that came back; a rough sketch of this follows below.
The recall, especially for GPT-4, increased with this approach to over 60%, with a precision of 92%. This was the best result of my evaluation.
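
A rough sketch of this pattern (assuming the OpenAI Python SDK >= 1.x; the subsection tools and the merge step here are illustrative placeholders rather than my original code):

import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()


async def extract_subsection(tool: dict, report: str) -> dict:
    # One call per taxonomy subsection, forcing the model to use that subsection's tool.
    response = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "Extract only the attributes defined in the tool."},
            {"role": "user", "content": report},
        ],
        tools=[tool],
        tool_choice={"type": "function", "function": {"name": tool["function"]["name"]}},
    )
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)


async def extract_all(subsection_tools: list, report: str) -> dict:
    # Run all subsection calls concurrently and merge the partial JSON objects.
    partials = await asyncio.gather(
        *(extract_subsection(tool, report) for tool in subsection_tools)
    )
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


# merged_result = asyncio.run(extract_all(my_five_subsection_tools, report))  # hypothetical list of 5 tool definitions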

If I were to improve this further now, I would probably only do function calling. Pydantic and Instructor are great in this context! I defined my Pydantic class in this style:

from typing import Literal, Optional
from pydantic import BaseModel, Field


class extract_accident_info_literals(BaseModel):
    report_as: Optional[
        Literal["pilot", "flight_school_flight_instructor", "witness", "authority", "passenger", "other", "unknown"]
    ] = Field(default=None, description="Who is reporting the incident?, e.g. I was doing ... = report_as: pilot")

    country: Optional[str] = Field(default=None, description="Only country code, e.g. Chile = CL")

    ....

So I used Literals and Pydantic's Field definitions. I think the description in particular is incredibly important for the LLM.
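
For illustration, a minimal sketch of how such a model can be plugged into Instructor (assuming instructor >= 1.x and the OpenAI Python SDK >= 1.x; the system prompt here is a placeholder):

import instructor
from openai import OpenAI

# Patching the client lets you pass a response_model and get back a validated Pydantic object.
client = instructor.from_openai(OpenAI())

info = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    response_model=extract_accident_info_literals,
    messages=[
        {"role": "system", "content": "Extract the accident attributes defined in the schema."},
        {"role": "user", "content": report},
    ],
)
print(info.model_dump_json(indent=2))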

Hope this helps :wink:


Be careful setting defaults, because this gives the model an easy way out and can make it lazy on the data extraction. It's best to make it explicitly state the value, even if it is unknown. It's also helpful to know that most, if not all, of the frameworks that use Pydantic dump the schema directly from it, and while some remove the redundant title, none of them (that I have seen) resolve $ref and $defs, anyOf, or allOf. This can be problematic since the models are trained on flattened schemas. In other words, BaseModel.model_json_schema does not always produce schemas in the way they are represented in the training datasets.

To overcome these limitations, you can subclass GenerateJsonSchema and override the generate method.

from typing import List

from pydantic.json_schema import GenerateJsonSchema


class ParamsSchemaGenerator(GenerateJsonSchema):
    """
    Custom schema generator that:
        1. Inlines all references for LLM tools (resolves refs/defs).
        2. Reorders keys for LLM optimization.
        3. Removes redundant titles from the schema for token savings.

    Attributes:
        key_order (List[str]): Order of keys for optimization.
        is_reordered_keys (bool): Flag to enable key reordering.
        is_removed_titles (bool): Flag to enable title removal.

    Methods:
        generate(schema, mode="serialization"): Generate the optimized JSON schema.
        _inline_references(schema, definitions): Inline references in the schema.
        _reorder_keys(schema): Reorder keys in the schema.
        _remove_titles(schema): Remove titles from the schema.
    """

    key_order: List[str] = [
        "name",
        "title",
        "type",
        "format",
        "enum",
        "description",
        "properties",
        "required",
        "items",
    ]
    is_reordered_keys: bool = True
    is_removed_titles: bool = True
    _metadata = {}

    @classmethod
    def _add_metadata(cls, **metadata):
        cls._metadata.update(metadata)

    def generate(self, schema, mode="serialization"):
        json_schema = super().generate(schema, mode)
        if "title" in json_schema:
            # json_schema["__name"] = json_schema.pop("title")
            self._add_metadata(name=json_schema.pop("title"))
        if "description" in json_schema:
            # json_schema["__description"] = json_schema.pop("description")
            self._add_metadata(description=json_schema.pop("description"))
        if "$defs" in json_schema:
            definitions = json_schema.pop("$defs")
            json_schema = self._inline_references(json_schema, definitions)
        json_schema = self._inline_all_of(json_schema)
        if self.is_reordered_keys:
            json_schema = self._reorder_keys(json_schema)
        if self.is_removed_titles:
            json_schema = self._remove_titles(json_schema)
        return json_schema

    def _inline_references(self, schema, definitions):
        if isinstance(schema, dict):
            for key, value in list(schema.items()):
                if key == "$ref":
                    ref_key = value.split("/")[-1]
                    schema.update(definitions[ref_key])
                    schema.pop("$ref")
                    return self._inline_references(schema, definitions)
                else:
                    schema[key] = self._inline_references(value, definitions)
        elif isinstance(schema, list):
            return [self._inline_references(item, definitions) for item in schema]
        return schema

    def _reorder_keys(self, schema):
        if not isinstance(schema, dict):
            return schema
        ordered_dict = {k: schema.pop(k) for k in self.key_order if k in schema}
        # Add remaining keys
        ordered_dict.update({k: self._reorder_keys(v) for k, v in schema.items()})
        return {k: self._reorder_keys(v) for k, v in ordered_dict.items()}

    def _remove_titles(self, schema):
        if isinstance(schema, dict):
            new_dict = {}
            for key, value in schema.items():
                if key == "title" and isinstance(value, str):
                    continue  # Skip string titles
                new_dict[key] = self._remove_titles(value)
            return new_dict
        elif isinstance(schema, list):
            return [self._remove_titles(item) for item in schema]
        return schema

    def _inline_all_of(self, schema):
        """Inlines allOf schemas if the allOf list contains only one item."""
        if isinstance(schema, dict):
            if "allOf" in schema and len(schema["allOf"]) == 1:
                # Replace the allOf construct with its single contained schema
                inlined_schema = self._inline_all_of(schema["allOf"][0])
                # If the inlined schema is a dictionary, merge it with the current schema
                if isinstance(inlined_schema, dict):
                    schema.update(inlined_schema)
                    schema.pop("allOf")
                return schema
            # Recursively apply this method to all dictionary values
            for key, value in schema.items():
                schema[key] = self._inline_all_of(value)
        elif isinstance(schema, list):
            # Recursively apply this method to all items in the list
            return [self._inline_all_of(item) for item in schema]
        return schema

Then you can subclass BaseModel and override the model_json_schema

import pydantic
from pydantic.json_schema import DEFAULT_REF_TEMPLATE


class MyBaseModel(pydantic.BaseModel):
    """Subclass of `pydantic.BaseModel` that provides additional functionality for LLM tool schema generation."""

    @classmethod
    def model_json_schema(cls, schema_generator=ParamsSchemaGenerator, **kwargs):
        # Always generate in serialization mode with the custom generator.
        return super().model_json_schema(
            mode="serialization",
            ref_template=DEFAULT_REF_TEMPLATE,
            schema_generator=schema_generator,
            **kwargs,
        )

Now when other libraries call model_json_schema on your models, it will generate the schema in the way the LLM expects to see it.
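
A quick sketch of the effect, using hypothetical models (the field names are placeholders):

from typing import Optional


# With pydantic's default generator, the schema for AccidentInfo would contain
# "$defs" and a "$ref" for Harness; the custom generator inlines them instead.
class Harness(MyBaseModel):
    manufacturer: Optional[str] = None


class AccidentInfo(MyBaseModel):
    report_as: Optional[str] = None
    harness: Optional[Harness] = None


schema = AccidentInfo.model_json_schema()
assert "$defs" not in schema and "$ref" not in str(schema)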

Hey, thanks for writing down your thought process and details about your experiments.

I am midway through doing structured entity extraction from unstructured data. I read your post and just wanted to say I appreciate you writing it all down here.

Right now I am facing issues with inconsistent results: for some entities the model gives a different result each time I run it. Did your later approaches solve the inconsistency problem too?