Advanced structured output - Use case: accident research

Hey everyone!

I already made an initial post about my problem here. I am looking for improvements for my ChatGPT-based application. The application receives an accident report and a taxonomy as input.
The peculiarity compared to pure text classification / information extraction is that the taxonomy restricts the values that ChatGPT may return.
Simplified example:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": f"{instruction}{taxonomy}"  # instruction and taxonomy are shown below
        },
        {
            "role": "user",
            "content": f"{report}"  # the accident report text
        }
    ],
    seed=42,
    temperature=0,
    max_tokens=4095,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
instruction = ("You are a paraglider safety expert. "
                   "You want to classify accident reports. "
                   "Respond only in JSON format. Only Output attributes that are known."
                   "Use only one attribute per key. "
                   "To classify you may only use the attributes provided in this taxonomy: \n")
{
  "report_as": [
    "pilot",
    "flight_school_flight_instructor",
    "other",
    "authority",
    "witness",
    "passenger",
    "unknown"
  ],
  "flight_type": [
    "cross_country_flight",
    "local_flight",
    "training_flight",
    "assisted_flying_flight_travel_training",
    "competition_flight",
    "passenger_flight",
    "safety_training_flight",
    "unknown"
  ],
  "age": "number"
...

This is only an excerpt; the full taxonomy I am using holds 48 elements (1,740 tokens).

For data protection reasons I cannot post a real accident report here in the forum, but you can imagine it as a natural-language text of between 50 and 1,200 words.
My Approach:
This is obviously a difficult task for an LLM. It has to extract information, compare it with the taxonomy, form valid JSON, and do all of this for a relatively large schema.
My first approach was to integrate everything into a single prompt as shown above and use JSON mode and the large input context of gpt-4-turbo-preview and gpt-3.5-turbo-1106.
This led to convincing initial results. The format is correct in almost all cases and the model hallucinates very little.
Problems:

  1. Determinism
    The model output is not uniform: if, for example, I run 5-10 repetitions per accident report, the results sometimes differ by up to 4 elements being found or not found.
    I have already read a lot about this in the forum (e.g. 1, 2, 3) and think that I will have to accept this despite a fixed seed, the system fingerprint, and a temperature close to or equal to 0.
  2. Recall
    Unfortunately, the model finds too few elements. Something like “report_as” in particular is often only given indirectly: for example, a report written in the first-person perspective makes it clear that the pilot is also the author of the report, but this is often not clear to the model. I would like to improve this.

What I have tried so far:

  1. LangChain
  • chain = create_tagging_chain(schema, llm)

  • chain = create_extraction_chain(schema, llm)

I tried both methods with different schema representations: I represented the taxonomy as a JSON Schema with and without annotations, and as a Pydantic object.
Furthermore, I tried different strategies from this YouTube tutorial; you can test it for yourself in this Colab.

  2. Function calling

Following this article, I tried using one general information_extraction function (which is essentially the whole taxonomy), and I tried running multiple functions with different subsections of the taxonomy (one function for weather_related attributes, one for pilot_attributes, etc.); a sketch of this per-subsection, tools-based variant follows after this list.
Side note: I am aware that functions are deprecated and have been replaced by tools. I adapted this when experimenting with function calls.

  3. Hyperparameter tuning

I experimented with different temperatures and top_p values, as suggested here.

  4. Prompt engineering

Obviously, I also experimented with different formulations and chain-of-thought. Few-shot is hardly an option because the reports are very different and my API calls are already very large :wink:

  5. Multi-prompting

Currently I am trying to split my taxonomy into different sections and write a specialized prompt for each of them. Similar to multiple function calling, the instruction in one API call is, for example:

instruction = ("You are a renowned paragliding safety expert. "
               "You must search an accident report for information about the harness. "
               "...")

As the taxonomy, I only provide the elements regarding the harness.
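
For reference, a minimal sketch of the per-subsection, tools-based variant mentioned under "Function calling" above (the tool name, properties, enum values, and system message are illustrative placeholders, not my real taxonomy):

import json

from openai import OpenAI

client = OpenAI()

# Illustrative tool for one taxonomy subsection (placeholder schema).
harness_tool = {
    "type": "function",
    "function": {
        "name": "extract_harness_info",
        "description": "Extract harness-related attributes from a paragliding accident report.",
        "parameters": {
            "type": "object",
            "properties": {
                "harness_type": {
                    "type": "string",
                    "enum": ["seat_harness", "pod_harness", "reversible_harness", "unknown"],
                }
            },
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": instruction},  # the harness-specific instruction above
        {"role": "user", "content": report},
    ],
    tools=[harness_tool],
    # Force the model to call this tool so it always returns structured arguments.
    tool_choice={"type": "function", "function": {"name": "extract_harness_info"}},
)

harness_info = json.loads(response.choices[0].message.tool_calls[0].function.arguments)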

Results
Interestingly, all previous experiments have led to similar or worse results.
I am surprised by this and wonder whether there is a more promising method for my problem.
My main finding while experimenting is that the different frameworks and methods scale poorly: their examples are often much more rudimentary and less constrained than what I am asking of the model here. When I then apply these methods to my problem, I have the feeling that they simply don't work as well due to the size of the taxonomy and the complexity of the texts.

Long story short:
I am looking for a method to improve my initial prompt or the way I pose the task to the LLM. I have given an overview of what I have tried and hope one of the readers here might have an idea for me.


Hi @LeFlob - to me this looks like it might be a good candidate for a fine-tuned gpt-3.5-turbo model.

Given that you've already achieved some promising results just by integrating everything into a prompt, fine-tuning should allow you to address the issues that have surfaced.

Your training examples would consist of your existing system prompt incl. the taxonomy, the existing user message (i.e. the report), and then your desired output in JSON format.
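
For illustration, one training example in the chat fine-tuning format could look roughly like this (the assistant content is a made-up placeholder; your real system prompt would contain the full taxonomy):

import json

training_example = {
    "messages": [
        {"role": "system", "content": f"{instruction}{taxonomy}"},
        {"role": "user", "content": report},
        # Your manually verified classification for this report (placeholder values).
        {"role": "assistant", "content": json.dumps({
            "report_as": "pilot",
            "flight_type": "cross_country_flight",
            "age": 34
        })},
    ]
}

# The fine-tuning API expects one JSON object per line (JSONL).
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(training_example) + "\n")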

You could give it a try with maybe just 20-30 examples to see if it could work. Make sure to include those cases where you’ve previously experienced issues. If it works, you can subsequently expand your training data set for even more refined results.


Interesting! I rejected the idea of fine-tuning at the beginning, as the effort involved seemed disproportionately high to me. But I would like to try it out. Do you happen to have a guide on how to do it? I'm not really familiar with it.


Sure. There are a couple of resources available:

https://platform.openai.com/docs/guides/fine-tuning

https://platform.openai.com/docs/api-reference/fine-tuning

As said, you don't want to just rush into creating a huge dataset. I've found that you can often test the hypothesis of whether a task is suitable for fine-tuning with as few as 20-30 examples (the minimum is 10 examples).
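
If you want to kick off a job programmatically, a minimal sketch with the Python SDK looks roughly like this (the file name and base model are placeholders):

from openai import OpenAI

client = OpenAI()

# Upload the JSONL training file.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job on the uploaded file.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-1106",
)

# Poll the job; once finished, job.fine_tuned_model holds the new model name.
print(client.fine_tuning.jobs.retrieve(job.id).status)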


I'll try my best to keep you guys updated. I spent the last 3 hours putting together a training data set, and my first fine-tuning job has just started.
Will test that tomorrow.

I also checked out your site; let's see if I can improve my use case!

Thanks for your reply.


Hello,
I'm very interested in your use case. Did you find out what worked best for forcing the LLM to answer in the given format? I'm currently using LangChain's with_structured_output with a Pydantic v2 schema.
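
Roughly what my setup looks like (a minimal sketch assuming the langchain-openai package; "AccidentInfo" is a stand-in name for my actual Pydantic v2 schema):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
structured_llm = llm.with_structured_output(AccidentInfo)
result = structured_llm.invoke(report)  # returns an AccidentInfo instance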

Hello,
I have carried out a somewhat larger evaluation. I tried 22 accident reports with 4 prompt engineering methods and 2 function calling methods. In each case the schema was generated via Pydantic. Each accident report was analysed 5 times per method with the same experimental conditions (seed, model, temp, etc.).
Basically, I found that function calling delivered significantly better results. For the implementation, I followed the tips and documentation from Jason Liu, the author of the “Instructor” package. You can also learn a lot from his free Weights & Biases (W&B) course.

One interesting observation during my tests was that function calling generally had high precision, meaning that the entities which were extracted were usually correct. However, the recall was only around 40-50%; in other words, the model found only about half of what could theoretically have been found.

My assumption was that the schema is simply too large (my schema had 55 keys and over 170 values). Therefore, in my second approach, I did not create one function for the whole schema, but split the schema into 5 “subfunctions”. For example, one function only queried key values related to the weather, another only queried pilot-specific entities, and so on.

At first I wanted to do this only with the tools parameter of the OpenAI API, but in my tests it didn't work well when I gave gpt-3.5 or gpt-4 multiple functions within one call. Therefore I used the “asyncio” library, which allowed me to execute the 5 API calls concurrently (I think this also works with batching, but for me asyncio was the easier solution).
So I executed a separate API call for each of the 5 functions and merged the JSON strings that came back; a rough sketch of this follows below.
The recall, especially for GPT-4, increased with this approach to over 60%, with a precision of 92%. This was the best result of my evaluation.
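
A rough sketch of this pattern (assuming the OpenAI Python SDK >= 1.x; the subsection tools and the merge step here are illustrative placeholders rather than my original code):

import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()


async def extract_subsection(tool: dict, report: str) -> dict:
    # One call per taxonomy subsection, forcing the model to use that subsection's tool.
    response = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "Extract only the attributes defined in the tool."},
            {"role": "user", "content": report},
        ],
        tools=[tool],
        tool_choice={"type": "function", "function": {"name": tool["function"]["name"]}},
    )
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)


async def extract_all(subsection_tools: list, report: str) -> dict:
    # Run all subsection calls concurrently and merge the partial JSON objects.
    partials = await asyncio.gather(
        *(extract_subsection(tool, report) for tool in subsection_tools)
    )
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


# merged_result = asyncio.run(extract_all(my_five_subsection_tools, report))  # hypothetical list of 5 tool definitions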

If I were to improve this further now, I would probably only do function calling. Pydantic and Instructor are great in this context! I defined my Pydantic class in this style:

from typing import Literal, Optional
from pydantic import BaseModel, Field


class extract_accident_info_literals(BaseModel):
    report_as: Optional[
        Literal["pilot", "flight_school_flight_instructor", "witness", "authority", "passenger", "other", "unknown"]
    ] = Field(default=None, description="Who is reporting the incident?, e.g. I was doing ... = report_as: pilot")

    country: Optional[str] = Field(default=None, description="Only country code, e.g. Chile = CL")

    ....

So I used Literals and Pydantic's Field definitions. I think the description in particular is incredibly important for the LLM.
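
For illustration, a minimal sketch of how such a model can be plugged into Instructor (assuming instructor >= 1.x and the OpenAI Python SDK >= 1.x; the system prompt here is a placeholder):

import instructor
from openai import OpenAI

# Patching the client lets you pass a response_model and get back a validated Pydantic object.
client = instructor.from_openai(OpenAI())

info = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    response_model=extract_accident_info_literals,
    messages=[
        {"role": "system", "content": "Extract the accident attributes defined in the schema."},
        {"role": "user", "content": report},
    ],
)
print(info.model_dump_json(indent=2))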

Hope this helps :wink:


Be careful setting defaults, because this gives the model an easy way out and can make it lazy on the data extraction. It's best to make it explicitly state the value, even if it is unknown. It's also helpful to know that most, if not all, of the frameworks that use Pydantic dump the schema directly from it, and while some remove the redundant title, none of them (that I have seen) resolve $ref and $defs, anyOf, or allOf. This can be problematic since the models are trained on flattened schemas. In other words, BaseModel.model_json_schema does not always produce schemas in the way they are represented in the training datasets.

To overcome these limitations, you can subclass GenerateJsonSchema and override the generate method.

from typing import List

from pydantic.json_schema import GenerateJsonSchema


class ParamsSchemaGenerator(GenerateJsonSchema):
    """
    Custom schema generator that:
        1. Inlines all references for LLM tools (resolves refs/defs).
        2. Reorders keys for LLM optimization.
        3. Removes redundant titles from the schema for token savings.

    Attributes:
        key_order (List[str]): Order of keys for optimization.
        is_reordered_keys (bool): Flag to enable key reordering.
        is_removed_titles (bool): Flag to enable title removal.

    Methods:
        generate(schema, mode="serialization"): Generate the optimized JSON schema.
        _inline_references(schema, definitions): Inline references in the schema.
        _reorder_keys(schema): Reorder keys in the schema.
        _remove_titles(schema): Remove titles from the schema.
    """

    key_order: List[str] = [
        "name",
        "title",
        "type",
        "format",
        "enum",
        "description",
        "properties",
        "required",
        "items",
    ]
    is_reordered_keys: bool = True
    is_removed_titles: bool = True
    _metadata = {}

    @classmethod
    def _add_metadata(cls, **metadata):
        cls._metadata.update(metadata)

    def generate(self, schema, mode="serialization"):
        json_schema = super().generate(schema, mode)
        if "title" in json_schema:
            # json_schema["__name"] = json_schema.pop("title")
            self._add_metadata(name=json_schema.pop("title"))
        if "description" in json_schema:
            # json_schema["__description"] = json_schema.pop("description")
            self._add_metadata(description=json_schema.pop("description"))
        if "$defs" in json_schema:
            definitions = json_schema.pop("$defs")
            json_schema = self._inline_references(json_schema, definitions)
        json_schema = self._inline_all_of(json_schema)
        if self.is_reordered_keys:
            json_schema = self._reorder_keys(json_schema)
        if self.is_removed_titles:
            json_schema = self._remove_titles(json_schema)
        return json_schema

    def _inline_references(self, schema, definitions):
        if isinstance(schema, dict):
            for key, value in list(schema.items()):
                if key == "$ref":
                    ref_key = value.split("/")[-1]
                    schema.update(definitions[ref_key])
                    schema.pop("$ref")
                    return self._inline_references(schema, definitions)
                else:
                    schema[key] = self._inline_references(value, definitions)
        elif isinstance(schema, list):
            return [self._inline_references(item, definitions) for item in schema]
        return schema

    def _reorder_keys(self, schema):
        if not isinstance(schema, dict):
            return schema
        ordered_dict = {k: schema.pop(k) for k in self.key_order if k in schema}
        # Add remaining keys
        ordered_dict.update({k: self._reorder_keys(v) for k, v in schema.items()})
        return {k: self._reorder_keys(v) for k, v in ordered_dict.items()}

    def _remove_titles(self, schema):
        if isinstance(schema, dict):
            new_dict = {}
            for key, value in schema.items():
                if key == "title" and isinstance(value, str):
                    continue  # Skip string titles
                new_dict[key] = self._remove_titles(value)
            return new_dict
        elif isinstance(schema, list):
            return [self._remove_titles(item) for item in schema]
        return schema

    def _inline_all_of(self, schema):
        """Inlines allOf schemas if the allOf list contains only one item."""
        if isinstance(schema, dict):
            if "allOf" in schema and len(schema["allOf"]) == 1:
                # Replace the allOf construct with its single contained schema
                inlined_schema = self._inline_all_of(schema["allOf"][0])
                # If the inlined schema is a dictionary, merge it with the current schema
                if isinstance(inlined_schema, dict):
                    schema.update(inlined_schema)
                    schema.pop("allOf")
                return schema
            # Recursively apply this method to all dictionary values
            for key, value in schema.items():
                schema[key] = self._inline_all_of(value)
        elif isinstance(schema, list):
            # Recursively apply this method to all items in the list
            return [self._inline_all_of(item) for item in schema]
        return schema

Then you can subclass BaseModel and override the model_json_schema

import pydantic
from pydantic.json_schema import DEFAULT_REF_TEMPLATE


class MyBaseModel(pydantic.BaseModel):
    """Subclass of `pydantic.BaseModel` that provides additional functionality for LLM tool schema generation."""

    @classmethod
    def model_json_schema(cls, schema_generator=ParamsSchemaGenerator, **kwargs):
        # Always generate in serialization mode with the custom generator.
        return super().model_json_schema(
            mode="serialization",
            ref_template=DEFAULT_REF_TEMPLATE,
            schema_generator=schema_generator,
            **kwargs,
        )

Now when other libraries call model_json_schema on your models, it will generate the schema in the way the LLM expects to see it.
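
A quick sketch of the effect, using hypothetical models (the field names are placeholders):

from typing import Optional


# With pydantic's default generator, the schema for AccidentInfo would contain
# "$defs" and a "$ref" for Harness; the custom generator inlines them instead.
class Harness(MyBaseModel):
    manufacturer: Optional[str] = None


class AccidentInfo(MyBaseModel):
    report_as: Optional[str] = None
    harness: Optional[Harness] = None


schema = AccidentInfo.model_json_schema()
assert "$defs" not in schema and "$ref" not in str(schema)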

Hey, thanks for writing down your thought process and details about your experiments.

I am midway through doing structured entity extraction from unstructured data. I read your post and just wanted to say I appreciate you writing it all down here.

Right now I am facing issues with inconsistent results: for some entities the model gives a different result each time I run it. Did your later approaches solve the inconsistency problem too?