Hey everyone!
I already made an initial post about my problem here. I am looking for ways to improve my ChatGPT-based application. The application receives an accident report and a taxonomy as input.
What sets this apart from plain text classification / information extraction is that the taxonomy restricts the values ChatGPT may return.
Simplified example:
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": f"{instruction}{taxonomy}"
        },
        {
            "role": "user",
            "content": f"{report}"
        }
    ],
    seed=42,
    temperature=0,
    max_tokens=4095,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
instruction = ("You are a paraglider safety expert. "
               "You want to classify accident reports. "
               "Respond only in JSON format. Only output attributes that are known. "
               "Use only one attribute per key. "
               "To classify, you may only use the attributes provided in this taxonomy: \n")
{
"report_as": [
"pilot",
"flight_school_flight_instructor",
"other",
"authority",
"witness",
"passenger",
"unknown"
],
"flight_type": [
"cross_country_flight",
"local_flight",
"training_flight",
"assisted_flying_flight_travel_training",
"competition_flight",
"passenger_flight",
"safety_training_flight",
"unknown"
],
"age": "number"
...
This is only an excerpt. The full taxonomy I am using holds 48 elements (1,740 tokens).
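Since the taxonomy fixes the set of legal values, one cheap guard against hallucinated attributes is to post-validate the model's JSON against it. A minimal sketch (the function name is mine, not part of my app):

```python
def filter_by_taxonomy(output: dict, taxonomy: dict) -> dict:
    """Drop any key/value pair that violates the taxonomy:
    unknown keys, values outside the allowed list, or
    non-numeric values for 'number' attributes."""
    valid = {}
    for key, value in output.items():
        allowed = taxonomy.get(key)
        if isinstance(allowed, list) and value in allowed:
            valid[key] = value
        elif allowed == "number" and isinstance(value, (int, float)):
            valid[key] = value
    return valid
```

This does not improve recall, but it guarantees that whatever survives is taxonomy-conformant.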
For data protection reasons I cannot post a real accident report here in the forum, but you can imagine it as a natural-language text of 50-1,200 words.
My Approach:
This is obviously a difficult task for an LLM: it has to extract information, compare it with the taxonomy, form valid JSON, and do all of this for a relatively large schema.
My first approach was to put everything into a single prompt as shown above and use JSON mode and the large input context of gpt-4-turbo-preview and gpt-3.5-turbo-1106.
This led to convincing initial results: the format is correct in almost all cases and the model hallucinates very little.
Problems:
- determinism
The model output is not uniform: if I make 5-10 repetitions per accident report, the runs sometimes differ by up to 4 elements found or not found. I have already read a lot about this in the forum (e.g. 1, 2, 3) and think I will have to accept it despite seed, fingerprint, and a temperature close to or equal to 0.
- recall
Unfortunately, the model finds too few elements. Something like "report_as" in particular is often only given indirectly. For example, a report written in the first person makes it clear that the pilot is also the author of the report, but this is often not clear to the model. I would like to try to improve this.
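One way I could imagine taming the run-to-run variance (a sketch of an idea, not something in my current code) is to call the API several times and keep, per key, only the value a majority of runs agree on:

```python
from collections import Counter

def majority_merge(runs: list) -> dict:
    """Per key, keep the value that most runs agree on.
    Keys that win fewer than half of the runs are dropped,
    which also suppresses one-off hallucinated attributes."""
    merged = {}
    all_keys = {key for run in runs for key in run}
    for key in all_keys:
        votes = Counter(run[key] for run in runs if key in run)
        value, count = votes.most_common(1)[0]
        if count >= len(runs) / 2:
            merged[key] = value
    return merged
```

With three runs where two say `"report_as": "pilot"` and one says `"witness"`, the merged record keeps `"pilot"`; an attribute that appears in only one run is discarded.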
What I have tried so far:
- langchain
chain = create_tagging_chain(schema, llm)
chain = create_extraction_chain(schema, llm)
I tried both methods with different schema representations: I represented the taxonomy as a JSON Schema with and without annotations, and as a Pydantic object.
Furthermore, I tried different strategies from this YouTube tutorial; you can test it for yourself in this Colab.
- Function Calling
Following this article, I tried using one general information_extraction function (which essentially is the whole taxonomy), and I tried running multiple functions with different subsections of the taxonomy (one function for weather-related attributes, one for pilot attributes, etc.).
Side note: I am aware that functions are deprecated and have been replaced by tools. I adapted this when experimenting with function calls.
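For reference, this is roughly how I would express the enum constraints when going the tools route, so the allowed values live in the JSON Schema rather than only in the prompt text. The helper name and property layout are my own sketch, not the app's actual code:

```python
def taxonomy_to_tool(taxonomy: dict) -> dict:
    """Turn the taxonomy dict into a single 'tools' entry whose
    JSON Schema encodes the allowed values as enums."""
    properties = {}
    for key, allowed in taxonomy.items():
        if isinstance(allowed, list):
            properties[key] = {"type": "string", "enum": allowed}
        else:  # e.g. "age": "number"
            properties[key] = {"type": allowed}
    return {
        "type": "function",
        "function": {
            "name": "classify_accident_report",
            "description": "Classify an accident report against the taxonomy.",
            "parameters": {
                "type": "object",
                "properties": properties,
                # deliberately no 'required': unknown attributes may be omitted
            },
        },
    }
```

The resulting dict can be passed in the `tools` list of the chat completions call.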
- Hyperparameter tuning
I experimented with different temperature and top_p values, as suggested here.
- Prompt engineering
Obviously I also experimented with different formulations and chain of thought. Few-shot is hardly an option because the reports are very different and my API calls are already very large.
- Multi-Prompting
Currently I am trying to split my taxonomy into different sections and write a specialized prompt for each, similar to multiple function calling. In one API call the instruction is
instruction = ("You are a renowned paragliding safety expert. "
               "You must search an accident report for information about the harness. "
               "...... ")
As the taxonomy, I only provide the elements regarding the harness.
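The per-section answers then have to be merged back into one record. A minimal sketch of that glue (the helper is hypothetical; section contents are illustrative):

```python
def merge_sections(section_results: list) -> dict:
    """Merge the JSON outputs of the per-section prompts into one record.
    Because each prompt only sees its own slice of the taxonomy, the key
    sets should be disjoint; a collision signals overlapping sections."""
    merged = {}
    for result in section_results:
        overlap = merged.keys() & result.keys()
        if overlap:
            raise ValueError(f"sections overlap on keys: {sorted(overlap)}")
        merged.update(result)
    return merged
```

Raising on overlap, rather than silently overwriting, makes it obvious when two section prompts were accidentally given the same taxonomy element.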
Results
Interestingly, all of these experiments have led to similar or worse results.
I am surprised by this and wonder whether there is a more promising method for my problem.
My main finding while experimenting is that the different frameworks and methods scale poorly. The published examples are often much more rudimentary and open-ended than what I am asking of the model here. When I apply these methods to my problem, they simply don't seem to work as well, presumably due to the size of the taxonomy and the complexity of the texts.
Long story short:
I am looking for a way to improve my initial prompt or the task I set the LLM. I have given an overview of what I have tried and hope one of the readers here has an idea for me.