I’m here to brainstorm possible solutions for my labeling problem.
I have ~4500 accident reports from paragliding incidents. The reports are unstructured text; some elaborate on different aspects of the incident over multiple pages, while others are just a few lines.
My goal is to extract the semantically relevant information from the accidents into one unified taxonomy for further analysis of accident causes, etc.
I want to use topic modeling to create a unified taxonomy for all accidents, in which virtually all relevant information of each accident can be captured. The taxonomy plus one accident report will then form one API call. After ~4500 API calls, I should end up with all of my accidents represented in the unified taxonomy.
The taxonomy has different categories like weather, pilot experience, conditions of the surface, etc. These main categories are further subdivided, e.g., Weather → Wind → Velocity.
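To make the structure concrete, here is a minimal sketch of how a fragment like Weather → Wind → Velocity could be represented and flattened into the list of parameters the model has to fill. The category names below are placeholders based on my examples, not the final taxonomy:

```python
# Hypothetical taxonomy fragment; the real one will have ~150 leaf parameters.
TAXONOMY = {
    "weather": {
        "wind": {"velocity": None, "direction": None},
        "precipitation": None,
    },
    "pilot": {"experience_hours": None, "license": None},
    "surface": {"terrain": None, "slope": None},
}

def flatten(node, prefix=""):
    """Flatten the nested taxonomy into dotted parameter paths."""
    paths = []
    for key, child in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(child, dict):
            paths.extend(flatten(child, path))
        else:
            paths.append(path)
    return paths

print(flatten(TAXONOMY))
# → ['weather.wind.velocity', 'weather.wind.direction', 'weather.precipitation',
#    'pilot.experience_hours', 'pilot.license', 'surface.terrain', 'surface.slope']
```

A flat path list like this could also double as the fixed key set the model is told to use in its JSON reply.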
Right now, I am not finished with my taxonomy, but I estimate that it will have roughly 150 parameters to look out for in one accident. I worked on a similar problem a year ago, building a voice assistant with GPT. There, I used Davinci to transform spoken input into JSON with predefined JSON actions. This worked decently for most scenarios, but I had to post-process the output because the format wasn't always right, etc.
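The kind of post-processing I mean could be formalized like this: a sketch (the function and key names are made up, and the allowed key set would come from the flattened taxonomy) that parses the model's reply and drops any keys it invented outside the taxonomy, so hallucinated categories can't leak into the dataset:

```python
import json

# Hypothetical set of allowed parameter paths (~150 in the real taxonomy).
ALLOWED = {"weather.wind.velocity", "weather.wind.direction", "pilot.experience_hours"}

def validate_reply(raw, allowed=ALLOWED):
    """Parse a model reply and keep only keys defined in the taxonomy.

    Returns (clean_dict, rejected_keys); raises ValueError on malformed JSON.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by taxonomy paths")
    clean = {k: v for k, v in data.items() if k in allowed}
    rejected = sorted(set(data) - allowed)
    return clean, rejected

reply = '{"weather.wind.velocity": "35 km/h", "weather.thermals": "strong"}'
clean, rejected = validate_reply(reply)
# clean keeps only taxonomy keys; "weather.thermals" lands in rejected
```

Logging the rejected keys per report would also show how often the model drifts from the taxonomy at this scale.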
Currently, my concerns and questions are:
- With many more categories now (150, compared to 14 for my voice assistant) and a much bigger text input (the voice assistant got one sentence; now a whole accident report of up to 8 pages), GPT may use categories other than those defined in the taxonomy, or hallucinate unpredictably.
- How to effectively get structured output (here in the form of a taxonomy) from GPT?
- Would my solution even work as intended?
- Is this a smart way to approach my goal?
- What are alternatives?
I'm grateful for any input and thoughts. Thanks in advance!