Here is an alternate way to do this that you could try, which is how I would have approached this.
From what you said, your issue isn’t about getting the correct data, but your issue is about how it how it gives you a structured response.
Now one way you could achieve this is by using the basic API and fine tuning how you would like to output to be structured.
I would recommend:
Start with few shot (content from a few PDF)
Run it through a Series of prompts structured prompts as each API
If your response starts to improve, but you need to teach it from a larger corpus, create a JSONL file with at least 250 items and train it
It seems like a lot of work, but I don’t think GPTs can do this just because you have a huge amount of data.