Hey everyone!
I’ve been trying to automate the conversion of a long PDF to a spreadsheet, and succeeded…as long as there is a prepopulated list of names for the search to use as an initial reference.
But try as I might, I cannot fully automate this first step of name retrieval.
The PDF has a semi-standardized format, but the only thing that indicates a new item is some whitespace and a slightly larger bold maroon font. Items can vary greatly in length, and the two-column layout seems particularly challenging.
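For anyone curious what "slightly larger bold maroon font" means in machine terms: the cue is at least detectable at the span level. Here's a minimal sketch, assuming PyMuPDF and made-up size/color cutoffs (not measured from my actual document), of what flagging those headings directly would look like:

```python
# Minimal sketch (assumes PyMuPDF; the size/color cutoffs are illustrative
# placeholders, not values taken from the real document).
import fitz  # PyMuPDF

MAROON = 0x800000    # hypothetical: heading color as packed sRGB int
BODY_SIZE = 10.0     # hypothetical: typical body-text point size
BOLD_FLAG = 1 << 4   # PyMuPDF marks bold in bit 4 of span["flags"]

def candidate_names(path: str) -> list[str]:
    names = []
    for page in fitz.open(path):
        # "dict" mode exposes per-span font size, flags, and color
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                for span in line["spans"]:
                    is_bold = span["flags"] & BOLD_FLAG
                    is_bigger = span["size"] > BODY_SIZE
                    is_maroon = span["color"] == MAROON
                    if is_bold and is_bigger and is_maroon:
                        names.append(span["text"].strip())
    return names
```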
Mini can’t do it at all (though it’s top-notch at finding something if you already know the name). Ask it to “give me a list of names as you find them in this PDF” and all it seems to find are “Giant [miscellaneous creatures that don’t exist].”
4o’s attempts are “okay,” with only a few errors in the list…but far from safe enough to let run by itself. It usually goes off track after the first few entries, even when looking for as few as five items.
Tried Everything OpenAI Offers
I’ve exhausted every option OpenAI offers:
- Basic Prompting via a CustomGPT
- Basic Prompting via an Assistant on the Playground with Mini and 4o
- An Assistant with only one document in a Vector Store.
- An Assistant with the score_threshold changed.
- A Vector Store with a smaller chunking strategy, which, in theory, should give greater overlap across the sections in question and help with the two-column layout.
- Various permutations of temperature, top_p, and max_num_results. (A configuration sketch covering these settings follows this list.)
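In case it helps anyone reproduce this, here’s roughly what that setup looks like in the Python SDK; the numeric values are placeholders, not my exact settings:

```python
# Sketch of the setup above (Assistants API v2, Python SDK). All of the
# numeric values are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the single source document.
uploaded = client.files.create(file=open("source.pdf", "rb"), purpose="assistants")

# One document in its own Vector Store, with a smaller chunking strategy
# than the 800/400-token default, for more overlap across the two columns.
vector_store = client.beta.vector_stores.create(
    name="pdf-name-extraction",
    file_ids=[uploaded.id],
    chunking_strategy={
        "type": "static",
        "static": {"max_chunk_size_tokens": 400, "chunk_overlap_tokens": 200},
    },
)

# Assistant with file_search, ranking_options, and sampling settings.
assistant = client.beta.assistants.create(
    model="gpt-4o",  # or "gpt-4o-mini"
    instructions="List every item name exactly as it appears in the PDF.",
    tools=[{
        "type": "file_search",
        "file_search": {
            "max_num_results": 20,
            "ranking_options": {"ranker": "auto", "score_threshold": 0.5},
        },
    }],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
    temperature=0.2,
    top_p=0.5,
)
```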
Here’s the Python code and the source document if you wanna give it a try.
(More interestingly, the script shows analytics from the search and makes it easy to play with settings and see relevant results.)
Score Threshold is Useful Overall
The new ranking_options are helpful, but they don’t constrain the model to read a PDF as a human does.
- Examining the run steps and then setting a higher score_threshold DIDN’T help pull my list, but it DID significantly reduce prompt_tokens, which was very useful for cutting costs when 4o performed the search. (There’s a sketch of pulling scores out of the run steps after this list.)
- This made 4o competitive for performing the search; it was better at it from the start, but prohibitively expensive.
- You can only set a floor for score_threshold. It yields “no results” even if there are results below the floor (which makes sense), but it would be helpful to be able to set a range. As it is, you have to examine the search results first; otherwise there’s no way to know what scores the ranker is actually awarding.
- In the future, it would be helpful to constrain a model to “reading a document as a human does,” if that’s even possible.
- score_threshold is related to the model used to perform the search: 4o found chunks with higher average scores than mini did under identical circumstances.
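For anyone who hasn’t dug into run steps yet, this is roughly how the per-chunk scores can be pulled out. It assumes the `client`, `thread`, and `run` objects from the surrounding flow; the `include` flag is what exposes the matched chunk content:

```python
# Sketch: pull per-chunk ranker scores from a completed run's steps.
steps = client.beta.threads.runs.steps.list(
    thread_id=thread.id,
    run_id=run.id,
    include=["step_details.tool_calls[*].file_search.results[*].content"],
)
for step in steps.data:
    if step.step_details.type != "tool_calls":
        continue
    for call in step.step_details.tool_calls:
        if call.type == "file_search":
            for result in call.file_search.results:
                print(f"{result.file_name}: score={result.score:.3f}")
```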
It was possible to reduce the number of search results without affecting their quality (which wasn’t great to begin with), which seems to indicate that the content I’m looking for is successfully being pulled in the first chunks.
Any Ideas What’s Next?
When it comes to automating data extraction from an irregular PDF to a CSV, I’m at the point where I think the most valuable advice is to create this initial names list manually, or with close supervision.
I think it would be several times faster to read the names off the list while ChatGPT listened than to go through all of this. The goal is only to automate this process “as much as possible,” and there is something to be said for deliberately including human oversight at this stage.
That said, my next step is to create a small RAG flow: have an Assistant look at the current results, then send them back through for correction, and see if that improves the final list. score_threshold really does help minimize costs.
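Here’s a minimal sketch of that correction pass, assuming the first-pass names are already collected and reusing the `client` and `assistant` from the setup above; the placeholder names and prompt wording are just illustrative:

```python
# Sketch of the planned correction pass: feed the first-pass list back
# through the same file_search assistant for verification.
draft_names = ["Name One", "Name Two"]  # illustrative first-pass output

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=(
        "Here is a draft list of item names extracted from the PDF:\n"
        + "\n".join(f"- {n}" for n in draft_names)
        + "\n\nFor each name, search the document and confirm it appears "
        "verbatim. Return the corrected list, fixing or dropping any name "
        "you cannot find."
    ),
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)
if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)  # newest message first
```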
I am surprised this step is so difficult.
Any thoughts or insights are most welcome!