If you want literal ‘word counts’ there are great tools for that (convert the PDF to text and simply use Python). If you look up the ‘strawberry problem’ you will understand the challenge with your example. It could probably only come up with the literal answer correctly by letting it do code execution and proceed as I described above.
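A minimal sketch of that code-execution route, assuming the pypdf package (file name and search term are placeholders):

```python
from pypdf import PdfReader  # pip install pypdf

# Extract the raw text of every page, then count with plain Python.
reader = PdfReader("report.pdf")  # placeholder file name
text = "\n".join(page.extract_text() or "" for page in reader.pages)

total_words = len(text.split())
tesla_mentions = text.upper().count("TESLA")  # literal, case-insensitive count
print(total_words, tesla_mentions)
```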
Now if this were an actual report text and you asked it to create a list of the companies covered and which ones are mentioned the most, you’d probably get pretty good results. But again - if you judge the answer by the literal mentions of the company name, you’d be better off with coding.
Lastly, your results will differ greatly based on the model you’re using.
Maybe you want to share a real document that you are working on, and we can take a look?
I understand your point, and I’m aware of the “strawberry problem.” However, I don’t always know what kind of question the user will ask. If the user wants to ask about a word count, they have the right to, and the model should at least attempt to provide an accurate response.
For your information, even when I convert the PDF into plain text, the word counting is often still incorrect. This raises the question of whether PDFs are really the best format for these types of tasks. If not, perhaps I should explicitly include a note in the prompt that if the model sees the word “count” in the question, it should either:
Clearly state that it’s not capable of providing a reliable count, or
Provide a step-by-step procedure for the user to achieve it themselves (e.g., using Python).
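If you go that route, the note can be a single extra instruction in the system prompt. The wording below is only illustrative, not a tested prompt:

```python
# Illustrative wording only; adjust to your own prompt style.
COUNT_NOTE = (
    "If the question asks to count words, characters, or occurrences, do not guess. "
    "Either state clearly that you cannot count reliably, or give the user a short, "
    "step-by-step procedure (e.g. a small Python script) to get an exact count."
)
```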
Also, as you mentioned, even with a simple PDF like the one I showed earlier (the one with “TESLA” mentioned multiple times), the model struggles to transcribe the content accurately without missing a single instance of “TESLA.” So even when I ask it to write the full content of a small PDF, it doesn’t always succeed.
I think addressing this limitation clearly in the model’s responses would help manage expectations. Let me know what you think!
Currently there are too many constraints to ‘solve’ this universally, and I’m afraid that ‘AGI’ for this is not around the corner. The difference between a small and a big PDF, for example, determines whether the current code execution can handle the task of extracting the text from the PDF and doing a Python count. Big PDF: it won’t work, because the runtime of the code execution is limited. The context window is a big challenge, especially for these types of tasks, which are generally ‘easy’ to solve with traditional coding / querying.
You could also consider providing tools for the model (a word count tool) that would work directly on the document - but it might FEEL like all those things defeat the purpose of having a smart assistant.
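As a rough sketch of that idea (the tool name and schema below are hypothetical, following the Chat Completions function-calling format):

```python
# Hypothetical tool definition the model could call instead of counting itself.
word_count_tool = {
    "type": "function",
    "function": {
        "name": "count_occurrences",
        "description": "Count exact occurrences of a term in the extracted document text.",
        "parameters": {
            "type": "object",
            "properties": {
                "term": {"type": "string", "description": "Exact term to count."},
            },
            "required": ["term"],
        },
    },
}

def count_occurrences(term: str, document_text: str) -> int:
    """Executed locally when the model requests the tool."""
    return document_text.lower().count(term.lower())
```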
Google Gemini is making good progress with things like Google Docs integration - it would be fun to try it and have those questions answered (from inside Google Docs). Select a PDF in Google Drive and talk to Gemini about it. So far I have seen pretty good results. (And it has a very large context window.)
I wouldn’t say that. You have no idea what is going on on the numerous computers around the world.
I would say the solution for that will come from a single person and not a company.
I mean that I think ‘AGI’ is a harder concept than ‘human smart’. If by AGI you mean that the model will always be able to figure out a way ‘to get the job done’, then yes, I can see that. BUT then I am sure (as we see a lot on here as well) the question of price / tokens will come up. Handling a 100 or 1,000 page PDF will always be a different type of ‘processing / context / memory’ job than a 1 or 10 page PDF. AGI or not.
When you cut the PDF into single pages, chunk it into overlapping parts, and then use an LLM on it to extract data, you will always have problems.
Whereas when you split it up along multiple dimensions of analysis - e.g. create a paragraph -> sentence -> word tree, parse out tables (e.g. by spatially grouping words that share a common area), do analysis with word clouds, ColBERT, abstraction, grammar trees, etc., and use algorithms to create multiple subgraphs in a graph DB - you get other possibilities / tools for the GPT to use than just programming.
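A very naive sketch of such a paragraph -> sentence -> word tree (a real pipeline would use a proper sentence tokenizer and layout-aware parsing, not regexes):

```python
import re

def build_text_tree(text: str) -> list[dict]:
    """Split text into paragraphs, sentences, and words (naive, regex-based)."""
    tree = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
        tree.append({
            "paragraph": paragraph,
            "sentences": [{"text": s, "words": s.split()} for s in sentences],
        })
    return tree
```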
Using a mix of programmatic extraction - e.g. in PHP code, look for keywords like class or function to create a dependency graph… there are so many ways to extract data.
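For instance, a rough dependency-graph sketch over PHP source (regex heuristic only; a real extractor would use an actual PHP parser such as nikic/php-parser):

```python
import re
from collections import defaultdict

def php_class_dependencies(source: str) -> dict[str, set[str]]:
    """Map each declared class to the classes it instantiates with `new` (rough heuristic)."""
    deps = defaultdict(set)
    chunks = re.split(r"\bclass\s+(\w+)", source)  # [preamble, name1, body1, name2, body2, ...]
    for i in range(1, len(chunks) - 1, 2):
        class_name, body = chunks[i], chunks[i + 1]
        deps[class_name].update(re.findall(r"\bnew\s+(\w+)", body))
    return dict(deps)
```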
Training special models automatically when the extracted subgraph was labeled successful multiple times. Even agents that could be fine-tuned when a user flags that data was wrongly extracted, with a mechanism that creates a new agent. There are so many techniques to combine.
Counting stuff in a PDF, from fuzzy to semantic matching, is definitely solvable in general.
Not with a couple of calls to an LLM, though.
I feel that the issue is not just about counting; it seems that I am not reading the entire PDF.
For example, when I use the API to display the full content of a very small PDF (like the one in the screenshot I sent), it fails to include all the data (not all the time; sometimes it works well).
Currently, I convert the PDF to images. Do you think there is a better way to read the PDF, including images and graphs, to ensure that I capture and send all the data accurately?
Converting to images is probably the worst way to do it, because then the image will be your input, which is first of course super expensive in terms of tokens, but will also probably lose text here and there because of resolution. And of course each page de facto becomes its own section. So really the way to go is the opposite: get the full text of the document first.
I don’t think it’s the worst way because Vision ChatGPT is recommended by OpenAI. I have already tested extracting only the text, but I also want to include all graphs and images. Maybe I will try Marker, MinerU, or other alternatives.
When you create such a structure you can find similar “keys” or “Extracted Data Types” and count them.
You can do topic mining and group them.
You can do fuzzy search on the nodes.
This is a lot easier for the model - since you don’t ask it something like “give me all the data at once in the following structure”, but instead you chunk it on multiple levels and you can compare the chunk evaluations by comparing subgraph similarities.
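A minimal example of fuzzy search over node labels, using only the standard library (the labels are made up):

```python
import difflib

node_labels = ["Tesla Inc", "TESLA", "Tesla, Inc.", "Tesla Motors", "Apple Inc"]  # made-up labels

def fuzzy_find(query: str, labels: list[str], cutoff: float = 0.6) -> list[str]:
    """Return labels that approximately match the query, ignoring case."""
    lowered = {label.lower(): label for label in labels}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=10, cutoff=cutoff)
    return [lowered[h] for h in hits]

print(fuzzy_find("tesla", node_labels))
```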
A vision model alone will not extract everything. It is not reliable for data extraction - just keep that in mind. It is not usable in critical apps, e.g. when you want to analyse a document to give suggestions on medication for a patient. People will die then!
Or when you use it for invoice data extraction: companies will pay too much!
Or when you use it for CV data extraction: people will not get hired even when they are the best qualified candidate
Or when you use it for summaries in science: someone will use it when they are planning some autonomous weapon construction and that thing will then go after you - just because you made a bad data extraction
Consider playing with a tool like LlamaParse first. That will get you markdown text from any document and, depending on the settings, will convert charts etc. as well. Then you feed in the full-text version of the document and you will see much better results.
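Something along these lines - a sketch only, so check the llama_parse docs for the current API; the result_type and load_data usage here reflects one common pattern, and the key and file name are placeholders:

```python
from llama_parse import LlamaParse  # pip install llama-parse

# Sketch only; parameter names may differ between versions, so verify against the docs.
parser = LlamaParse(api_key="llx-...", result_type="markdown")  # placeholder key
docs = parser.load_data("report.pdf")                           # placeholder file name
full_text = "\n\n".join(doc.text for doc in docs)
```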
Going to open source a full local GraphRAG solution with a microservice architecture including RabbitMQ, Postgres, pgvector, PostGIS, Neo4j and preconfigured APIs in PHP/Symfony and Python in a few days…
Just wait for it…
you will be able to just clone it and run it with
make prepare
(obviously it will ask for a couple of credentials, API keys, deployment IDs, …)
which creates the whole infrastructure…
Also has MinIO included so you can send documents/files/messages… to an incoming bucket and it fills the graph…
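Purely to illustrate that flow (this is not code from the repo): with the MinIO Python client, a small worker could poll the incoming bucket and hand each new object to an ingestion step; endpoint, credentials, bucket name, and ingest() are placeholders.

```python
from minio import Minio  # pip install minio

# Placeholders for illustration only.
client = Minio("localhost:9000", access_key="minio", secret_key="minio123", secure=False)

def ingest(name: str, data: bytes) -> None:
    """Placeholder: parse the document and write nodes/edges into the graph."""
    ...

for obj in client.list_objects("incoming", recursive=True):
    response = client.get_object("incoming", obj.object_name)
    try:
        ingest(obj.object_name, response.read())
    finally:
        response.close()
        response.release_conn()
```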
It is basically an advanced assistant with unlimited memory
For now it just enables using OpenAI models from their API and from your Azure deployments, but it is really modular and uses a strategy pattern in case you want to use other stuff… the modules just need to implement a certain interface.
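Purely illustrative (not the actual interface from that project), a strategy-style interface for model providers could look like this:

```python
from abc import ABC, abstractmethod

class ChatModelStrategy(ABC):
    """Hypothetical interface every model module would implement."""

    @abstractmethod
    def complete(self, messages: list[dict]) -> str:
        """Take a chat history and return the assistant reply."""

class OpenAIStrategy(ChatModelStrategy):
    def __init__(self, client, model: str = "gpt-4o"):
        self.client = client  # e.g. openai.OpenAI()
        self.model = model

    def complete(self, messages: list[dict]) -> str:
        resp = self.client.chat.completions.create(model=self.model, messages=messages)
        return resp.choices[0].message.content
```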
As you can imagine, such a system required years of sitting in front of the code so that it autoconfigures everything.
I can tell my arms and even fingers hurt from typing…