Question Answering based on a PDF

Hi everyone,

I’m working on a project where I need to answer about 150 predefined questions using information from a 200‑page PDF.

I’ve already parsed the entire PDF and stored the extracted text as chunks in a database. Now I’m looking for guidance on the most efficient and optimized way to pass all 150 questions to the model so it can answer them accurately based on the PDF content.

Has anyone implemented something similar or found a good pattern for:

Handling a large number of questions efficiently

Structuring prompts or batching queries

Reducing token usage while keeping answers accurate

Any best practices for retrieval‑augmented setups with long documents

Any suggestions or recommended approaches would be greatly appreciated.

Thanks in advance!


The optimal approach here is to use a retrieval-augmented generation (RAG) pipeline rather than sending the entire PDF or all questions in a single prompt. Embed your PDF chunks into a vector store, then process the 150 questions individually (or in small parallel batches) by retrieving only the top-k most relevant chunks per question and passing those, along with the question, to the model. This minimizes token usage, improves retrieval precision, and yields more accurate answers. Sending all questions or large portions of the document at once generally increases cost and degrades answer quality, while RAG is the standard, production-grade solution for this scenario.
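To make the per-question loop concrete, here is a minimal sketch. The bag-of-words cosine similarity is a toy stand-in for a real embedding model and vector store, and the function names (`top_k_chunks`, `build_prompt`) are just illustrative:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts. Swap in a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank stored chunks by similarity to the question, keep the top k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    # Only the retrieved context goes to the model, keeping token usage low.
    context = "\n---\n".join(top_k_chunks(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

You would run this once per question (or in small parallel batches), sending each prompt to the model separately instead of all 150 at once.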

The problem with RAG is context loss. When I pass the entire PDF, there is a flow of knowledge across the document. RAG picks a few chunks based on semantic similarity or keyword search, so it can lose context from the preceding chunks.

Use overlapping chunks then. There are other techniques too, like spatial grouping to extract table data, and don’t forget regular expressions.
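Overlapping chunking is simple to sketch: each chunk repeats the tail of the previous one, so context at a chunk boundary appears in both. Sizes here are word counts for simplicity; counting tokens would be more precise in practice:

```python
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Split text into chunks of `size` words, each sharing `overlap`
    # words with the previous chunk so boundary context is not lost.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```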

hey bot make me a regex that finds all domains in a text.
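Since you asked: a rough pattern like this finds bare domain names in text. It is a quick heuristic, not a full RFC-compliant hostname validator, and will happily match things that merely look like domains:

```python
import re

# Labels of letters/digits/hyphens separated by dots, ending in a TLD-ish part.
DOMAIN_RE = re.compile(
    r"\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b"
)

def find_domains(text: str) -> list[str]:
    return DOMAIN_RE.findall(text)
```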

And there is NER extraction (basically small models trained on labeled user-group data to find cities, names, etc.).

So for each question use an orchestrator that selects the perfect tool combination.
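A toy version of such an orchestrator could route each question to tools with keyword rules; a real one might use an LLM function-calling step instead. The tool names here (`vector_search`, `regex`, `ner`, `table_extractor`) are purely illustrative:

```python
def select_tools(question: str) -> list[str]:
    # Pick an extraction-tool combination per question via keyword heuristics.
    q = question.lower()
    tools = ["vector_search"]  # retrieval is always on
    if any(w in q for w in ("email", "url", "domain", "phone")):
        tools.append("regex")
    if any(w in q for w in ("who", "where", "city", "name")):
        tools.append("ner")
    if any(w in q for w in ("table", "total", "column")):
        tools.append("table_extractor")
    return tools
```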
There are also images, and I have even seen background text (which made it a captcha for OCR) and symbols on CVs…
e.g. skills:

foo *****
bar ***

but not asterisks but some individual symbols/icons lol
And don’t forget to watch out for prompt injection in documents… sometimes the background and font color are even the same.

And for some specific data, e.g. floor plans inside a PDF, you may even want to train your own CNN…

In the end the best possible outcome seems to be reached by filling ontologies for multiple domains and a general ontology for spatial and temporal relation.
Then use a graphrag and a diffusion model to create the prompt for the llm :sweat_smile:
