Best Approach to Extract Key Data from a Structured PDF with LLM

I’m looking to extract highly specific information from a three-page PDF document: six key data points, including the product name, manufacturer, contained substances, and legal mentions.

The document is divided into four distinct sections, each addressing a different topic, and the information I need is scattered across the entire text.

I’m trying to identify the most efficient strategy to obtain answers that are precise, coherent, concise, and especially free of hallucinations.

Here are the options I’m considering:

  1. Send the full text in a single prompt to the LLM. Each section is about 1000 tokens, so the total fits within a 4000-token context window.

  2. Embed the full document as a single chunk, then ask the questions against it separately.

  3. Embed each section separately, then query the model either once or up to six times (one per item of interest); see the retrieval sketch after this list.

  4. If the total data size exceeds 3000–4000 tokens, a retrieval-augmented generation (RAG) approach would be required. In that case, what chunking strategy would you recommend? And should I prompt with one general question or split it into six targeted ones?
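For option 3, here is a minimal sketch of per-section embedding plus per-question retrieval, assuming the OpenAI Python client (v1.x). The model names, section labels, and question wording are placeholders for illustration, not recommendations:

```python
# Embed each section once, retrieve the most relevant section per question,
# then ask the model using only that section as context.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Placeholder section texts -- replace with the text extracted from the PDF.
sections = {
    "identification": "…text of section 1…",
    "composition": "…text of section 2…",
    "usage": "…text of section 3…",
    "legal": "…text of section 4…",
}

questions = [
    "What is the product name?",
    "Who is the manufacturer?",
    "Which substances does the product contain?",
    "What legal mentions appear in the document?",
    # …remaining two data points…
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

section_names = list(sections)
section_vecs = embed([sections[name] for name in section_names])

for question in questions:
    q_vec = embed([question])[0]
    # Cosine similarity: pick the section closest to the question.
    sims = section_vecs @ q_vec / (
        np.linalg.norm(section_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    best = section_names[int(np.argmax(sims))]
    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided excerpt. "
                        "If the answer is not present, say so."},
            {"role": "user",
             "content": f"Excerpt:\n{sections[best]}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    print(question, "->", answer.choices[0].message.content)
```

Asking one question per call like this trades more (small) requests for tighter context, which tends to reduce hallucination at the cost of extra latency and per-call overhead.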

I’d also really appreciate it if you could share the strengths and weaknesses of each approach, especially in terms of inference cost (compute, time) and risk of hallucination.


Hope you’ve seen this announcement.


If the document content is easily available as text and it’s only three pages, I would include the full text in the prompt (with start/end delimiters). Getting the right answers that way should be no problem at all.
The solution mentioned above is very cool, but if you have simple PDFs, extracting the text yourself and adding it to the prompt is much cheaper token-wise.
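For example, a minimal sketch of that single-prompt approach with the OpenAI Python client, assuming the text has already been extracted to a file called `datasheet.txt`; the model name is an assumption, and the last two JSON keys are placeholders since only four of the six data points were named above:

```python
# Wrap the extracted text in explicit start/end delimiters and ask for all
# six fields in a single call, returned as JSON.
from openai import OpenAI

client = OpenAI()
full_text = open("datasheet.txt", encoding="utf-8").read()  # text extracted from the 3-page PDF

prompt = f"""Extract the following fields from the document between the markers.
Return JSON with exactly these keys: product_name, manufacturer, substances,
legal_mentions, field_5, field_6. Use null for anything not stated in the document.

### DOCUMENT START ###
{full_text}
### DOCUMENT END ###"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # constrain the reply to valid JSON
    temperature=0,
)
print(resp.choices[0].message.content)
```

Asking the model to answer with `null` rather than guess, and keeping the temperature at 0, are simple ways to keep the output grounded in the provided text.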


Convert the PDF to Word and extract the information more easily from the Word file.

You can do it with I Love PDF.
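If you go that route, here is a minimal sketch of pulling the text back out of the converted file with python-docx, assuming the conversion produced a file named `document.docx`:

```python
# Read the converted Word file and flatten its paragraphs into plain text
# that can be dropped straight into a prompt.
from docx import Document

doc = Document("document.docx")
full_text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())
print(full_text[:500])  # preview the first 500 characters
```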

Except the issue here is data extraction, manipulation, and visualization. The visual component translates most seamlessly, but OCR, while great, is not an effective solution for high-level data analytics or simplified graphical renderings.