Best Approach to Extract Key Data from a Structured PDF with LLM

I’m looking to extract highly specific information from a three-page PDF document: six key data points, including the product name, manufacturer, contained substances, and legal mentions.

The document is divided into four distinct sections, each addressing a different topic, and the information I need is scattered across the entire text.

I’m trying to identify the most efficient strategy to obtain answers that are precise, coherent, concise, and especially free of hallucinations.

Here are the options I’m considering:

  1. Send the full text in a single prompt to the LLM. Each section is about 1000 tokens, so the total fits within a 4000-token context window.

  2. Embed the full document as a single chunk, then ask the questions against it separately.

  3. Embed each section separately, then query the model either once or up to six times (one per item of interest); see the retrieval sketch after this list.

  4. If the total data size exceeds 3000–4000 tokens, a retrieval-augmented generation (RAG) approach would be required. In that case, what chunking strategy would you recommend? And should I prompt with one general question or split it into six targeted ones?
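For option 3, here is a minimal sketch of per-section embedding plus per-question retrieval, assuming the OpenAI Python client (v1.x). The model names, section labels, and question wording are placeholders for illustration, not recommendations:

```python
# Embed each section once, retrieve the most relevant section per question,
# then ask the model using only that section as context.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Placeholder section texts -- replace with the text extracted from the PDF.
sections = {
    "identification": "…text of section 1…",
    "composition": "…text of section 2…",
    "usage": "…text of section 3…",
    "legal": "…text of section 4…",
}

questions = [
    "What is the product name?",
    "Who is the manufacturer?",
    "Which substances does the product contain?",
    "What legal mentions appear in the document?",
    # …remaining two data points…
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

section_names = list(sections)
section_vecs = embed([sections[name] for name in section_names])

for question in questions:
    q_vec = embed([question])[0]
    # Cosine similarity: pick the section closest to the question.
    sims = section_vecs @ q_vec / (
        np.linalg.norm(section_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    best = section_names[int(np.argmax(sims))]
    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided excerpt. "
                        "If the answer is not present, say so."},
            {"role": "user",
             "content": f"Excerpt:\n{sections[best]}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    print(question, "->", answer.choices[0].message.content)
```

Asking one question per call like this trades more (small) requests for tighter context, which tends to reduce hallucination at the cost of extra latency and per-call overhead.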

I’d also really appreciate it if you could share the strengths and weaknesses of each approach, especially in terms of inference cost (compute, time) and risk of hallucination.


Hope you’ve seen this announcement.


If the document content is easily available as text and it’s only three pages, I would include the full text in the prompt (with start/end delimiters). Getting the right answers that way should be no problem at all.
The solution mentioned above is very cool, but if you have simple PDFs, extracting the text yourself and adding it to the prompt is much cheaper token-wise.
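For example, a minimal sketch of that single-prompt approach with the OpenAI Python client, assuming the text has already been extracted to a file called `datasheet.txt`; the model name is an assumption, and the last two JSON keys are placeholders since only four of the six data points were named above:

```python
# Wrap the extracted text in explicit start/end delimiters and ask for all
# six fields in a single call, returned as JSON.
from openai import OpenAI

client = OpenAI()
full_text = open("datasheet.txt", encoding="utf-8").read()  # text extracted from the 3-page PDF

prompt = f"""Extract the following fields from the document between the markers.
Return JSON with exactly these keys: product_name, manufacturer, substances,
legal_mentions, field_5, field_6. Use null for anything not stated in the document.

### DOCUMENT START ###
{full_text}
### DOCUMENT END ###"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # constrain the reply to valid JSON
    temperature=0,
)
print(resp.choices[0].message.content)
```

Asking the model to answer with `null` rather than guess, and keeping the temperature at 0, are simple ways to keep the output grounded in the provided text.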


Convert the PDF to Word and extract the information more easily from the Word file.

You can do it with I Love PDF.
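If you go that route, here is a minimal sketch of pulling the text back out of the converted file with python-docx, assuming the conversion produced a file named `document.docx`:

```python
# Read the converted Word file and flatten its paragraphs into plain text
# that can be dropped straight into a prompt.
from docx import Document

doc = Document("document.docx")
full_text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())
print(full_text[:500])  # preview the first 500 characters
```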

Except the issue here is data extraction, manipulation, and visualization. The visual component translates most seamlessly, but OCR, while great, is not an effective solution for high-level data analytics or simplified graphical renderings.