Best practice to store PDF tables in a database and use OpenAI to answer questions

Hi Guys,
I’m working on a workflow where I use OpenAI models to read PDF documents and generate answers for a predefined set of questions.

My current approach is:

  1. Extract the PDF content

  2. Split the content into multiple chunks

  3. Store these chunks in a database

  4. Retrieve relevant chunks as context and pass them to OpenAI for answer generation
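For reference, the four steps above can be sketched roughly like this. The chunk size, overlap, and keyword-overlap scoring are illustrative choices only (a real setup would use embeddings and a vector store), and the final prompt is just a placeholder for the OpenAI call:

```python
# Minimal sketch of the extract -> chunk -> store -> retrieve flow.
# Chunk size/overlap and the scoring function are illustrative, not tuned.

def split_into_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into fixed-size chunks with a small overlap."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def retrieve(chunks: list[str], question: str, top_k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; a real system would use embeddings."""
    q_words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
    return scored[:top_k]

# Stand-in for text extracted from a PDF
document = "Revenue grew 12% in 2023. " * 30 + "Headcount stayed flat at 250."
chunks = split_into_chunks(document)          # step 2 (store these rows in step 3)
context = retrieve(chunks, "What happened to revenue in 2023?")  # step 4
prompt = "Answer using only this context:\n" + "\n---\n".join(context)
```

The `prompt` string is what would then be sent to the OpenAI model as context.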

This works reasonably well for plain text. However, I’m running into challenges when the PDF contains multiple tables (sometimes complex or spanning multiple pages).

My question is:

What is the best way to extract, represent, and store tables from PDFs as chunks so that LLMs can easily understand and reason over them during context retrieval?

When you say tables, do you mean images of tables?

Normal tables in PDF with multiple columns…

Hmm, I'm not sure how to handle this, because non-image tables in PDFs often look like tables but are fundamentally just graphical elements (text and lines) that mimic a table's appearance rather than encode a true data structure.
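That said, once a table has been recovered as rows of cells (for example via pdfplumber's `page.extract_tables()`, which returns lists of rows where empty cells are `None`), one option is to serialize each table as Markdown and store it as its own chunk, since LLMs tend to parse Markdown tables well. A minimal sketch, assuming the rows are already extracted:

```python
# Serialize an extracted table as a Markdown chunk for retrieval.
# Assumes rows of cells as produced by e.g. pdfplumber's extract_tables(),
# where missing/empty cells come back as None.
from typing import Optional

def table_to_markdown(rows: list[list[Optional[str]]]) -> str:
    header, *body = rows
    def fmt(row):
        # Treat None as empty, flatten embedded newlines within a cell
        return "| " + " | ".join((c or "").replace("\n", " ").strip() for c in row) + " |"
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(r) for r in body]
    return "\n".join(lines)

rows = [
    ["Region", "Q1", "Q2"],
    ["EMEA", "1.2", "1.4"],
    ["APAC", None, "0.9"],  # empty cell as pdfplumber would report it
]
print(table_to_markdown(rows))
```

For tables spanning multiple pages, you'd concatenate the row lists before serializing, and it can help to prepend a one-line description of the table (caption or surrounding heading) to the chunk so retrieval can match it.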