Table extraction for LangChain and vectorstore

Viellmo · September 26, 2023, 6:36am

Hello, I want to read information from my documents and share it in chat, but my problem is that my PDF files contain many tables. How can I deal with them, given that I think Chat is bad at reading them? I'm thinking about extracting them separately. Maybe you have an idea, or have you had the same problem? Extracting them from a PDF file is difficult, and they do not share a single structure; each one is different.

Viellmo · September 26, 2023, 6:42am

The tables look like this (it's only half of this one; the second part is on the next page)

ilianos1 · January 9, 2024, 10:49am

If you’re a programmer, you might want to have a look at pypdf or PyMuPDF.
Here’s a benchmark:
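
For the case described earlier in the thread, where a table continues on the next page, here is a minimal sketch of the PyMuPDF route. It assumes PyMuPDF 1.23 or newer, where `page.find_tables()` is available, and that the continuation page repeats the header row. The helper names (`extract_tables`, `merge_continuation`, `rows_to_markdown`) are made up for illustration, and detection quality on tables without a consistent structure will vary.

```python
from typing import List, Optional

Row = List[Optional[str]]


def clean(cell: Optional[str]) -> str:
    """Collapse whitespace in a cell and escape pipes; None becomes ''."""
    return " ".join((cell or "").split()).replace("|", "\\|")


def merge_continuation(first: List[Row], second: List[Row]) -> List[Row]:
    """Append the rows of a table that continues on the next page.

    Assumption: if the continuation repeats the header row, drop the repeat.
    """
    if not first:
        return list(second)
    if second and [clean(c) for c in second[0]] == [clean(c) for c in first[0]]:
        second = second[1:]
    return list(first) + list(second)


def rows_to_markdown(rows: List[Row]) -> str:
    """Render rows (first row = header) as a Markdown table string."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    # Pad short rows so every line has the same number of columns.
    norm = [[clean(c) for c in r] + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(norm[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in norm[1:]]
    return "\n".join(lines)


def extract_tables(pdf_path: str) -> List[List[Row]]:
    """Return every table PyMuPDF detects, in page order (needs PyMuPDF >= 1.23)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    tables = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                tables.append(table.extract())  # list of rows, cells may be None
    return tables
```

The Markdown string for each merged table can then be stored as its own chunk in the vectorstore, separate from the surrounding text. Deciding which tables belong together across pages is left to the caller, since that depends on the documents.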
