I am just curious.
Every time I upload PDF files to ChatGPT, I've noticed that it either summarizes or analyzes the content before processing my queries.
I am currently building my own RAG pipeline on the gpt-4o model, but its performance is pretty poor.
The files being uploaded are complex PDFs with many tables (they are also Korean documents, not English). I first thought the problem was the language difference, but I ruled that out as soon as I tried the same files in ChatGPT's RAG.
Do I have to use GPT vision to extract the tables and add a middle layer for summarization?
Does anybody have good suggestions or ideas? I just want to replicate the current RAG architecture of ChatGPT.
Yes, you will need to extract anything that is not text with your own pipeline. It's one of the complexities of self-built RAG pipelines; even many commercial ones will not make use of tables that are images.
ChatGPT makes use of Microsoft AI Search; you might want to look into that as part of your pipeline.
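As a rough sketch of what that extraction step could look like (this is one workable approach, not how ChatGPT actually does it): render each PDF page to an image, send it to gpt-4o, and ask for the tables back as markdown text you can chunk and index. The page-to-PNG rendering is left out here; `pdf2image` is one common option.

```python
# Minimal sketch: transcribe a table image into markdown with gpt-4o,
# so the result can be chunked and embedded like ordinary text.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def table_image_to_markdown(png_bytes: bytes) -> str:
    """Ask gpt-4o to transcribe any tables in a page image as markdown."""
    b64 = base64.b64encode(png_bytes).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Transcribe every table in this image as a "
                          "markdown table, one row per line. Keep the "
                          "original Korean text as-is.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```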
Thanks for the reply.
I can extract the text from the tables directly when I parse the PDF, but the extractor reads them by columns, not by rows. For example, suppose there is a table that looks like this:
| title1 | text1 |
| title2 | text2 |
| title3 | text3 |
| title4 | text4 |
If I extract the text directly from the PDF, it gives me:
title1
title2
title3
title4
text1
text2
text3
text4
This messes up the data when I chunk it, since the text relevant to title1 is text1.
The only way I can chunk the table text as
title1 text1\n
title2 text2\n
title3 text3\n
title4 text4\n
is to extract the tables in their original form using third-party libraries that rely on tesseract-ocr, etc., right?
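For what it's worth, here is the row-wise shape I'm after, sketched with pdfplumber. My understanding is that it reads tables row by row when the PDF has a real text layer, with OCR like tesseract only needed for scanned pages:

```python
# Sketch: row-wise table extraction from a text-based PDF with pdfplumber.
# Assumption: the PDF has an embedded text layer; scanned pages would
# still need OCR (e.g. tesseract) first.
import pdfplumber

def tables_as_row_chunks(pdf_path: str) -> list[str]:
    """Return one 'title text' line per table row, ready for chunking."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:  # each row is a list of cell strings
                    cells = [c.strip() for c in row if c]
                    if cells:
                        chunks.append(" ".join(cells))
    return chunks

# Example: yields ["title1 text1", "title2 text2", ...] for the table above.
print(tables_as_row_chunks("report.pdf"))
```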