I am just curious.
Every time I upload PDF files to ChatGPT, I've noticed that it either summarizes or analyzes the content before processing my queries.
I am currently building my own RAG pipeline on the gpt-4o model, but its performance is pretty poor.
The files being uploaded are complex PDFs with many tables (they are also Korean documents, not English). I first thought the problem was the language difference, but I ruled that out as soon as I tried the same files in ChatGPT's RAG.
Do I have to use GPT vision to extract the tables and add a middle layer for summarization?
Does anybody have good suggestions or ideas? I just want to replicate the current RAG architecture of ChatGPT.
Yes, you will need to extract anything that is not text with your own pipeline. It's one of the complexities of self-built RAG pipelines; even many commercial ones will not make use of tables that are images.
ChatGPT makes use of Microsoft AI Search; you might want to look into that as part of your pipeline.
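As a rough sketch of what that extraction step could look like (this is one workable approach, not how ChatGPT actually does it): render each PDF page to an image, send it to gpt-4o, and ask for the tables back as markdown text you can chunk and index. The page-to-PNG rendering is left out here; `pdf2image` is one common option.

```python
# Minimal sketch: transcribe a table image into markdown with gpt-4o,
# so the result can be chunked and embedded like ordinary text.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def table_image_to_markdown(png_bytes: bytes) -> str:
    """Ask gpt-4o to transcribe any tables in a page image as markdown."""
    b64 = base64.b64encode(png_bytes).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Transcribe every table in this image as a "
                          "markdown table, one row per line. Keep the "
                          "original Korean text as-is.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```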
Thanks for the reply.
I can extract the text from the tables directly when I parse the PDF, but the extractor reads them by columns, not by rows. For example, suppose there is a table that looks like this:
| title1 | text1 |
| title2 | text2 |
| title3 | text3 |
| title4 | text4 |
If I extract the text directly from the PDF, it gives me:
title1
title2
title3
title4
text1
text2
text3
text4
This messes up the data when I chunk it, since the text relevant to title1 is text1.
The only way I can chunk the table text as
title1 text1\n
title2 text2\n
title3 text3\n
title4 text4\n
is to extract the tables in their original form using third-party libraries that rely on tesseract-ocr, etc., right?
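For what it's worth, here is the row-wise shape I'm after, sketched with pdfplumber. My understanding is that it reads tables row by row when the PDF has a real text layer, with OCR like tesseract only needed for scanned pages:

```python
# Sketch: row-wise table extraction from a text-based PDF with pdfplumber.
# Assumption: the PDF has an embedded text layer; scanned pages would
# still need OCR (e.g. tesseract) first.
import pdfplumber

def tables_as_row_chunks(pdf_path: str) -> list[str]:
    """Return one 'title text' line per table row, ready for chunking."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:  # each row is a list of cell strings
                    cells = [c.strip() for c in row if c]
                    if cells:
                        chunks.append(" ".join(cells))
    return chunks

# Example: yields ["title1 text1", "title2 text2", ...] for the table above.
print(tables_as_row_chunks("report.pdf"))
```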