Seeking Advice: Uploading Large PDFs for Analysis with GPT-3 API

izaias.sferreira · August 18, 2023, 6:12pm

Hello there,

I came across the website “chatpdf .com” and noticed its capability to upload PDF files, allowing the GPT model to analyze content, retain information, and provide answers to queries regarding the documents.

In my attempt to replicate this functionality, I tried transcribing PDFs and uploading the transcriptions to the GPT-3 API using both Chat Completions and Completions. I divided the documents into segments to avoid exceeding the token limit per message (using Chat Completions). However, when I attempted to upload a substantial book, I encountered an error.

As I am posing this question on the forum, I currently do not have access to my code, and I am unable to recall its current status. After some research, I realized that I might have been uploading too much text at once. It seems there is a limitation on the number of tokens per context. This led me to wonder how the mentioned website manages to upload a significant amount of information without surpassing the token limit within a conversation.

I have been grappling with this issue for quite some time and have not come across any definitive solutions. Would the optimal approach involve fine-tuning? Honestly, I’m unsure. Could someone provide guidance on this matter? I would greatly appreciate any assistance.

novaphil · August 18, 2023, 6:15pm

“Chat over documents” is generally handled by splitting the document into chunks, creating embeddings on the chunks, and storing them in a vector database. Then on a user query, create embedding of the query, perform a similarity search in the vector database to retrieve the chunks, then put those relevant chunks in the prompt. There’s a bunch of tutorials on YouTube and elsewhere on how to do this. LangChain is likely the easiest way to get started with a proof-of-concept (although wouldn’t recommend using it for a large-scale production app).

noor.khalifa · October 17, 2023, 6:58am

I totally understand this logic. However, what if a company wants to chat with their documents but do not want to store text along with the embeddings in the vector database like Pinecone? So basically, only vectors of text will be stored in Pinecone.

What is the proper approach for this use-case now that text cannot be embedded in the prompt?

Foxalabs · October 17, 2023, 8:08am

Well, you need the text to be “somewhere”. I’m assuming you wish to keep the text itself on your internally secure databases, in which case you would make an index lookup table for your data and embed that index’s value along with the vector, you can then lookup that index locally and recreate which text block created that vector.

noor.khalifa · October 17, 2023, 10:15am

so in the end in any case, the LLM must receive the text of relevant docs in the prompt so that it forms its answer, right?

Foxalabs · October 17, 2023, 10:19am

Yes, the models are “stateless” so they start every API call with no knowledge of the prior conversation. Think of starting a conversation with a new stranger every time you speak, the only way you get them to understand things you’ve said in the past is to tell them everything each time, same with the GPT language models.

sergeliatko · October 17, 2023, 2:17pm

You can use a private instance of Weaviate database on your servers to securely store the data.

Topic		Replies	Views
Sending large document via API call and asking for a question over complete document? Prompting api	3	1795	February 26, 2024
Is there any way by which I can let GPT-4 API summarize large PDF texts? API gpt-4 , api	10	11446	May 6, 2024
Retrieval Augmented Generation (RAG) with 100k PDFs?! Too slow! Community pdf , llm , rag , development	13	24989	October 31, 2024
Use case: asking questions about a specific document API	7	2356	June 12, 2023
Answering questions about text file content API	5	9020	December 15, 2023

Seeking Advice: Uploading Large PDFs for Analysis with GPT-3 API

Related topics