I have a very large PDF document. When I extract its contents using the PyPDF library and send them over the ChatCompletion API to GPT-4, I get a 'tokens exceeded' error. I believe the maximum context size is approximately 8,000 tokens for the GPT-4 variant I am using.
Is there a way to break this document into multiple chunks and send them in sequence through the API, so that the model remembers the complete document at the end and I can then ask a question about the whole document through the ChatCompletion API?
P.S. I am new to this technology and these APIs, so please excuse me if this is a basic question.
Hi @shripati007 - welcome! Before looking at more complex solutions, is anything holding you back from using the GPT-4-turbo model, which has a context window of 128k (equivalent to 300+ pages)? Given you are already using GPT-4, this would likely be the easiest solution.
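If you do switch, it is mostly a matter of changing the model name in your ChatCompletion call. Here is a rough sketch using the openai Python package (v1.x) together with pypdf - the file name, the question, and the exact turbo model name are placeholders for whatever you have access to:

```python
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Extract the full text of the PDF (file name is just an example)
reader = PdfReader("my_document.pdf")
document_text = "\n".join(page.extract_text() or "" for page in reader.pages)

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # placeholder: use whichever 128k GPT-4-turbo snapshot you have access to
    messages=[
        {"role": "system", "content": "Answer questions using only the provided document."},
        {"role": "user", "content": f"Document:\n{document_text}\n\nQuestion: <your question here>"},
    ],
)
print(response.choices[0].message.content)
```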
As for your idea with chunks: the API is stateless, so each call is treated independently and the model won't remember the pieces you sent earlier. However, there are (likely) still ways for you to generate an answer through an iterative mechanism in the event that you can't switch to GPT-4-turbo.
Hi @jr.2509 - Thanks for your useful response. The thing is, we should soon receive access to the GPT-4-turbo model with the 128k-token context you mentioned, and that would largely cover the current size of my document. The challenge is that the document is expected to grow every quarter as new content is added incrementally, so I was looking for a viable long-term solution that can scale with this growing document over time. If you have any such solution in mind, please let me know. Thanks again!
Using embeddings: you could chunk the document, create an embedding for each chunk, and then use embeddings-based search to retrieve the most relevant chunks and answer the question from those. If the base document stays the same, you only need to create embeddings for the incremental additions and add them to your existing store of embeddings. If the whole document changes, you'd have to re-create the embeddings each time.
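A rough sketch of what that could look like with the openai Python package and numpy - the chunk size, embedding model, number of retrieved chunks, file name, and question are all assumptions you would adapt to your document:

```python
import numpy as np
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of texts with the OpenAI embeddings endpoint."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Extract and chunk the document (naive fixed-size split; ~3000 characters is just an example)
document_text = "\n".join(page.extract_text() or "" for page in PdfReader("my_document.pdf").pages)
chunk_size = 3000
chunks = [document_text[i:i + chunk_size] for i in range(0, len(document_text), chunk_size)]

# 2. Embed all chunks once and store the vectors; for incremental additions,
#    embed only the new chunks and append them to this store.
chunk_vectors = embed(chunks)

# 3. At question time, embed the question and pick the most similar chunks (cosine similarity)
question = "What does the report say about Q3 revenue?"  # example question
q_vec = embed([question])[0]
similarities = chunk_vectors @ q_vec / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec)
)
top_k = similarities.argsort()[-3:][::-1]  # indices of the 3 best-matching chunks
context = "\n\n".join(chunks[i] for i in top_k)

# 4. Ask the model using only the retrieved excerpts
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using only the provided excerpts."},
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```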
The chunking approach: you feed a chunk of the document as part of the prompt along with the question. You keep the answer in a variable and feed it into the next API call as part of the prompt, asking the model to refine the answer based on the new information (i.e. your next chunk). You repeat that process for all your chunks, and at the end you should have an answer that has considered all of the information. There's a risk that you may lose some context along the way, but that depends somewhat on the nature of your information and questions.
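And a rough sketch of that refine loop - again, the file name, chunk size, question, and prompt wording are just illustrative:

```python
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

# Extract and chunk the document (file name and chunk size are examples)
document_text = "\n".join(page.extract_text() or "" for page in PdfReader("my_document.pdf").pages)
chunk_size = 3000
chunks = [document_text[i:i + chunk_size] for i in range(0, len(document_text), chunk_size)]

question = "What does the report say about Q3 revenue?"  # example question
answer = "No answer yet."

# Feed the chunks one by one, asking the model to refine the running answer each time
for chunk in chunks:
    prompt = (
        f"Question: {question}\n\n"
        f"Current answer: {answer}\n\n"
        f"New excerpt from the document:\n{chunk}\n\n"
        "Refine the current answer using the new excerpt. "
        "If the excerpt adds nothing relevant, return the current answer unchanged."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content

print(answer)
```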