How to read large files using the OpenAI API?

I am building a chatbot using the OpenAI API with the gpt-3.5-turbo-16k model. I want it to read large files such as PDF, Word, Excel, CSV, and text files. I tried uploading a large PDF of nearly 8 MB, but it was not accepted. I then compressed the file and the upload succeeded, but when I started asking questions about it, the bot could not respond because the token limit was exceeded. How do I deal with this?

You need to read about the AI models’ context window length limitation.

This understanding is fundamental to programming an AI and understanding why you can’t do what you describe.

So you are saying there is no way to read large files through a chatbot?

Not unless it’s you who built the bot and used a database to retrieve shorter pieces of context from the documents to answer user requests…

You have no choice but to chunk your large document. You can give the model about a page at a time, and that is it. There are a lot of ideas on how to do this.

You can talk to the big vector analysis guys,
or find a way to keyword-map the section of the document being asked about.

Until OpenAI starts doing this themselves, it is on us to do it.
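For anyone who wants to see what the chunking step could look like, here is a minimal sketch. It assumes the file has already been converted to plain text (PDF/Word/Excel extraction is a separate step not shown here). The function name chunk_text and the sizes of 300 words with 50 words of overlap are illustrative choices, not values from the OpenAI docs.

```python
def chunk_text(text, chunk_words=300, overlap_words=50):
    """Split text into overlapping chunks of at most chunk_words words.

    chunk_words and overlap_words are illustrative defaults; tune them
    so that one chunk plus your prompt fits the model's context window.
    """
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")

    words = text.split()
    step = chunk_words - overlap_words  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once the window has reached the end of the document.
        if start + chunk_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for piece in chunk_text(sample, chunk_words=4, overlap_words=1):
        print(piece)
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.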

I have been happy using word mapping for my use, but it doesn’t lend itself to gigantic documents.

I suspect the vector analysis chunking works, but I wasn’t impressed when I used it.

I still think keyword mapping is the key :key: I am just not sure how yet.
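One possible shape for the keyword mapping idea is a small inverted index: map each word to the chunks that contain it, then pick the chunks that share the most words with the question. This is only a sketch under my own assumptions. The names build_keyword_map and find_chunks and the tiny stopword list are hypothetical, and matching is exact (no stemming), so "refund" will not match "refunds".

```python
import re
from collections import defaultdict

# A deliberately tiny, illustrative stopword list.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "what", "how"}


def tokenize(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def build_keyword_map(chunks):
    """Map each keyword to the set of chunk indexes that contain it."""
    index = defaultdict(set)
    for i, chunk in enumerate(chunks):
        for word in tokenize(chunk):
            index[word].add(i)
    return index


def find_chunks(question, index, top_k=3):
    """Return indexes of the chunks sharing the most keywords with the question."""
    hits = defaultdict(int)
    for word in set(tokenize(question)):
        for i in index.get(word, ()):
            hits[i] += 1
    # Most shared keywords first; ties broken by document order.
    ranked = sorted(hits, key=lambda i: (-hits[i], i))
    return ranked[:top_k]


if __name__ == "__main__":
    chunks = [
        "Refunds are issued within 30 days.",
        "Shipping takes five business days.",
        "Contact support for refund status.",
    ]
    index = build_keyword_map(chunks)
    print(find_chunks("How long does shipping take?", index))
```

Only the chunks returned by find_chunks would then be pasted into the prompt, which keeps the request small no matter how big the document is.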


Any idea how to do it, like dividing large files into multiple small chunks and applying vector embeddings to retrieve only the relevant information?

You need to split it into chunks somehow. There are a few options. You can do a vector analysis and find the relevant text that way.

Or you can keyword-map them.

I suspect there are others.
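To make the vector option concrete, here is a rough sketch of the retrieval flow: embed every chunk once, embed the question, and keep the chunks with the highest cosine similarity. Big assumption: toy_embed below is a stand-in that just counts words so the example runs offline. In a real bot you would replace it with a call to an actual embedding model and adapt cosine to the dense vectors it returns. The names build_index and search are hypothetical too.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks, embed=toy_embed):
    """Embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(question, index, embed=toy_embed, k=2):
    """Return up to k chunks most similar to the question, best first."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, vec), pos) for pos, (_, vec) in enumerate(index)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Drop chunks with no similarity at all.
    return [index[pos][0] for score, pos in scored[:k] if score > 0]


if __name__ == "__main__":
    chunks = [
        "The warranty lasts two years.",
        "Shipping takes five business days.",
        "Returns need a receipt.",
    ]
    index = build_index(chunks)
    print(search("How long is the warranty?", index))
```

The chunks that come back are what you send to the chat model along with the user's question, instead of the whole file.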

But most people just cut it down and stay under the 8,192-token limit, or pay extra for the 16K model.

Still, that is not enough for larger documents.
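Whatever limit your model has, you can guard against going over it before sending the request. The sketch below uses the common rule of thumb of roughly 4 characters per token for English text, which is only an approximation. A real tokenizer library such as tiktoken would give exact counts. The names estimate_tokens and fit_to_budget are hypothetical.

```python
def estimate_tokens(text):
    """Rough token estimate: about 4 characters per token (approximation)."""
    return (len(text) + 3) // 4  # round up


def fit_to_budget(chunks, max_tokens):
    """Keep chunks in order until adding the next one would exceed max_tokens."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


if __name__ == "__main__":
    ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]  # pretend these are ranked by relevance
    kept = fit_to_budget(ranked_chunks, max_tokens=25)
    print(len(kept), "chunks fit in the budget")
```

Remember to leave room in the budget for the question itself and for the model's answer, since both count against the same context window.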

Chunking is your answer for now. They are expanding the input limit all the time, and I suspect the limit on their own chat is 32K.