Best way to process a PDF file that has over 100k lines?

Hello, I was looking into building a project that lets users investigate long PDF files, and I ran into a roadblock.

The PDF is very long, ~100k lines of text.
I chunk the data, but it’s not helping, because on every request I need to provide all the data for the API to know the context.

So the idea would be this: the ChatGPT API would be aware of the full context of the PDF,
and the user could ask questions related to the PDF.

So the problem now is that the PDF is so long that it won’t fit within the maximum allowed token count.

I have tried the LangChain framework and it kind of worked, but the requests became very expensive: I think I made 3 requests and the cost was up to $0.60 (not doable).

What other approach could I use?

I was looking into ‘embeddings’ but failed to understand how I should apply them. I think they might help with this long-context problem and with the high costs.

Any help would be amazing…


There are pre-made systems out there with names like Chat with PDF and PDF chat, built with LangChain and embeddings. So, unless you wish to do this for learning purposes, there is no need to get to grips with embeddings.

Embeddings are essentially a way of representing text as numeric vectors, so that pieces of text with similar meaning end up close to each other. You store the vectors for your chunks in a database; the system can then pull back the chunks that are “similar” to your question and pass only those to the AI as context, instead of the whole document.
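To make the retrieval idea concrete, here is a minimal sketch of "find the chunks most similar to the question". Note the assumptions: `embed` below is a toy bag-of-words stand-in that I made up so the example runs offline; in a real setup you would replace it with a call to an actual embedding model and keep the vectors in a database. The names `build_index` and `top_k` are also just illustrative.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model (assumption): word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Embed every chunk once, up front, and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def top_k(index, question, k=2):
    """Return the k chunks whose vectors are closest to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical chunks standing in for pieces of the PDF.
chunks = [
    "Invoices must be paid within 30 days of receipt.",
    "The warranty covers manufacturing defects for two years.",
    "Shipping is free for orders above 50 euros.",
]
index = build_index(chunks)
context = top_k(index, "How long is the warranty?", k=1)
print(context)
```

Only the chunks returned by `top_k` would go into the prompt, which is what keeps each request under the token limit and cheaper than sending the whole document.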

Thank you for your response.

The need is to have an API around it, as the data will later be needed to generate content within a CMS from the data in the PDF for certain topics.

I have tried the LangChain framework, and it works; however, the API calls become very slow, and it can take up to 5 min to get the result of a query.

Would it be faster to use the Embeddings API? Is the data then stored within OpenAI, or is it something I should store in my own database and then just include when making requests to the Chat API?

The topic of embeddings is a large one. To give you a head start, I suggest checking out the documentation on the OpenAI Platform and perhaps a few videos; a good one is “OpenAI's New GPT 3.5 Embedding Model for Semantic Search” on YouTube.


LlamaIndex is purpose-built for this.

Use a programming language like Python with libraries such as PyPDF2 or PyMuPDF for efficient PDF parsing and processing. Implement a streaming approach to handle the large file, reading and processing lines one at a time to optimize memory usage.
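A rough sketch of that streaming idea, assuming PyPDF2 (version 2.0 or later, where `PdfReader`, `reader.pages` and `page.extract_text()` are available) is installed. The function names `iter_pdf_lines` and `chunk_lines` and the `max_chars` limit are my own illustrative choices, not something from a library. Both functions are generators, so only one page of text and one chunk are held in memory at a time.

```python
def iter_pdf_lines(path):
    """Yield non-empty text lines from a PDF, one page at a time.

    Assumes PyPDF2 >= 2.0 is installed; imported lazily so the
    chunking helper below can be used and tested without it.
    """
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    for page in reader.pages:
        text = page.extract_text() or ""  # guard against pages with no text
        for line in text.splitlines():
            line = line.strip()
            if line:
                yield line


def chunk_lines(lines, max_chars=1000):
    """Group consecutive lines into chunks of at most roughly max_chars."""
    buf, size = [], 0
    for line in lines:
        # Flush the current chunk if adding this line would exceed the limit.
        if buf and size + len(line) + 1 > max_chars:
            yield " ".join(buf)
            buf, size = [], 0
        buf.append(line)
        size += len(line) + 1  # +1 for the joining space
    if buf:
        yield " ".join(buf)


# Small in-memory demo of the chunker (no PDF needed).
demo_chunks = list(chunk_lines(["aaaa", "bbbb", "cccc"], max_chars=10))
print(demo_chunks)
```

With a real file, something like `for chunk in chunk_lines(iter_pdf_lines("report.pdf")): ...` (the file name is a placeholder) would feed chunks one by one into whatever embedding or indexing step comes next.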