Using large PDFs to make a ChatBot

With a PDF that big, embeddings are probably still the best option. Can you give a quick rundown of how you processed the pages and what the matching criterion was?


Are there also learning paths included or just courses?

If it's only courses, you should split the PDFs into single pages, then add some overlap (maybe 1/3 of a page) so that, as a first step, you can find where each course begins and ends.

Then embed the courses one after the other.
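A minimal sketch of the page-plus-overlap idea, assuming each page is already available as a plain string. The function name `pages_with_overlap` and measuring the overlap in characters are assumptions for illustration; the thread does not say how the overlap should be measured.

```python
def pages_with_overlap(pages, fraction=1 / 3):
    """Return one chunk per page, each extended with the start of the next page."""
    chunks = []
    for i, page in enumerate(pages):
        chunk = page
        if i + 1 < len(pages):
            next_page = pages[i + 1]
            # Assumption: overlap is measured in characters of the next page.
            cut = int(len(next_page) * fraction)
            chunk = page + "\n" + next_page[:cut]
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    demo = ["page one text", "page two text, a bit longer", "page three"]
    for chunk in pages_with_overlap(demo):
        print(repr(chunk))
```

With the overlap in place, a course that starts near the bottom of one page and ends at the top of the next appears whole in at least one chunk, as long as its tail fits inside the overlap.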

Although I think there are better ways: Elasticsearch, course classification, and a long prompt with the course categories and course metadata.

Then take the courses that came through a first GPT classification and go deeper on that data with sub-classifications. Speaking of knowledge graphs.


I turn the PDF into text and then chunk it page by page. I found a great video while browsing other posts in this forum. If you search for “GPT-4 Tutorial: How to Chat With Multiple PDF Files (~1000 pages of Tesla’s 10-K Annual Reports)” on YouTube you should be able to find it (sorry, I cannot post links yet).


This is a whole lot of new information to try out. Thank you!

There are in fact learning paths (I call them sample plans of study). These plans are tables in the PDF, but when the document is turned into text it all goes sideways (it doesn’t help that the original PDF has 2 columns). What would you recommend in this scenario?

You propose some great ideas, I will test them out as soon as I can.


Use Ghostscript to make TIFFs from the PDF, then Tesseract to create hOCR from the TIFFs.

Read about hOCR.
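A rough sketch of that pipeline driven from Python, assuming the `gs` and `tesseract` executables are installed and on the PATH. The helper names, the 300 dpi resolution, the `tiffgray` device choice and the `page-%04d.tif` naming are assumptions for illustration, not settings given in this thread.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a grayscale TIFF."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffgray",
        f"-r{dpi}",
        f"-sOutputFile={Path(out_dir) / 'page-%04d.tif'}",
        str(pdf_path),
    ]


def tesseract_cmd(tiff_path):
    """Build a Tesseract command that writes <image stem>.hocr next to the image."""
    tiff_path = Path(tiff_path)
    out_base = tiff_path.with_suffix("")  # Tesseract appends ".hocr" itself
    return ["tesseract", str(tiff_path), str(out_base), "hocr"]


def pdf_to_hocr(pdf_path, out_dir):
    """Render every page to TIFF, then OCR each TIFF into an hOCR file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf_path, out_dir), check=True)
    for tiff in sorted(out_dir.glob("page-*.tif")):
        subprocess.run(tesseract_cmd(tiff), check=True)
```

Calling `pdf_to_hocr("catalog.pdf", "out")` (with a hypothetical `catalog.pdf`) would leave one `.tif` and one `.hocr` file per page in `out/`.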


Looks very promising, thank you.

And your initial message implied that you would do something differently if sample paths were included?

I've been working on that for a couple of months.
So yeah, differently. A little :sweat_smile:


Hey, did everything you described. I now have 1500 separate .hocr files. What do I do with them? Convert to text?


Run them through GPT to identify the end of a course.

hOCR is a good format for GPT to understand.

You may try hocr2yaml as well to save tokens.
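To illustrate why stripping the markup saves tokens, here is a small standard-library sketch that reduces hOCR to just the words and their bounding boxes. It is not a reimplementation of hocr2yaml; the class and function names are made up for this example, and it assumes Tesseract-style `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect the text and bounding box of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                bbox = tuple(int(n) for n in match.groups())
                self._current = {"text": "", "bbox": bbox}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr_text):
    """Return a list of {'text': str, 'bbox': (x0, y0, x1, y1)} dicts."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

The resulting list can be serialized compactly (for example one word per line with its coordinates) before it is sent to the model.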


Sorry to ask too many questions. But I believe this method will not work for me. The problem is that there are two columns, and alphabetical output turns into a mess. Here is the example:

Ancient Near Eastern and Egyptian Studies, BA… 284
Mathematics, PAD eee SIS
Ancient Near Eastern and Egyptian Studies, Minor … 281
Statistics, BA eee net JIG
Classical Studies, MA… 2E2
Statistics, BS SIE
Classics, BA QE
Statistics, MINCK SIE
Classics, MING? 2ES
Statistics, MS BIB
Cognitive Science 2A
Modern Languages and Literatures STD
Cognitive Linguistics, MA BBS
Chinese, BA B24
Cognitive Science, BA BBD

Maybe I’m just bad, but I think either the OCR is messing up, or I’m prompting GPT incorrectly. Also the files are huge, most exceed 16k tokens and more than half exceed 32k (just for input).
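One possible way to untangle the interleaved columns, which nobody in this thread confirmed using, is to rely on the bounding boxes that hOCR provides: assign each line to the left or right column by its horizontal centre, then read each column top to bottom. The sketch below assumes exactly two columns with the gutter at the middle of the page, and it assumes lines have already been extracted as dictionaries with `text` and `bbox` keys (both names are made up for this example). It does not fix misread characters such as the garbled page numbers.

```python
def order_two_columns(lines, page_width):
    """Return line texts in reading order: left column first, then right column.

    Each line is a dict with 'text' and 'bbox' = (x0, y0, x1, y1) in pixels.
    Assumption: two columns split at the horizontal middle of the page.
    """
    middle = page_width / 2
    left, right = [], []
    for line in lines:
        x0, _, x1, _ = line["bbox"]
        centre = (x0 + x1) / 2
        (left if centre < middle else right).append(line)
    # Within a column, sort by the top edge of the bounding box.
    left.sort(key=lambda line: line["bbox"][1])
    right.sort(key=lambda line: line["bbox"][1])
    return [line["text"] for line in left + right]


if __name__ == "__main__":
    demo = [
        {"text": "Classics, BA", "bbox": (50, 100, 400, 120)},
        {"text": "Statistics, BS", "bbox": (550, 100, 900, 120)},
        {"text": "Classics, Minor", "bbox": (50, 130, 400, 150)},
        {"text": "Statistics, MS", "bbox": (550, 130, 900, 150)},
    ]
    print(order_two_columns(demo, page_width=1000))
```

Real catalog pages may have headers that span both columns or a gutter that is not centred, so the split point would likely need tuning per document.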

Found your post here: “API for Advanced Data Analytics - #2 by jochenschultz”. Will try it out.



I am working on a project that requires a large PDF to be processed and exact data to be retained. I want to take a course catalog (1500 pages) and turn it into a chatbot that you can ask questions related to courses. Currently my problem is that with a large number of pages the answers tend to be inaccurate.

Hey! Trying to do something similar, how did you end up solving this problem? Would appreciate a quick rundown of what worked for you :slight_smile:

Hey. Ended up scraping the webpages instead, since it was much easier to work with tabular data. After successfully preparing the data, I just used RAG. Let me know if you have any more questions.

Got you. In my case I’m dealing with a few PDFs of around 500 pages each. The thing is that while it’s mostly text, some parts contain important information in the form of tables and/or images. How do I make RAG effective in this case?

For what I’m trying to do I gotta make sure that the retrieval is super efficient. Would you have any advice? :slight_smile:

I am also attempting to use the Assistants API to create a chatbot that can provide information on a catalog with more than 40,000 items. The issue is that with gpt4-preview, I am reaching a cost of $7-8 for 4 hours of use in a single session… I need to use it for 10 sessions over 8 hours… the cost is too high. I am trying to figure out if it’s my chatbot that is doing something wrong or if this is just how it is for now and I need to wait and hope for a change.


With that many items I can tell you that it’s likely just how much it costs.

This is the most advanced LLM in the world, and is priced as such right now. Hopefully we will continue to see the price drop as they scale up their systems to accommodate all the new users.

Check out Weaviate. If you are working with indexed items you will have many more liberties, tools, and functionality at much lower cost.

Retrieval, in my opinion, should only be used for unstructured text, and it is hella expensive.


Did you find any way to do it? I’m in the same situation.

You should properly parse tabular data. There are definitely packages for that. After that you can go multiple ways:

  1. You can just put the tabular data in your vector DB (make sure each table is contained within one chunk, together with its header).
  2. You can take each value of the table and generate an explanation for it. Then the LLM is going to know exactly what each value means.
  3. Use a knowledge graph, something like Neo4j. This is going to make sure that each value in your table (and other data in general) has a relation to all other data.

Depending on your needs, one of these is going to work better than the others.
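A small sketch of the first two options, assuming the table has already been parsed into a header list and row lists. The function names and the sample plan of study are hypothetical, and option 2 is shown with a fixed sentence template rather than LLM-generated explanations, which is a simplification.

```python
def table_to_chunk(title, header, rows):
    """Option 1: serialize a whole table, header included, as a single chunk."""
    lines = [title, " | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def table_to_sentences(title, header, rows):
    """Option 2: one plain sentence per cell, keyed by the first column."""
    sentences = []
    for row in rows:
        key = row[0]
        for column, value in zip(header[1:], row[1:]):
            sentences.append(f"In '{title}', the {column} for {key} is {value}.")
    return sentences


if __name__ == "__main__":
    # Hypothetical data, not taken from any real catalog.
    title = "Sample plan of study"
    header = ["Course", "Semester", "Credits"]
    rows = [["STAT 101", "Fall", 3], ["MATH 201", "Spring", 4]]
    print(table_to_chunk(title, header, rows))
    for sentence in table_to_sentences(title, header, rows):
        print(sentence)
```

Keeping the header in the same chunk as the rows means a retrieved chunk is still interpretable on its own, while the per-cell sentences trade more storage for text that embeds closer to natural-language questions.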