Retrieval Augmented Generation (RAG) with 100k PDFs?! Too slow!

Hello everyone,

I am Matteo. I am new to this forum and I am looking for a suggestion or solution for my AI project. The idea is to give a large language model thousands of English PDFs (around 100k, all about the same topic) and then be able to chat with it.

I followed several tutorials about RAG. They use HuggingFace to download the LLM. Unfortunately, when I asked the model (Zephyr-7b) something, it took almost 10 minutes to reply to one question :frowning: Moreover, it sometimes gave a sort of “hallucination” (for example, the title of the PDF is correct, but it gives erroneous years or URLs). Is it too much information for the model (for testing, I am using just 500 PDFs for now)? Is the chunk size wrong (I am using chunk_size=1000, chunk_overlap=0)?

I also tried adding a prompt template, without any luck at all.

Finally, I discovered that RAG for deployment is too difficult and expensive.
Do you have any suggestions, please? Could fine-tuning be the solution?
I am using Python 3.10 and a cluster with 1 GPU (32 GB of RAM) and 200 GB of CPU RAM available.

Thank you so much for your help and time! :slight_smile:


Ok, first of all you need to pre-process your documents. Stop. Using PDFs. That’s the first step. You may be able to get away with it if the text isn’t baked into the image. Even then it is messy, as the text is read line-by-line and not organized the way it intuitively seems.

Second, if the differences between these 100k documents are subtle, then you should instead run them all through a large-context model, while respecting the degradation across the context window that is found through empirical testing (search “needle in the haystack llm” for context). You should use the model to distill these documents.

THEN, with hopefully a much smaller, distilled version of your documents in preferably something like Markdown, you can begin performing analysis on it to understand “how” the model is viewing it. Is the model clustering certain documents together? Is it better to combine them? Is it failing to capture specific, important keywords/terminologies?

You can’t just “send it” with some nasty abstracted-to-nothingness RAG framework like OpenAI Retrieval unless you are happy spending way too much money for very little in the way of results and large latency. You can send it if you properly process the documents and THEN use something like retrieval. With 100k documents though you need to put in some effort that doesn’t involve pushing random buttons and moving dials.

Finally, you can use a smaller model to parse the returned content from embeddings and A) return the filtered content or B) determine that it’s insufficient and request more.
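A minimal sketch of that last filtering step, under stated assumptions: `filter_retrieved` and `keyword_judge` are hypothetical names, and `keyword_judge` is only a crude keyword stand-in for the smaller model, which in practice would be a call to whatever small LLM you run.

```python
from typing import Callable, Dict, List


def keyword_judge(question: str, chunk: str) -> bool:
    """Hypothetical stand-in for a small model: a chunk counts as relevant
    if it contains any of the longer words from the question."""
    q_words = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    return any(w in chunk.lower() for w in q_words)


def filter_retrieved(
    question: str,
    chunks: List[str],
    is_relevant: Callable[[str, str], bool] = keyword_judge,
    min_keep: int = 2,
) -> Dict[str, object]:
    """A) keep only the chunks the judge accepts, and
    B) flag when too few survive so the caller can request more."""
    kept = [c for c in chunks if is_relevant(question, c)]
    return {"chunks": kept, "need_more": len(kept) < min_keep}
```

If `need_more` comes back true, the caller could raise `top_k` on the vector search and try again before sending anything to the large model.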

Reduce noise. Increase signal.


Take a look at https://docs.pongo.ai/what-is-pongo

I found it super easy to use. For 100 pdfs probably free as well.

And it’s easy to create a simple function to use as a tool call in Assistants or completions.

(disclaimer: I’m not in any way involved with Pongo, just a recent user)

UPDATE: they have (unfortunately) stopped the ‘storage’ part of the business and currently only provide reranking. I ended up switching to Pinecone hosted on AWS (now a standard option), which is (also) incredibly cheap and easy.
I DO use Pongo to rerank the results

So the flows I have:

store → openai text-embeddings-3 → pinecone index

search → pinecone vector search → (add some relevant metadata) → Pongo rerank to order results
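To make the shape of those two flows concrete, here is a toy, dependency-free sketch. Everything in it is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, `TinyIndex` is an in-memory stand-in for a vector database, and `rerank` is a simple term-overlap stand-in for a hosted reranker. None of these are the OpenAI, Pinecone or Pongo APIs.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyIndex:
    """In-memory stand-in for a vector index."""

    def __init__(self):
        self.items = []

    def upsert(self, doc_id, text, metadata=None):
        # store flow: text -> embedding -> index
        self.items.append(
            {"id": doc_id, "text": text, "vector": embed(text), "metadata": metadata or {}}
        )

    def query(self, text, top_k=3):
        # search flow, step 1: vector search
        qv = embed(text)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it["vector"]), reverse=True)
        return ranked[:top_k]


def rerank(query, hits):
    """search flow, step 2: reorder hits by the share of query terms they contain."""
    terms = set(query.lower().split())

    def overlap(hit):
        return len(terms & set(hit["text"].lower().split())) / max(len(terms), 1)

    return sorted(hits, key=overlap, reverse=True)
```

The point of the two-stage search is that the first stage is cheap and broad, while the reranker only has to look at a handful of candidates.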


Ask the GPT for help and advice regarding the pre-processing. It will recommend Python modules and help you write the code needed for extraction to JSON or JSONL.

The code gpt wrote was very useful:


import os
import json
import fitz  # PyMuPDF

def pdf_to_json(directory):
    """
    Converts all .pdf files in the specified directory to .json format.

    Parameters:
    directory (str): The path of the directory containing .pdf files.
    """
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        base_name, ext = os.path.splitext(filename)

        if ext.lower() == ".pdf":
            with fitz.open(file_path) as doc:
                text = ""
                for page in doc:
                    # Extract text from each page
                    text += page.get_text()

            # Prepare JSON content
            json_content = {"text": text}

            # Define output JSON file path
            json_file_path = os.path.join(directory, f"{base_name}.json")

            # Write the extracted text to a .json file
            with open(json_file_path, "w", encoding="utf-8") as json_file:
                json.dump(json_content, json_file, indent=4)

            print(f"Converted {filename} to {base_name}.json")

# Example usage
directory = "/specify/your/directory/here/"  # Adjust the path to your specific directory
pdf_to_json(directory)

I only have about 10 PDFs that I use for RAG and am experimenting with (one is 600 pages), but I asked GPT4 to write a Python script to convert all PDFs within a specified directory to text and then JSONL. It worked first try when I ran the code, though the formatting isn’t perfect.

these were the python modules I installed to do the processing:


pip install PyMuPDF
pip install pdfminer.six
pip install torch

here is your code starting point after the pip installs:

import os
import json
import sqlite3
import fitz # PyMuPDF
import re

Now I’m working on conversions going .pdf → jsonl → SQLite or .pdf → .txt → SQLite with GPT4 assisting in the coding. I’d like to put an entire 600-page textbook into SQLite to provide contextual referencing for the complex financial reporting I am working on.
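As a rough sketch of the jsonl → SQLite step, assuming one JSON object per line with `source`, `page` and `text` keys (that record layout, the `pages` table and both function names are made up for illustration, not what the script above produces):

```python
import json
import sqlite3


def load_jsonl_into_sqlite(jsonl_lines, conn):
    """Insert one row per JSONL record into a 'pages' table; returns the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (source TEXT, page INTEGER, text TEXT)"
    )
    rows = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        rows.append((rec["source"], rec["page"], rec["text"]))
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def search_pages(conn, keyword, limit=5):
    """Plain substring search; returns (source, page) pairs to cite as context."""
    cur = conn.execute(
        "SELECT source, page FROM pages WHERE text LIKE ? ORDER BY source, page LIMIT ?",
        (f"%{keyword}%", limit),
    )
    return cur.fetchall()
```

With a file on disk you could pass the open file handle as `jsonl_lines` and `sqlite3.connect("textbook.db")` as `conn`. Keeping the page number per row is what lets an answer point back to the right place in the 600-page book.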


Thank you so much for your reply!
So you are suggesting turning PDFs into JSON, right? Can RAG then receive them as input? Obviously, following the process I used before (HuggingFace + Zephyr).


Thank you so much for your reply!
Unfortunately, I was searching for something open-source and I think it can’t handle 100k PDFs :frowning:
In any case, Pongo looks so interesting and promising :slight_smile:

Thank you so much for your long and detailed reply!
Unfortunately, my PDFs are made of text, images and tables :frowning:

Do you suggest turning them into structured files, like JSON?
Can I then process them with RAG using HuggingFace and a text generation model (like Zephyr)? Do you think fine-tuning could be a solution?

Sorry for my simple reply, I am still a beginner…

GPTs (not to be confused with Custom GPTs… which are not to be confused with the website CustomGPTs), but uhhh… I guess GPT models by OpenAI have been extensively trained on Markdown, so I would personally convert the PDF to Markdown.

Images will not be captured by Retrieval. You could try using GPT-4V (Or GPT-Vision) to describe the image instead.

Try running a couple of pages through cGPT (not to be confused with Custom GPTs, but ChatGPT) to see how well it can convert them to Markdown.

If you are able to convert it to JSON it may make more sense to turn to function-calling. Maybe just focus on Retrieval for now before digging too deep and ending up confused in Australia.

Yes.

It could, but I would start with processing the documents before moving on to fine-tuning. It’s just adding another variable to an already complex equation.


@TeoR95 Thanks for sharing your use case. I have a couple of questions.

  • How do you measure the accuracy of your answers generated by the Language Model (LM)?
  • Sometimes, LLMs might amalgamate information from multiple sources, leading to inaccuracies. For instance, if the desired answer pertains solely to document1, but the LM mixes chunks from document1 and document2 indiscriminately, it could produce an erroneous response.

Could you clarify how you address or mitigate such potential issues in your workflow?
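For the mixing problem in the second bullet, one possible mitigation (a sketch, not a claim about anyone's actual workflow) is to restrict the context to chunks from a single document before generation. The hit layout with `doc_id`, `score` and `text` keys and the function name are assumptions for illustration.

```python
from collections import defaultdict


def keep_dominant_document(hits):
    """Keep only the chunks from the document with the highest total
    retrieval score, so the model cannot blend document1 with document2."""
    totals = defaultdict(float)
    for hit in hits:
        totals[hit["doc_id"]] += hit["score"]
    if not totals:
        return []
    best = max(totals, key=totals.get)
    return [hit for hit in hits if hit["doc_id"] == best]
```

This only makes sense for questions that are about one document; for questions that legitimately span several documents it would throw away needed context.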

Thank you again for your reply!

Yes, GPT models from OpenAI could be the easiest and the most efficient way, but I was searching for open-source solutions such as HuggingFace or anything else you would suggest! (I know, maybe I am not in the right forum for this, but I am pretty desperate and I am searching for answers…)

I know so little, and I want to know more, but at the same time I do not want to be tedious, so just a few more questions:

  1. Do you think PDFs → JSONs will make RAG faster?
  2. I did not know I could use JSON files with RAG. I thought I needed PDFs for the embedding and chunking process! Do you know how I should perform it with JSON?
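On the second question, the embedding step only ever sees plain text, so a JSON record can be chunked directly. A minimal sketch, assuming each JSON file holds a single `"text"` field (as in the conversion script earlier in the thread); `chunk_text` and `chunks_from_json_record` are hypothetical helper names, and the chunking is by characters rather than tokens.

```python
import json


def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunks_from_json_record(raw_json, source, **kwargs):
    """Parse one JSON document and return chunk dicts ready to be embedded."""
    record = json.loads(raw_json)
    return [
        {"source": source, "chunk_id": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(record["text"], **kwargs))
    ]
```

Each returned dict keeps the `source` next to the text, so whatever embedding model you use on the `text` field, the retrieved chunk can still be traced back to its file. A non-zero overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbours.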

Thank you again! I really appreciate your help :slight_smile:

Jeez, guess I undershot

Not @TeoR95’s fault though, this stuff’s super confusing. Reading the documentation, there’s nothing to really suggest that it wouldn’t work.

Nice questions! I will try to give you my tips:

  • For now, I am just checking whether the information given by the model is present in the PDFs or not. But yes, I must develop something to evaluate the accuracy of the answers.

  • Yes, this is called “hallucination” or “confabulation”. The LLM (Large Language Model) tries to give you an answer, even if it does not know anything about the topic. The result is erroneous elements.
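A very small first step towards the evaluation mentioned in the first point could be to check the specifics that tend to go wrong (years and URLs) against the retrieved text. This is only a sketch: `unsupported_specifics` is a hypothetical helper, the year pattern only covers 1900 to 2099, and an exact string match says nothing about whether the rest of the answer is right.

```python
import re

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def unsupported_specifics(answer, context):
    """Return years and URLs that appear in the answer but not in the context."""
    years = set(YEAR_RE.findall(answer))
    # strip trailing sentence punctuation that the URL pattern may swallow
    urls = {u.rstrip(".,;") for u in URL_RE.findall(answer)}
    return sorted(item for item in years | urls if item not in context)
```

An empty list does not prove the answer is correct, but a non-empty one is a cheap signal that the model invented a year or a link.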

The steps I tried are: “pure” fine-tuning = hallucination everywhere → RAG + prompting = very, very little hallucination, but too slow! → (now) fine-tuning + prompting = testing. To mitigate hallucination I am currently trying different approaches such as prompting!

Hope this is useful for you :slight_smile:


No problem. Keep in mind that some of what I say is opinionated and shouldn’t be taken as de-facto truth.

If you can structure your document into something like a spreadsheet or JSON, you’re better off using function calling as RAG. This provides all the benefits of typical database queries.
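A minimal sketch of what function calling as RAG can look like. The tool description follows the JSON-schema style used for chat-completion tools; the `reports` table, the `find_reports` function and the dispatcher are hypothetical, and the model call itself is left out so the example stays runnable offline.

```python
import json
import sqlite3

# Tool description the model would receive (hypothetical tool and fields)
FIND_REPORTS_TOOL = {
    "type": "function",
    "function": {
        "name": "find_reports",
        "description": "Look up reports by publication year.",
        "parameters": {
            "type": "object",
            "properties": {"year": {"type": "integer"}},
            "required": ["year"],
        },
    },
}


def find_reports(conn, year):
    """Exact database lookup instead of a fuzzy embedding search."""
    cur = conn.execute(
        "SELECT title, url FROM reports WHERE year = ? ORDER BY title", (year,)
    )
    return [{"title": title, "url": url} for title, url in cur.fetchall()]


def dispatch_tool_call(conn, name, arguments_json):
    """Run the function the model asked for; return a JSON string to send back."""
    args = json.loads(arguments_json)
    if name == "find_reports":
        return json.dumps(find_reports(conn, **args))
    raise ValueError(f"unknown tool: {name}")
```

Because the years and URLs come straight out of the database, the model only has to phrase the result, which is where the accuracy gain over embedding-based retrieval comes from for structured fields.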

For unstructured text RAG via embeddings is your best bet. You want to clean & process it into a format that works.

PDFs are very messy. They are instructional files built to display images and text on a static page, typically for printing. There are services that clean them and return text, but there will almost always be noise. If text is available it’s saved on a line-by-line basis. If there’s no text, OCR is required, which complicates things even further.

So it’s ALWAYS a good idea to transform the PDF to text, and then examine it before embedding it. I think it’s a disservice that OpenAI simply accepts PDFs without any documentation on the underlying functions.
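As a sketch of what examining the text could mean in practice, the helper below reports a few noise signals on already-extracted page text: empty pages (possibly scanned images needing OCR), lines repeated across pages (likely headers or footers), and the share of very short lines (often table or layout debris). The function name and the thresholds are arbitrary assumptions.

```python
from collections import Counter


def extraction_report(pages, short_line_chars=20):
    """pages: list of strings, one per extracted PDF page."""
    per_page_lines = [
        [ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages
    ]
    all_lines = [ln for lines in per_page_lines for ln in lines]
    # count on how many distinct pages each line occurs
    page_counts = Counter(ln for lines in per_page_lines for ln in set(lines))
    threshold = max(2, len(pages) // 2)
    short = sum(1 for ln in all_lines if len(ln) < short_line_chars)
    return {
        "empty_pages": sum(1 for lines in per_page_lines if not lines),
        "repeated_lines": sorted(ln for ln, n in page_counts.items() if n >= threshold),
        "short_line_ratio": short / len(all_lines) if all_lines else 0.0,
    }
```

Running this over a sample of documents before embedding gives a quick idea of which cleaning steps (header stripping, OCR, table handling) are worth building first.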

If you can split the structured text out into function calling: yes. A lot faster, cheaper, and more accurate. Then you can keep the unstructured text as embeddings, reducing the size and noise.

Function calling
