I am Matteo, I am new to this forum, and I am looking for a suggestion/solution for my AI project. The idea is to give a large language model thousands of English PDFs (around 100k, all about the same topic) and then be able to chat with it.
I followed several tutorials about RAG. They use HuggingFace to download the LLM. Unfortunately, when I asked the model (Zephyr-7B) something, it took almost 10 minutes to reply to one question. Moreover, it sometimes gave a sort of "hallucination" (for example, the title of the PDF is correct, but it gives erroneous years or URLs). Too much information for the model (for testing, I am just using 500 PDFs for now)? Is the chunk size wrong (I am using chunk_size=1000, chunk_overlap=0)?
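For reference, this is roughly the splitter configuration I mean; a minimal sketch assuming LangChain (which the tutorials I followed use), with a placeholder file path:

# Minimal chunking sketch (assumes LangChain; "example.txt" is a placeholder).
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("example.txt") as f:  # placeholder: text extracted from one PDF
    document_text = f.read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk (my current setting)
    chunk_overlap=200,  # a non-zero overlap keeps sentences that straddle a
)                       # boundary retrievable from both neighbouring chunks
chunks = splitter.split_text(document_text)
print(len(chunks), "chunks")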
I also tried to add a prompt template, without any luck at all.
Finally, I found that deploying RAG is too difficult and expensive.
Do you have any suggestions, please? Is fine-tuning perhaps the solution?
I am using Python 3.10 and a cluster with one GPU (32 GB) and 200 GB of CPU RAM available.
OK, first of all you need to pre-process your documents. Stop. Using PDFs. That's the first step. You may be able to get away with it if the text isn't baked into the image. Even then it is messy, as the text is read line-by-line and not organized the way it intuitively seems.
Second, if these 100k documents are nuanced in their differences, then you should instead run them all through a large-context model while respecting the recall loss across the context window that empirical testing has found (search "needle in the haystack LLM" for context). You should use the model to distill these documents.
THEN, with a hopefully much smaller, distilled version of your documents, preferably in something like Markdown, you can begin performing analysis on it to understand "how" the model is viewing it. Is the model clustering certain documents together? Is it better to combine them? Is it failing to capture specific, important keywords/terminologies?
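For the clustering question, a rough pass like this shows you which documents land together (sketch only; it assumes sentence-transformers and scikit-learn, and the embedding model is just a common default, not a recommendation):

# Embed each distilled document and cluster, then eyeball the groupings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = ["distilled doc 1 ...", "distilled doc 2 ...", "distilled doc 3 ..."]  # placeholders
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

labels = KMeans(n_clusters=2).fit_predict(embeddings)
for label, doc in sorted(zip(labels, docs)):
    print(label, doc[:60])  # which documents does the model see as similar?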
You can't just "send it" with some nasty abstracted-to-nothingness RAG framework like OpenAI Retrieval unless you are happy spending way too much money for very little result and high latency. You can send it if you properly process the documents and THEN use something like Retrieval. With 100k documents, though, you need to put in some effort that doesn't involve pushing random buttons and turning dials.
Finally, you can use a smaller model to parse the content returned from the embeddings and either A) return the filtered content, or B) determine that it's insufficient and request more.
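To make that last step concrete, the judge pass could look roughly like this (a sketch assuming the openai package; the model name and prompt wording are illustrative, not a recommendation):

# Rough sketch of the filter/judge step over retrieved chunks.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_chunks(question: str, chunks: list[str]) -> str:
    """Ask a small model whether the retrieved chunks answer the question.
    Returns the filtered content, or the literal string INSUFFICIENT."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, cheap model works here
        messages=[
            {"role": "system", "content": (
                "You are a retrieval filter. Given a question and context, "
                "return only the sentences relevant to the question, or reply "
                "exactly INSUFFICIENT if the context cannot answer it."
            )},
            {"role": "user", "content": f"Question: {question}\n\nContext:\n"
                                        + "\n\n".join(chunks)},
        ],
    )
    return resp.choices[0].message.content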
I found Pongo super easy to use. For 100 PDFs, probably free as well.
And it's easy to create a simple function to use as a tool call in Assistants or Completions.
(Disclaimer: I'm not in any way involved with Pongo, just a recent user.)
UPDATE: they have (unfortunately) stopped the "storage" part of the business and currently only provide reranking. I ended up switching to Pinecone hosted on AWS (now a standard option), which is (also) incredibly cheap and easy.
I DO use Pongo to rerank the results.
So the flows I have are:
store → OpenAI text-embedding-3 → Pinecone index
search → Pinecone vector search → (add some relevant metadata) → Pongo rerank to order results
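In code, the two flows look roughly like this (a sketch assuming the openai and pinecone packages; the index name, ids, and metadata are placeholders, and I leave the Pongo rerank step as a comment rather than guess at their API):

# Rough sketch of the store and search flows above.
from openai import OpenAI
from pinecone import Pinecone

oai = OpenAI()                              # reads OPENAI_API_KEY
pc = Pinecone(api_key="YOUR_PINECONE_KEY")  # placeholder key
index = pc.Index("pdf-chunks")              # placeholder index name

def embed(text: str) -> list[float]:
    return oai.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

# store: embed each chunk and upsert it with its metadata
index.upsert(vectors=[{
    "id": "doc1-chunk0",
    "values": embed("chunk text here"),
    "metadata": {"source": "doc1.pdf", "page": 1},
}])

# search: embed the query and pull the top matches plus metadata,
# then hand the matches to the reranker (Pongo, in my flow) for ordering.
matches = index.query(vector=embed("my question"), top_k=20,
                      include_metadata=True).matches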
Ask the GPT for help and advice regarding the pre-processing. It will recommend Python modules and help you write the code needed for extraction to JSON or JSONL.
The code GPT wrote was very useful:
import os
import json
import fitz  # PyMuPDF


def pdf_to_json(directory):
    """
    Converts all .pdf files in the specified directory to .json format.

    Parameters:
        directory (str): The path of the directory containing .pdf files.
    """
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        base_name, ext = os.path.splitext(filename)
        if ext.lower() == ".pdf":
            with fitz.open(file_path) as doc:
                text = ""
                for page in doc:
                    # Extract text from each page
                    text += page.get_text()
            # Prepare JSON content
            json_content = {"text": text}
            # Define output JSON file path
            json_file_path = os.path.join(directory, f"{base_name}.json")
            # Write the extracted text to a .json file
            with open(json_file_path, "w") as json_file:
                json.dump(json_content, json_file, indent=4)
            print(f"Converted {filename} to {base_name}.json")


# Example usage
directory = "/specify/your/directory/here/"  # Adjust the path to your specific directory
pdf_to_json(directory)
I only had about 10 PDFs I use for RAG (one is 600 pages) that I'm experimenting with, but I asked GPT-4 to write a Python script to convert all PDFs to text and then JSONL within a specified directory. It worked on the first try when I ran the code, though the formatting isn't perfect.
These were the Python modules I installed to do the processing:
Here is your code starting point after the pip installs:
import os
import json
import sqlite3
import fitz # PyMuPDF
import re
Now I'm working on conversions going .pdf → .jsonl → SQLite or .pdf → .txt → SQLite, with GPT-4 assisting in the coding. I'd like to put an entire 600-page textbook into SQLite to provide contextual referencing for the complex financial reporting I am working on.
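The SQLite step I'm aiming for is shaped like this (sketch only; the file, table, and column names are just what I picked, and the keys in each JSONL record are assumed for illustration):

# Sketch of the jsonl -> SQLite step.
import json
import sqlite3

conn = sqlite3.connect("textbook.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pages (
                    source TEXT, page INTEGER, content TEXT)""")

with open("textbook.jsonl") as f:  # one JSON object per line
    for line in f:
        rec = json.loads(line)     # assumed shape: {"source": ..., "page": ..., "text": ...}
        conn.execute("INSERT INTO pages VALUES (?, ?, ?)",
                     (rec["source"], rec["page"], rec["text"]))
conn.commit()
conn.close()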
Thank you so much for your reply!
So you are suggesting turning the PDFs into JSON, right? Can RAG then receive them as input? Obviously following the process I used before (HuggingFace + Zephyr).
Thank you so much for your reply!
Unfortunately, I was searching for something open-source, and I think it can't handle 100k PDFs.
In any case, Pongo looks very interesting and promising.
Thank you so much for your long and detailed reply!
Unfortunately, my PDFs are made of text, images, and tables.
Do you suggest turning them into structured files, like JSON?
Can I then process them with RAG using HuggingFace and a text-generation model (like Zephyr)? Do you think fine-tuning could be a solution?
Sorry for my simple reply, I am still a beginner…
GPTs (not to be confused with Custom GPTs… which are not to be confused with the website CustomGPTs), but uhhh… I guess GPT models by OpenAI have been extensively trained on Markdown, so I would personally convert the PDFs to Markdown.
Images will not be captured by Retrieval. You could try using GPT-4V (or GPT-Vision) to describe the image instead.
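Something in this direction (a sketch assuming the openai package; the model name and file path are placeholders):

# Sketch: describe an image pulled out of a PDF with a vision model.
import base64
from openai import OpenAI

client = OpenAI()
with open("figure_page_3.png", "rb") as f:  # placeholder image path
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; pick whatever is current
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this figure so the description "
                                 "can replace the image in a text-only corpus."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
)
print(resp.choices[0].message.content)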
Try running a couple of pages through cGPT (not to be confused with Custom GPTs, but ChatGPT) to see how well it can convert them to Markdown.
If you are able to convert it to JSON, it may make more sense to turn to function calling. Maybe just focus on Retrieval for now before digging too deep and ending up confused in Australia.
Yes.
It could, but I would start with processing the documents before moving on to fine-tuning. It's just adding another variable to an already complex equation.
@TeoR95 Thanks for sharing your use case. I have a couple of questions.
How do you measure the accuracy of the answers generated by the Language Model (LM)?
Sometimes, LLMs might amalgamate information from multiple sources, leading to inaccuracies. For instance, if the desired answer pertains solely to document1, but the LM mixes chunks from document1 and document2 indiscriminately, it could produce an erroneous response.
Could you clarify how you address or mitigate such potential issues in your workflow?
Yes, GPT models from OpenAI could be the easiest and most efficient way, but I was searching for open-source solutions such as HuggingFace or anything else you would suggest! (I know, maybe I am not in the right forum for this, but I am pretty desperate and I am searching for answers…)
I know so little, and I want to know more, but at the same time I do not want to be tedious, so just a few more questions:
Do you think PDFs → JSON will make RAG faster?
I did not know I could use JSON files with RAG. I thought I needed PDFs for the embedding and chunking process! Do you know how I should perform it with JSON?
For now, I am just checking whether the information given by the model is present in the PDFs or not. But yes, I must develop something to evaluate the accuracy of the answers.
Yes, this is called "hallucination" or "confabulation". The LLM (Large Language Model) tries to give you an answer even if it does not know anything about the topic. The result is erroneous elements.
The stages I have tried are: "pure" fine-tuning = hallucination everywhere → RAG + prompting = very, very little hallucination, but too slow! → (now) fine-tuning + prompting = testing. To mitigate hallucination I am currently trying different approaches, such as prompting!
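For what it's worth, the kind of grounding template I am testing looks like this (just a sketch of my current attempt; the wording is my own):

# The sort of grounding template I am testing.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know" instead of guessing.
Quote titles, years, and URLs exactly as they appear in the context.

Context:
{context}

Question: {question}
Answer:"""

prompt = PROMPT_TEMPLATE.format(context="...retrieved chunks...",
                                question="...user question...")
print(prompt)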
No problem. Keep in mind that some of what I say is opinionated and shouldn't be taken as de facto truth.
If you can structure your document into something like a spreadsheet or JSON, you're better off using function calling as RAG. This provides all the benefits of typical database queries.
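To illustrate what I mean by "function calling as RAG" (a sketch; the tool schema, function name, and the lookup behind it are made up for the example):

# Expose a structured-store query as a tool the model can call.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",  # hypothetical tool name
        "description": "Query the structured document store.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Document title"},
                "year": {"type": "integer", "description": "Publication year"},
            },
            "required": ["title"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "When was report X published?"}],
    tools=tools,
)
# If the model chose to call the tool, run a normal database query with the
# arguments it produced, then send the rows back in a follow-up message.
print(resp.choices[0].message.tool_calls)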
For unstructured text, RAG via embeddings is your best bet. You want to clean and process it into a format that works.
PDFs are very messy. They are instructional files built to display images and text on a static page, typically for printing. There are services that clean them and return text, but there will almost always be noise. If text is available, it's saved on a line-by-line basis. If there's no text, OCR is required, which complicates things even further.
So it's ALWAYS a good idea to transform the PDF to text and then examine it before embedding it. I think it's a disservice that OpenAI simply accepts PDFs without any documentation of the underlying functions.
If you can split the structured text out into function calling: yes, a lot faster, cheaper, and more accurate. Then you can keep just the unstructured text as embeddings, reducing the size and noise.