Help with PDF-Based Chatbot and hallucination issues

Hello community,

I’m working on a project aimed at creating a chatbot that uses PDF files as its database. The goal is for the user to be able to ask questions and for the API to provide exact responses extracted from the PDFs. Currently, I’m using the OpenAI API.
My main issue is the model’s hallucinations. I have read in some forums that the OpenAI API tends to hallucinate when it is provided with many files, and I’m currently seeking solutions to this problem as well as experiences with the best approach to take. I’m not sure whether it’s a problem specific to the OpenAI API or a common problem with AI in general. For some additional context, I have 50 PDF files, each between 50 and 100 pages long.
Here are my questions:

  • How can I reduce or eliminate hallucinations when working with multiple PDFs?
  • What techniques or approaches (e.g., ML or DL) are recommended to improve the chatbot’s accuracy? Specifically, I want it to extract information directly from the PDFs without generating false data caused by hallucinations.

Thank you very much in advance

Welcome @dzeg,

I’d recommend extracting text from a PDF to a text file and supplying that, especially if you want to build an efficient knowledge base for the assistant.

The reason is that PDF is a pretty complex format where data can be text, scanned text, image, or a mixture of these, and this makes it very difficult to ensure that the assistant can really access the knowledge you want it to use.

As far as I understand, when you pass a PDF to the assistant, a similar text-extraction pipeline is likely what runs under the hood anyway — so doing the extraction yourself gives you control over the result.


@dzeg - Try to semantically chunk and add tags to your PDF files if you are performing RAG. I think there is a whole thread on this forum where people discuss the same topic. Cheers!


All LLMs (Large Language Models) will hallucinate, not just ChatGPT or OpenAI.

Also take a look at RAG (Retrieval-Augmented Generation) for other ideas.

Don’t forget that ChatGPT can actually help you learn about these topics.

Prompt

I’m working on a project aimed at creating a chatbot that uses PDF files as its database. The goal is for the user to be able to ask questions and for the API to provide exact responses extracted from the PDFs. Currently, I’m using the OpenAI API.
My main issue is the model’s hallucinations. I have read in some forums that the OpenAI API tends to hallucinate when it is provided with many files, and I’m currently seeking solutions to this problem as well as experiences with the best approach to take. I’m not sure whether it’s a problem specific to the OpenAI API or a common problem with AI in general. For some additional context, I have 50 PDF files, each between 50 and 100 pages long.
Here are my questions:

  • How can I reduce or eliminate hallucinations when working with multiple PDFs?
  • What techniques or approaches (e.g., ML or DL) are recommended to improve the chatbot’s accuracy? Specifically, I want it to extract information directly from the PDFs without generating false data caused by hallucinations.

Reply

Creating a chatbot that extracts information directly from PDFs is a challenging task, particularly when it comes to reducing hallucinations and ensuring accurate responses. Here are some approaches and techniques that can help improve the accuracy and reduce hallucinations:

1. Use an Embedding-Based Retrieval System

An embedding-based retrieval system can significantly reduce hallucinations by ensuring the responses are grounded in the actual content of the PDFs.

  • Vector Embeddings: Convert the text content of your PDFs into vector embeddings using models like OpenAI’s text-embedding-ada-002.
  • Similarity Search: Use similarity search techniques (e.g., cosine similarity) to find the most relevant passages from the PDFs in response to a query.
  • Retrieval-Augmented Generation (RAG): Use the retrieved passages as context for generating answers. This way, the model’s output is directly tied to the content of the PDFs.
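The retrieval step above can be sketched in a few lines. This toy example uses hand-made 2-D vectors in place of real embeddings (which would come from an embeddings API such as text-embedding-ada-002 and have ~1,500 dimensions); the chunk texts are invented for illustration:

```python
import numpy as np  # third-party, widely available

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, chunk_vecs, k=2):
    """Indices of the k chunks most similar to the query vector."""
    scores = [cosine_sim(query_vec, v) for v in chunk_vecs]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

# Toy stand-in data: in practice both chunks and query are embedded
# with the same embeddings model before comparison.
chunks = ["refund policy text", "shipping times text", "warranty terms text"]
chunk_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
query_vec = np.array([0.9, 0.1])

print(chunks[top_k(query_vec, chunk_vecs, k=1)[0]])  # most relevant chunk
```

The retrieved chunks are then pasted into the prompt as context, which is what ties the model’s answer to the actual PDF content.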

2. Fine-Tuning and Training

Fine-tuning a language model on a dataset that closely resembles your use case can help reduce hallucinations.

  • Domain-Specific Fine-Tuning: Fine-tune a pre-trained language model on a domain-specific dataset. This can include documents similar to your PDFs.
  • Supervised Training: Create a dataset of questions and answers based on your PDFs. Use this to train or fine-tune your model, ensuring it learns to extract information correctly.
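For the supervised route, OpenAI fine-tuning expects one JSON object per line in chat format. A minimal sketch of building such a file — the question, answer, and section number below are made-up placeholders you would replace with real pairs drawn from your PDFs:

```python
import json

# Hypothetical Q&A pairs written by hand from the PDFs (illustrative only).
examples = [
    {"messages": [
        {"role": "system",
         "content": "Answer only from the provided document excerpts."},
        {"role": "user", "content": "What is the warranty period?"},
        {"role": "assistant",
         "content": "The warranty period is 24 months (section 4.2)."},
    ]},
]

# One JSON object per line (JSONL), the format expected by fine-tuning.
with open("finetune.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```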

3. Chunking and Preprocessing

Proper preprocessing of the PDFs is crucial for accurate information retrieval.

  • Chunking: Break down the PDFs into manageable chunks (e.g., paragraphs or sections). This makes it easier for the model to retrieve relevant information.
  • Metadata Extraction: Extract and index metadata (e.g., headings, subheadings) to help the retrieval process.
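A simple chunking scheme is fixed-size windows with overlap, so a sentence cut at one chunk boundary still appears whole in the neighbouring chunk. The sizes below are illustrative defaults, not recommendations:

```python
def chunk_text(text, max_chars=500, overlap=100):
    """Split text into fixed-size chunks that overlap by `overlap`
    characters, so content at a boundary is never lost entirely."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks
```

Splitting on paragraph or section boundaries instead of raw character counts usually gives more coherent chunks, at the cost of variable sizes.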

4. Use of External Tools and APIs

Leveraging existing tools and APIs can enhance the information retrieval process.

  • ElasticSearch: Use ElasticSearch to index the PDF content and perform efficient full-text search queries.
  • Document AI Solutions: Consider using document AI solutions like Google Cloud Document AI, which are designed to extract structured data from unstructured documents.

5. Multi-Modal Approaches

Combining multiple models and techniques can enhance accuracy.

  • Pipeline Approach: Use a combination of retrieval and generative models. For example, use a retrieval model to find relevant passages and a generative model to construct the answer.
  • Hybrid Systems: Combine rule-based systems with AI models. Rule-based systems can handle straightforward queries, while AI models can handle more complex queries.

6. Evaluation and Feedback Loop

Continuous evaluation and improvement are key to reducing hallucinations.

  • Human-in-the-Loop: Incorporate a human-in-the-loop process where human reviewers validate and correct the model’s responses.
  • Continuous Learning: Implement a feedback loop where the model learns from its mistakes over time.

Tools and Libraries

  • PyPDF2, PDFMiner: Libraries for extracting text from PDF files.
  • OpenAI GPT-4: For fine-tuning and generating responses.
  • FAISS, Annoy: Libraries for efficient similarity search and nearest neighbor retrieval.
  • LangChain: A library that helps create advanced language model applications.

Example Workflow

  1. Text Extraction: Extract text from PDFs using PyPDF2 or PDFMiner.
  2. Indexing: Use a tool like ElasticSearch to index the text.
  3. Embedding: Convert text chunks into embeddings using OpenAI’s embedding models.
  4. Retrieval: Perform similarity search to retrieve the most relevant chunks.
  5. Response Generation: Use a fine-tuned GPT model to generate responses based on the retrieved chunks.
  6. Evaluation: Implement a feedback loop to continuously evaluate and improve the model.
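The retrieval and generation steps of this workflow can be wired together as below. To keep the sketch runnable offline, a crude word-overlap score stands in for embedding similarity, and instead of calling the chat API the function just returns the grounded prompt it would send; the chunk texts and instruction wording are invented:

```python
def score(query, chunk):
    """Word-overlap score: a crude offline stand-in for embedding
    similarity, used here only to keep the sketch self-contained."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def build_grounded_prompt(query, chunks, k=1):
    """Retrieve the top-k chunks and wrap them in a prompt telling the
    model to answer only from that context. In a real system this
    string is sent to a chat-completions call."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    context = "\n".join(ranked[:k])
    return ("Answer using ONLY the context below. If the answer is "
            "not in the context, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

chunks = ["The warranty period is 24 months.", "Shipping takes 5 days."]
print(build_grounded_prompt("What is the warranty period", chunks))
```

The explicit “say you don’t know” instruction in the prompt is one of the cheapest hallucination mitigations available, since it gives the model an allowed escape route when retrieval comes back empty.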

By combining these techniques, you can create a robust system that minimizes hallucinations and provides accurate responses based on the content of your PDFs.


Hello,
I had a similar problem: a use case with more than 15 PDFs, each larger than 80 pages, using word/intent search. I observed that semantic search quality became worse for passages with low TF-IDF scores.

Immediate relief

  1. Use an index-based vector search, such as Sinequa with hybrid neural search; this is a more evolved search-and-summarization approach than plain semantic search.
  2. Provide tags as examples in the prompt design to reduce contextual loss.

PS: Your problem is compounded because your PDFs run to 80+ pages each, and depending on how the documents are chunked you could have exhausted GPT-4’s token budget — multiplied across many documents. I found an index-based solution useful. Perhaps also look at tools like Lynx for detecting and controlling model hallucination.

Let me know if this has worked for you. Happy to help further.


I am new to prompting and facing hallucinations, wrong page numbers, and missing data. Can you guide me through the process you mentioned in detail? I want to learn about these techniques and try to implement them.
Thank you in advance.

Hi @dzeg

It seems like Breebs could be a solution, given the number of PDFs you have.
It handles all the RAG for you, and it’s 100% free.