Creating a bot using 100+ PDFs as the knowledge base

Hello! I’m speaking as an absolute novice to coding. I’m looking to create a custom GPT that scrapes hundreds of PDF files reporting on 2024 trends, to use predominantly for Q&A.

I tried to use this tutorial - Build AI chatbot with custom knowledge base using OpenAI API and GPT Index - which comes with its own Colab source code. Whilst it was great, it was using outdated parameters for the temperature and model name (I think).

I know there’s a cost involved with retrieving information from each PDF, which could certainly rack up. I’ve tried following other tutorials and looking at various GitHub posts, but nothing’s really working.
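On the cost worry, a back-of-the-envelope estimate can help before committing to anything. The sketch below is only arithmetic under stated assumptions: roughly 4 characters per token for English text (a common rule of thumb, not an exact figure), a made-up document size, and a placeholder price that you would replace with the current number from your provider's pricing page.

```python
def estimate_cost_usd(total_chars: int,
                      usd_per_million_tokens: float,
                      chars_per_token: float = 4.0) -> float:
    """Rough cost of processing `total_chars` characters of text once."""
    tokens = total_chars / chars_per_token          # rule-of-thumb token count
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical corpus: 100 PDFs x 30 pages x ~3,000 characters per page.
total_chars = 100 * 30 * 3000

# Placeholder price, NOT a real quote: look up the current price yourself.
cost = estimate_cost_usd(total_chars, usd_per_million_tokens=0.10)
print(f"Estimated one-off processing cost: ${cost:.3f}")
```

Note that this only covers reading the text once. Each question asked later also costs tokens, so the running total depends on how often the bot is used.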

Is anyone able to help? Or point me in the direction of a tutorial? Would appreciate it. Thanks.

There is a limit of 20 PDFs that can be uploaded. So either you need to condense your entire content into 20 PDFs, or you need to use a third-party service that allows 100+ PDFs.
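As a rough sketch of the condensing idea, the snippet below plans how to spread many files over at most 20 bundles while keeping the bundles about the same size. It only does the planning from file sizes; actually merging the PDFs would need a separate tool. The file names and sizes are hypothetical, and the per-file size cap is not stated in this thread, so compare the reported largest bundle against whatever limit applies to you.

```python
from typing import Dict, List


def group_files(sizes_mb: Dict[str, float], max_groups: int = 20) -> List[List[str]]:
    """Spread files over at most `max_groups` bundles with balanced total size."""
    groups: List[List[str]] = [[] for _ in range(min(max_groups, len(sizes_mb)))]
    totals = [0.0] * len(groups)
    # Greedy balancing: take the largest files first and always add the next
    # file to whichever bundle is currently the lightest.
    for name, size in sorted(sizes_mb.items(), key=lambda kv: kv[1], reverse=True):
        lightest = totals.index(min(totals))
        groups[lightest].append(name)
        totals[lightest] += size
    return groups


# Hypothetical input: 120 reports of between 1 and 5 MB each.
sizes = {f"report_{n:03d}.pdf": 1.0 + (n % 5) for n in range(120)}

bundles = group_files(sizes)
largest = max(sum(sizes[name] for name in bundle) for bundle in bundles)
print(f"{len(bundles)} bundles, largest is {largest:.1f} MB")
```

If the largest bundle is still over the limit, condensing alone will not be enough and the files need shrinking first, for example by converting them to plain text.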

If you don’t want to build your own, you could try this GPT, which handles about 500 MB of storage:

BREEBS Chatbot (sorry, I’m not allowed to post links yet). :blush:

Hey Shawn, is my understanding right that even if I condensed the PDFs into 20, they would still go over the size restriction?

Thanks! I’m going to give BREEBS a go. I would love to build my own, I think it would be really satisfying. I just need to get my head around LangChain and LlamaIndex! I’m using Colab at the moment as I don’t have Python either! An absolute rookie aha.

You don’t sound like a rookie to me. I have been delving into custom chatbots since February 2023. Like you, I haven’t found a good solution yet.

Most custom chatbots are built for customer service, not multi-document analysis and summarization. As a result, they only spew out singular facts. I’m considering hiring a developer to get this done.

Ideally it should not, unless your PDF files are very large, in the range of gigabytes.

Happy to build it for you, let me know. Or we can pair code it. LangChain and LlamaIndex are the right tools for something production grade with an extensible vector store.

You may be interested in this thread: Results Of Using The Assistant API. The author describes his journey.

Happy to discuss, as long as you don’t mind being part of a story I’m writing about my journey.

I’m not able to post links, so you will have to google “Toolhacker Building a Custom Document Chatbot.”

Thanks for sharing! This is great! I knew I would have to look at data sorting or scraping the PDFs into a text file, but I am secretly dreading it. Thanks again for sharing, super useful!

You can try converting the PDFs to text. I was able to fit way more into 10 .txt files… and I asked the GPT Builder its preference and it said .txt, so… Good luck!

Try ChatPDF. I’ve opened and worked with about 100 PDFs in it, but it takes a bit of training. You’ll need to subscribe, of course. I think it’s $5/month, but I’m not sure, and you have to open each file and compare them one by one.

Reading this makes me wonder if it’s all worth it. Getting it working in a day and a half seems a bit optimistic. And then it still generates errors by misreading the content. :thinking:

You can check the GitHub repository Anil-matcha/ChatPDF ("Chat with any PDF") to get an understanding of how to build a chat-with-PDFs app that uses the documents as its knowledge base.

I have an Assistant working based on around 2 GB of PDF data. The hard work is in the data preparation!
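To give a flavour of what data preparation can involve, here is a minimal sketch that tidies text already extracted from a PDF and splits it into overlapping chunks. It is not the poster's actual pipeline. The cleanup rules, the chunk size and the overlap are all assumptions to tune on your own documents; in particular, dropping lines that contain only digits assumes those are page numbers, which may be wrong for tables.

```python
import re
from typing import List


def clean_text(raw: str) -> str:
    """Tidy text extracted from a PDF (assumed rules, adjust as needed)."""
    # Rejoin words that were hyphenated across a line break: "adop-\ntion".
    text = re.sub(r"-\n(?=[a-z])", "", raw)
    # Drop lines that hold nothing but a number (assumed to be page numbers).
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)
    # Collapse all runs of whitespace, including newlines, to single spaces.
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk so sentences cut at a boundary are not lost."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


raw = "Trends in 2024 show rapid adop-\ntion of AI.\n\n12\n\nRetail spending grew."
print(clean_text(raw))

long_text = " ".join(str(i) for i in range(500))  # stand-in for a real document
chunks = chunk_words(long_text, size=200, overlap=40)
print(len(chunks), "chunks")
```

Smaller chunks make retrieval more precise but lose context, so the sizes here are only a starting point.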

And this was my recent reply to that thread: Results Of Using The Assistant API - #6 by SomebodySysop

I think the Assistants API is a great tool for most people who want a no-code solution and don’t have a ton of files. I mean, what’s not to like? I just think that some people get so used to the ease-of-use aspect that they lose sight of what they really want to achieve: the best answers at the lowest cost possible.

I’m just saying that the Assistants API (or GPTs) aren’t necessarily the best solutions in all cases.

Have you considered using external tools? Check out Knowbase.ai: you can upload multiple files at once, up to 1 GB per file, and in minutes you have a ready knowledge base to chat with.

Yes, GPT Index or LlamaIndex is a good choice. You need to use a vector store like Pinecone for it to work properly. I have already made it work with a 1,000-page file using Pinecone, and it was not a problem. Thank you!
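For anyone wondering what a vector store actually does, here is a toy sketch of the idea using only the Python standard library. It does not use Pinecone or LlamaIndex. The `embed` function is a stand-in that just counts words, where a real setup would call an embedding model, and `TinyVectorStore` is a hypothetical class written for this illustration. The sample chunks and question are made up.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory stand-in for a hosted vector store (hypothetical class)."""

    def __init__(self) -> None:
        self.items: List[Tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        self.items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> List[Tuple[float, str]]:
        q = embed(query)
        scored = [(cosine(q, vec), chunk) for chunk, vec in self.items]
        return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


store = TinyVectorStore()
for chunk in [
    "Retail spending on AI tools is expected to grow in 2024.",
    "Remote work policies are stabilising across large firms.",
    "Consumer interest in electric vehicles keeps rising.",
]:
    store.add(chunk)

top = store.search("How will AI spending change in 2024?", k=2)
for score, chunk in top:
    print(f"{score:.2f}  {chunk}")
```

In a real pipeline, the best-matching chunks are pasted into the prompt together with the question, so the model only reads the relevant few paragraphs rather than every PDF.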