Creating a bot using 100+ PDFs as the knowledge base

kristie.to · January 22, 2024, 5:30pm

Hello! I’m speaking as an absolute novice to coding. I’m looking to create a custom GPT that scrapes hundreds of PDF files reporting on 2024 trends, to use predominantly for Q&A.

I tried to use this tutorial - Build AI chatbot with custom knowledge base using OpenAI API and GPT Index - which comes with its own Colab source code. Whilst it was great, it used outdated parameters for the temperature and model name (I think).

I know there’s a cost involved with retrieving information from each PDF, which could certainly rack up. I’ve tried following other tutorials and looking at various GitHub posts, but nothing’s really working.
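
For a rough sense of how that cost could rack up, a back-of-envelope estimate can help. The sketch below assumes roughly 4 characters per token (a common rule of thumb for English text, not an exact count) and takes the price as a parameter, because real prices depend on the model and change over time. The function names and the example price are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Approximate the token count of a text (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_processing_cost(texts, price_per_million_tokens):
    """Return (total_tokens, cost) for sending every text through a model once.

    price_per_million_tokens is a placeholder: look up the current price
    of whichever model you actually use.
    """
    total_tokens = sum(estimate_tokens(t) for t in texts)
    cost = total_tokens / 1_000_000 * price_per_million_tokens
    return total_tokens, cost


if __name__ == "__main__":
    # 100 made-up documents of 4,000 characters each, at a made-up price.
    fake_docs = ["a" * 4000] * 100
    tokens, cost = estimate_processing_cost(fake_docs, price_per_million_tokens=0.10)
    print(f"{tokens} tokens, roughly {cost:.4f} in your billing currency")
```

This only covers a single pass over the documents (for example, embedding them once). Each question asked afterwards adds its own cost for the retrieved text plus the answer.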

Is anyone able to help? Or point me in the direction of a tutorial? Would appreciate it. Thanks.

shawnharris963 · January 22, 2024, 5:46pm

There is a limit of 20 PDFs that can be uploaded. So you either need to condense your entire content into 20 PDFs or use a third-party service that allows 100+ PDFs.

ubercool · January 22, 2024, 6:09pm

If you don’t want to build your own, you could try this GPT, which handles about 500 MB of storage:

BREEBS Chatbot (sorry, I’m not allowed to post links yet).

kristie.to · January 22, 2024, 6:11pm

Hey Shawn, my understanding is that even if I condensed the PDFs into 20, it would still go over the size restriction?

kristie.to · January 22, 2024, 6:14pm

Thanks! I’m going to give BREEBS a go. I would love to build my own, I think it would be really satisfying. Just need to get my head around LangChain and LlamaIndex! I’m using Colab at the moment as I don’t have Python installed either! An absolute rookie aha.

ubercool · January 22, 2024, 6:45pm

You don’t sound like a rookie to me. I have been delving into custom chatbots since February 2023. Like you, I haven’t found a good solution yet.

Most custom chatbots are built for customer service, not multi-document analysis and summarization. As a result, they only spew out singular facts. I’m considering hiring a developer to get this done.

shawnharris963 · January 22, 2024, 9:04pm

Ideally it should not, unless your PDF files are very large, in the range of gigabytes.

engagepy · January 22, 2024, 9:22pm

Happy to build it for you, let me know. Or pair code it. LangChain and LlamaIndex are the right tools for something production grade with an extensible vector store.

konradk · January 22, 2024, 9:25pm

You may be interested in this thread: Results Of Using The Assistant API. The author describes his journey.

ubercool · January 22, 2024, 10:27pm

Happy to discuss as long as you don’t mind being part of a story I’m writing about my journey.

I’m not able to post links, so you will have to google “Toolhacker Building a Custom Document Chatbot.”

kristie.to · January 22, 2024, 11:40pm

Thanks for sharing! This is great! I knew I would have to look at data sorting or scraping the pdfs into a text file but am secretly dreading it. Thanks again for sharing, super useful!

laloadrianmorales · January 22, 2024, 11:56pm

You can try converting the PDFs to text. I was able to fit way more into 10 txt files… and I asked the GPT Builder its preference and it said txt, so… Good luck!
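
A minimal sketch of that conversion, assuming the third-party pypdf package is installed (pip install pypdf). The helper names, the bundle file names, and the choice of 10 output files are illustrative, not anything the GPT Builder requires. Scanned PDFs with no text layer will come out empty and would need OCR instead.

```python
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the text of one PDF (assumes the third-party pypdf package)."""
    # Imported lazily so pack_into_bundles can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may yield nothing for scanned pages, hence the `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pack_into_bundles(named_texts, max_bundles=10):
    """Spread (name, text) pairs over at most max_bundles strings of similar size."""
    bundles = [[] for _ in range(max_bundles)]
    sizes = [0] * max_bundles
    # Largest documents first, each into whichever bundle is currently smallest.
    for name, text in sorted(named_texts, key=lambda item: len(item[1]), reverse=True):
        i = sizes.index(min(sizes))
        bundles[i].append(f"=== {name} ===\n{text}\n")
        sizes[i] += len(text)
    return ["\n".join(parts) for parts in bundles if parts]


def convert_folder(pdf_dir, out_dir, max_bundles=10):
    """Convert every PDF in pdf_dir and write the bundles as .txt files in out_dir."""
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    texts = [(p.name, extract_pdf_text(p)) for p in pdfs]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n, bundle in enumerate(pack_into_bundles(texts, max_bundles), start=1):
        (out / f"bundle_{n:02d}.txt").write_text(bundle, encoding="utf-8")
```

Calling convert_folder("pdfs", "txt_out") would read every PDF in a folder named pdfs and write up to 10 text files into txt_out. Each source file keeps a header line with its name, so answers can still be traced back to a specific report.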

kangwoner · January 23, 2024, 1:03am

Try chatpdf. I’ve opened and worked with about 100, but it takes a bit of training. You’ll need to subscribe, of course. I’m not sure if it’s $5/month, and you have to open each one and compare them one by one.

ubercool · January 23, 2024, 6:32am

Reading this makes me wonder if it’s all worth it. Getting it working in a day and a half seems a bit optimistic. And then it still generates errors by misreading the content.

nick.mckenna · January 23, 2024, 7:42am

I have an Assistant working based on around 2 GB of PDF data. The hard work is in the data preparation!

SomebodySysop · January 23, 2024, 9:53am

And this was my recent reply to that thread: Results Of Using The Assistant API - #6 by SomebodySysop

I think the Assistants API is a great tool for most people who want a no-code solution and don’t have a ton of files. I mean, what’s not to like? I just think that some people get so used to the ease-of-use aspect that they lose sight of what they really want to achieve: the best answers at the lowest cost possible.

I’m just saying that the Assistants API (or GPTs) aren’t necessarily the best solutions in all cases.

maciej.morzywolek · February 1, 2024, 7:53am

Have you considered using external tools? Check out Knowbase.ai: you can upload multiple files at once, up to 1 GB per file, and in minutes you have a knowledge base ready to chat with.

freelancedev112 · February 7, 2024, 3:27pm

Yes, GPT Index or LlamaIndex is a good choice. You need to use a vector store like Pinecone for it to work properly. I have already made it work with a 1,000-page file using Pinecone, and it was not a problem. Thank you!
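
To show what a vector store does in this setup, here is a toy sketch in plain Python. It is not LlamaIndex or Pinecone code: the "embedding" is a simple word-count vector standing in for a real embedding model, and the store is an in-memory list. The names chunk_text, embed, and TinyVectorStore are made up for illustration. The flow is the same as in the real tools: split each document into chunks, turn each chunk into a vector, then return the chunks closest to the question.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_words=200, overlap=40):
    """Split text into overlapping windows of words (requires overlap < chunk_words)."""
    words = text.split()
    if not words:
        return []
    step = chunk_words - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_words]) for i in range(0, stop, step)]


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory list of (source, chunk, vector); a real store would persist and index these."""

    def __init__(self):
        self.items = []

    def add_document(self, source, text, chunk_words=200, overlap=40):
        for chunk in chunk_text(text, chunk_words, overlap):
            self.items.append((source, chunk, embed(chunk)))

    def search(self, question, top_k=3):
        """Return the top_k (source, chunk) pairs most similar to the question."""
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(source, chunk) for source, chunk, _ in ranked[:top_k]]


if __name__ == "__main__":
    store = TinyVectorStore()
    store.add_document("retail.pdf", "Retail trends in 2024 focus on social commerce and resale platforms.")
    store.add_document("travel.pdf", "Travel trends in 2024 include slow travel and rail holidays.")
    for source, chunk in store.search("What are the retail commerce trends?", top_k=1):
        print(source, "->", chunk)
```

In a real pipeline the retrieved chunks are pasted into the prompt so the model answers from them. That is why the number of PDFs matters much less than how well they are chunked and indexed.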

bas.degoei · June 25, 2024, 5:51am

Not sure if you’re still working on this. You can achieve this with Instabase AI Hub, which runs with multiple GPTs depending on which one you choose. Go to aihub, sign in with any Gmail, go to Converse, upload as many PDFs as you like regardless of length (and any other doc, scans, handwritten notes, Excel docs), select Multistep model, ask a few sample questions and click ‘create chatbot’. It will generate a chatbot with a link and API which can be used wherever.
