I have a custom GPT that uses PDF books as its source in the knowledge base. The GPT limit I’ve seen so far is 10 total uploads and 10 files per upload (usually zipped). A few questions:
To ensure the GPT gets all the information it needs from the source, what format is best? PDF, JSON, etc.?
What is the best way to get maximum performance? Uploading to the knowledge base, or hosting on a web server and using an API?
For an optimal, thorough search, is it best to separate everything into files (e.g. by chapter), or can all the PDFs be grouped into one large file without losing the GPT’s ability to gather the information it needs?
What are the content-management best-practice recommendations for getting the most out of the source files?
Well, there is a 2-million-token limit per file and 10 files in total, so you can upload 20M tokens’ worth of data to base your GPT on. If you wish to go for a commercial-level system, you need to switch to Assistants on the API side of things, or make use of vector database storage and retrieval to build a similar solution of enterprise grade.
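If you do go the Assistants route, a rough sketch with the openai Python SDK might look like the following. This uses the v1-era "retrieval" tool and file_ids parameter; later API versions renamed this to file_search with vector stores, so treat it as an outline and check the current docs. The file name and model string are just examples.

```python
# Hedged sketch of the Assistants-API route (openai Python SDK 1.x, Assistants v1 era).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a source document so the assistant can retrieve from it.
book = client.files.create(file=open("book_part_01.pdf", "rb"), purpose="assistants")

# Create an assistant that answers from the uploaded file.
assistant = client.beta.assistants.create(
    name="Book expert",
    instructions="Answer questions using the uploaded book excerpts.",
    model="gpt-4-turbo",
    tools=[{"type": "retrieval"}],
    file_ids=[book.id],
)
```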
In vector DB terms… I like the people over at ChromaDB, but if you are after a ready-to-roll commercial solution you have to take a look at Azure Retrievals and Pinecone.
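And for the self-hosted vector DB route, a minimal ChromaDB sketch, assuming its default embedding function; the collection name and chunk strings are placeholders for text you would extract from your own PDFs.

```python
# Minimal local vector storage and retrieval with ChromaDB.
import chromadb

client = chromadb.PersistentClient(path="./book_index")  # on-disk store
collection = client.get_or_create_collection("book_chunks")

# Pretend these chunks came out of your PDF-to-text step.
chunks = [
    "Chapter 1: The knowledge base accepts up to 2M tokens per file.",
    "Chapter 2: Smaller, topic-focused files tend to retrieve better.",
]
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# Pull back the chunks most relevant to a question.
results = collection.query(query_texts=["How large can one file be?"], n_results=2)
print(results["documents"])
```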
Did you know that ChatGPT doesn’t save any files or information once you close the current conversation? Not even one!
I was so happy to hear about the custom bots you can now create, and that you can upload some books to them. When I used that custom bot later, I couldn’t understand why the answers were so bad. I asked the bot if it remembered the books I had uploaded a while ago, and it didn’t!
Thanks to @Foxalabs and @robbar2015 for this conversation. It helped me understand best practice here.
So is this a good summary?
Convert all files to text files.
Per GPT Limits: 10 files
Per File Limits: 512MB (20MB for image files), 2M tokens
Per User Limits: 10 GB. Per Organization Limits: 100 GB.
Direct uploads to Knowledge are recommended for performance.
Separate content into smaller files for better search efficiency (see the splitting sketch after this list).
If the knowledge is frequently updated, do not upload the file. Instead, use a system to store the file or URL and create an OpenAPI endpoint to fetch the content via an Action (see the endpoint sketch after this list).
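For the convert-to-text and split-into-smaller-files points, here is a rough pypdf sketch; "book.pdf" and the 25-page chunk size are arbitrary examples, and chapter-aware splitting would need your own boundary logic.

```python
# Rough sketch: convert one large PDF into smaller plain-text files before uploading.
from pypdf import PdfReader

reader = PdfReader("book.pdf")
pages_per_file = 25
total = len(reader.pages)

for start in range(0, total, pages_per_file):
    end = min(start + pages_per_file, total)
    # Extract and join the text from this page range.
    text = "\n".join(reader.pages[i].extract_text() or "" for i in range(start, end))
    part = start // pages_per_file + 1
    with open(f"book_part_{part:02d}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```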
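And for the frequently-updated-knowledge point, a bare-bones FastAPI sketch of the fetch endpoint; the Action would consume the server’s auto-generated OpenAPI schema at /openapi.json, and the in-memory DOCS dict is a hypothetical stand-in for a real database or file store that you keep up to date.

```python
# Bare-bones "fetch content via an Action" endpoint.
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Knowledge fetcher")

# Hypothetical in-memory store; in practice this would be a database or file storage.
DOCS = {"chapter-1": "Full text of chapter 1 goes here..."}

@app.get("/documents/{doc_id}")
def get_document(doc_id: str) -> dict:
    """Return the current text for one document, so the GPT always sees fresh content."""
    if doc_id not in DOCS:
        raise HTTPException(status_code=404, detail="Unknown document id")
    return {"id": doc_id, "content": DOCS[doc_id]}
```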
I’ll be reviewing vector DB options in the coming week, as I am attempting to mass-index tens of thousands of URLs and PDFs for clients, so this could be a great path for sending knowledge efficiently to a GPT.
Following up here as well: the limit is 20 files per GPT. I am trying to get clarity on what data format would be most optimal; it is not clear to me right now that there is a difference between text and anything else. Behind the scenes, we have parsing libraries for different data formats, so assuming those libraries work well there should really be no difference, but I am confirming whether that is true.
Months later, I still don’t understand how this knowledge feature works… Today, after a month without testing, with the same instructions, my GPT no longer uses its knowledge.
I’m giving up on it.
It seems that the knowledge base files are only consulted occasionally, or on direct instruction to do so. The model isn’t fine-tuned on this additional data. So yeah, GPTs are fun toys to play with, but for a proper custom GPT this approach is useless.