How to best use GPTs with PDF files?

I have a custom GPT that uses PDF books as its source in the knowledge base. The GPT limits I’ve seen so far are 10 total uploads and 10 files per upload (usually zipped). A few questions:

  • To ensure the GPT gets all the information it needs from the source, what format is best? PDF, JSON, etc.?
  • What is the best way to get maximum performance? Upload to knowledge base? Upload to a webserver and use an API?
  • For the most thorough search, is it best to separate everything into files (e.g. by chapter), or can all PDFs be grouped into one large file without the GPT losing its ability to gather the information needed?
  • What are the content management best practices for getting the most out of the source files?

Thanks in Advance

3 Likes

You can have up to 2 million tokens worth of data per file.
You can have up to 20 files.
The most performant format is text.

The search system should chunk and search files automatically using the best methods.

3 Likes

So when you make your own GPT you can only upload 10 files? That doesn’t seem like much. What if you’d like to upload 500 pdf files?

Well, there is a 2 million token limit per file and 10 files in total, so you can upload 20M tokens worth of data and base your GPT on that. If you want a commercial-level system, you need to switch to Assistants on the API side of things, or use vector database storage and retrieval to build a similar, enterprise-grade solution.
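To make the arithmetic above concrete, here is a rough sketch of a budget check. The 4-characters-per-token ratio is only a rule of thumb for English text (an assumption, not an official figure), and the helper names are made up for illustration; the real count depends on the tokenizer.

```python
# Rough budget check against the limits quoted in this thread.
# Assumption: ~4 characters per token, a common rule of thumb for English text.
CHARS_PER_TOKEN = 4
MAX_TOKENS_PER_FILE = 2_000_000  # per-file limit quoted above
MAX_FILES = 10                   # file count quoted above (other posts say 20)


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a text from its length in characters."""
    return len(text) // CHARS_PER_TOKEN


def fits_budget(texts: list[str]) -> bool:
    """Return True if the texts respect both the file-count and per-file limits."""
    if len(texts) > MAX_FILES:
        return False
    return all(estimate_tokens(t) <= MAX_TOKENS_PER_FILE for t in texts)


# Total capacity under these assumptions: 2M tokens x 10 files.
total_budget = MAX_TOKENS_PER_FILE * MAX_FILES
```

If the file limit turns out to be 20 rather than 10, only `MAX_FILES` needs to change.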

1 Like

Yes, 10 separate file uploads, but you can have 10 files per upload.

Interesting. Of the two options, which do you recommend more? I was thinking of using Wasabi and an API.

In vector DB terms… I like the people over at ChromaDB, but if you are after a ready-to-roll commercial solution, take a look at Azure Retrievals and Pinecone.
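For anyone unsure what "vector database storage and retrieval" boils down to, here is a toy sketch of the idea in plain Python. It is not how ChromaDB, Pinecone, or any other product is implemented: the bag-of-words `embed` function stands in for a real embedding model, and the in-memory list stands in for the database. All names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Example data, invented for illustration.
chunks = [
    "Chapter 1 covers installation and setup.",
    "Chapter 2 explains file upload limits.",
    "Chapter 3 describes billing and invoices.",
]
best = top_k("what are the upload limits", chunks, k=1)
```

The retrieved chunks would then be passed to the model as context, which is the part a hosted vector store handles at scale.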

2 Likes

Did you know that ChatGPT doesn’t save any files or information once you close the current conversation? Not even one!

I was so happy to hear about the custom bots you can now create and also upload some books. When I used that custom bot later, I couldn’t understand why the answers were so bad - I just asked the bot if it remembered the books I uploaded a while ago, and it didn’t!

This is annoying!

Thanks to @Foxabilo and @robbar2015 for this conversation. It helped me understand best practice here.

So is this a good summary?

  1. Convert all files to text files.
  2. Per GPT Limits: 10 files
  3. Per File Limits: 512MB (20MB for image files), 2M tokens
  4. Per User Limits: 10 GB. Per Organization Limits: 100 GB.
  5. Direct uploads to Knowledge are recommended for performance.
  6. Separate content into smaller files for better search efficiency.
  7. If knowledge is frequently updated, do not upload the file. Instead, use a system to store the file or URL and create an OpenAPI endpoint to fetch the content via an Action.

Reference: File uploads with GPTs and Advanced Data Analysis in ChatGPT | OpenAI Help Center
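For point 6 in the summary, one possible way to break a converted text file into smaller parts is to group whole paragraphs up to a size cap. This is only a sketch: the function name and the paragraph-based strategy are my own assumptions, not something the Help Center prescribes, and splitting by chapter may suit a book better.

```python
def split_paragraphs(text: str, max_chars: int) -> list[str]:
    """Group paragraphs into parts of at most max_chars characters each.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    parts: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # +2 accounts for the blank line that joins paragraphs within a part
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            # Current part is full: close it and start a new one
            parts.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        parts.append("\n\n".join(current))
    return parts
```

Each returned part could then be written to its own .txt file. Pick `max_chars` so that the number of parts stays within the per-GPT file limit.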

I’ll be reviewing vector DB options in the coming week, as I am attempting to mass-index tens of thousands of URLs and PDFs for clients, so this could be a great way to send knowledge efficiently to a GPT.

7 Likes

Following up here as well: the limit is 20 files per GPT. I am trying to get clarity on which data format is optimal; it is not clear to me right now that there is a difference between text and anything else. Behind the scenes, we have parsing libraries for different data formats, so assuming those libraries work well, there should really be no difference, but I am confirming whether that is true.

4 Likes

Tested just now: 18 separate files with no errors, very good news! Is the size cap still set at 10 GB per user? (Plus subs)

It would be great to have insights and tips about the best format. In terms of end-user experience, I can confirm better response times and behavior with .txt.

Looking for confirmation.

Thank you!