I want to create my own custom GPT that I can share with others via the GPT Store. I don't have a ChatGPT Plus subscription yet, and before subscribing I want to know one thing about custom GPT knowledge.
I want to train my custom GPT with thousands of articles, but what I read from online sources is that you can't upload more than 10 documents to one custom GPT. I want to know how to overcome this barrier and train my custom GPT on those thousands of articles.
I'm a beginner and would appreciate it if your answers were geared toward one.
That is a good question, one that isn't explicitly answered for ChatGPT, but we can guess that it follows the same pattern as its API cousin: 2 million extracted tokens per document (about 1 million words).
The highest quality comes when you are in control of the text: not sending PDF files or word-processor documents that must be (unreliably) processed to extract their text, but submitting curated plain text with sections and headings that the AI and its search can read directly.
I would watch the progress of other forum topics, where knowledge retrieval from GPT files, especially those proprietary formats, isn't currently performing to expectations.
Hopefully that simple tip gives you a better-performing GPT that can succeed when shared.
Hey, thanks for replying. I understand the importance of plain text from what you said, but how many plain-text files can I upload to train a custom GPT? Is it 10? That would be too few for what I was planning; I wanted to upload thousands of text files with articles in them. I'm a beginner and would appreciate it if you could keep it in layman's terms. Thank you for your time.
To go beyond the built-in document search abilities and limitations, you would really need to develop your own API on a web server that the AI can call upon via an action. But then, if your search is built on AI-embeddings semantic search using paid API services, you are paying the additional AI costs for everyone who uses your GPT, as a gift.
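To make that concrete, here is a minimal sketch of the kind of search backend a GPT action could call. This is not official OpenAI code; it assumes you have already embedded your article chunks into a local `articles.json` with the same embedding model, and the endpoint and file names are made up for illustration:

```python
# Minimal retrieval endpoint a GPT action could call (illustrative sketch only).
# Assumes articles.json holds [{"text": "...", "embedding": [...]}, ...] prepared
# earlier with the same embedding model.
import json

import numpy as np
from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY; you pay for every query your GPT's users make

with open("articles.json") as f:
    articles = json.load(f)
matrix = np.array([a["embedding"] for a in articles])  # one row per article chunk


@app.get("/search")
def search(q: str, k: int = 3) -> dict:
    """Embed the query, rank stored chunks by cosine similarity, return the top k."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=q)
    query = np.array(resp.data[0].embedding)
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    top = np.argsort(scores)[::-1][:k]
    return {"results": [articles[int(i)]["text"] for i in top]}
```

You would host something like that publicly, describe the `/search` action in your GPT's configuration, and the embeddings calls plus hosting are the ongoing cost you would be gifting to your users.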
Also, one has to consider whether the exact snippet containing an answer can even be found among 10 million words. Let's ask the AI to put that figure in perspective and see whether a GPT already meets your expectations:
To provide a tangible sense of how long 10 million words might be, let’s break it down using familiar references such as books and encyclopedias.
Average Book Length: The average length of a novel is roughly 80,000 to 100,000 words. Using 90,000 words as an average, 10 million words would be equivalent to about 111 novels of average length.
Encyclopedia Sets: The Encyclopædia Britannica, one of the most comprehensive encyclopedias, has about 40 million words across all its volumes. Therefore, 10 million words would be about 25% of the entire Encyclopædia Britannica set, or roughly equivalent to 8-10 volumes of a standard 32-volume set.
Famous Long Books: To give an example of particularly long books, “War and Peace” by Leo Tolstoy is about 587,287 words long, and “In Search of Lost Time” by Marcel Proust is around 1.2 million words in total. Therefore, 10 million words would be equivalent to approximately 17 copies of “War and Peace” or about 8 copies of “In Search of Lost Time.”
These examples should give you a sense of the sheer volume of content that 10 million words represent, equating to a substantial library of books or a significant portion of a comprehensive encyclopedia set.
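If you want to check those comparisons yourself, the arithmetic is simple (word counts are approximate):

```python
total_words = 10_000_000
print(total_words / 90_000)      # ~111 average-length novels
print(total_words / 40_000_000)  # ~25% of the Encyclopaedia Britannica's ~40M words
print(total_words / 587_287)     # ~17 copies of "War and Peace"
print(total_words / 1_200_000)   # ~8 copies of "In Search of Lost Time"
```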
Hey, I'm not a dev and definitely won't be creating APIs. You said it could be 1 million words, but I'm asking how many files (the number) one can upload to a custom GPT knowledge base. I read in these threads that it is 10-20, which is why I was wondering if I can do thousands of articles.
Here is one of the threads that triggered my question: https://community.openai.com/t/gpts-knowledge-capacity-limits/492955
help.openai.com is where you’ll find ChatGPT answers (this is primarily a developer forum). For example:
How many files can I upload at once per GPT?
Up to 20 files per GPT for the lifetime of that GPT.
What are the file upload size restrictions?
All files uploaded to a GPT or a ChatGPT conversation have a hard limit of 512MB per file.
All text and document files uploaded to a GPT or to a ChatGPT conversation are capped at 2M tokens per file. This limitation does not apply to spreadsheets.
For images, there’s a limit of 20MB per image.
Additionally, there are usage caps:
– Each end-user is capped at 10GB.
– Each organization is capped at 100GB.
– Note: An error will be displayed if a user/org cap has been hit.
Hey, thanks! I read this 20-file point earlier, but I was wondering if I could somehow upload thousands of text files about a specific subject to train a custom GPT on them. Is there some beginner-friendly method that would indirectly let me upload or train my custom GPT on this huge corpus of text?
I intend to make it public and shareable through the GPT Store.
The AI model cannot be “trained” in the sense of traditional AI fine-tuning.
All ChatGPT Plus users get the same normal GPT-4 AI model, upon which the instructions of a GPT are placed.
Retrieval of document knowledge (anything larger than the thousand words or so that is injected automatically) is done by a search function. The AI basically writes a query, much as it would to a web search engine, and gets the top results back. If a citation looks like more needs to be read for the AI to understand fully, ChatGPT can "click" to scroll through more of the document at that point behind the scenes.
So documents don't train an AI; they provide on-demand augmentation: search queries to find whether there is more knowledge available to fulfill a user's input.
Hey, thank you for continuing to reply. I was not using the word "trained" in that sense. I just want ChatGPT to use the text files that I provide and act as an expert in that domain. For example, after feeding it a lot of political science research papers, I want to make a custom GPT that is a political-scientist GPT. So I just want to know how many files I can upload with the goal of creating such a custom GPT. Of course, this can apply to any field: math, physics, literature, etc.
Edit: Another example. I saw Khan Academy's custom GPT in the store; they must have used their tutorials to create it. How were they able to upload that amount of data to create their custom GPT?
The file count is 20.
The token count per file is 2M.
The ideal strategy is to combine like data into a continuous text file up to the maximum.
AI "tokens" can be counted on a site like the OpenAI tokenizer or the Vercel tokenizer, so you can get an idea of the number of characters per token (about 4.2 in English) and whether a file is long enough to face rejection upon upload.
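If you'd rather script that than paste into a web tokenizer, here is a rough sketch using the tiktoken library. The cl100k_base encoding, folder name, and output file names are my assumptions, not anything OpenAI documents for GPT uploads; the point is just to pack many small articles into a few large plain-text files under the 2M-token cap:

```python
# Merge many small article files into a few large plain-text knowledge files,
# each kept under the reported 2M-token-per-file cap.
from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed; the exact ChatGPT tokenizer isn't published
MAX_TOKENS = 2_000_000

batch, batch_tokens, part = [], 0, 1
for article in sorted(Path("articles").glob("*.txt")):
    # A clear heading per article helps the GPT's search land on the right passage.
    chunk = f"## {article.stem}\n\n{article.read_text(encoding='utf-8')}"
    tokens = len(enc.encode(chunk))
    if batch and batch_tokens + tokens > MAX_TOKENS:
        Path(f"knowledge_{part:02d}.txt").write_text("\n\n".join(batch), encoding="utf-8")
        batch, batch_tokens, part = [], 0, part + 1
    batch.append(chunk)
    batch_tokens += tokens

if batch:
    Path(f"knowledge_{part:02d}.txt").write_text("\n\n".join(batch), encoding="utf-8")
```

Joining articles with blank lines adds a few extra tokens, so leave a little headroom rather than running right up to the limit.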
Sophisticated solutions simply do not use the consumer-oriented, all-purpose ChatGPT methods alone. Developer products meant for wider audiences are not GPTs, which are available only to Plus subscribers (to OpenAI's benefit).
Hi Freddy, glad to meet you here. I have been looking for the same answer for my use case. May I know if you were able to find a solution? Good point on Khan Academy; I would also say the research-based GPTs must have been built on some superior logic or capability. Happy to connect if you would like to. My email is rajesh.lakka@gmail.com.
Or could we use a smart retrieval function from a database in the backend?
Train the GPT to get the gist of what the user is requesting, send those variables to the database in the backend to retrieve content, and then have your custom GPT analyze it. Of course, that creates another challenge: the input token count is 128k, so if I am retrieving multiple files I am just defeating the purpose of having a large knowledge base. Is there a way for files to be temporarily stored in the GPT during retrieval, so that whatever data is fetched from the backend is first uploaded into the files?
Is there any way to upload more than 20 documents into my personal GPT created with ChatGPT-4? Sorry, but I'm not technical enough to know how to create an API to the files on my laptop or to blogs I've written on a website. Thanks.