"Unable to extract text from ..." error when uploading files for creating custom GPT

I’m trying to upload files to create a custom GPT. I can upload PDFs fine, but the larger challenge is that I have 3,000 blog posts that I want to upload. I used a Python script to combine the source files (it’s a Jekyll site) into a single file and tried uploading it into the GPT interface, but I get an error that says “Unable to extract text from {file name}”.
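For reference, my combining script does something roughly along these lines (a sketch; the standard Jekyll `_posts` layout and the `combine_posts` name are just illustrative):

```python
import glob
import os

def combine_posts(posts_dir, out_path):
    """Concatenate every Markdown post in posts_dir into one file,
    separated by blank lines, in filename (i.e. date) order."""
    paths = sorted(glob.glob(os.path.join(posts_dir, "*.md")))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                out.write(f.read().strip() + "\n\n")
    return len(paths)
```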

I couldn’t find the documentation for creating a custom GPT. What are the file size limitations (max size, max number)? Also, what’s the optimal encoding? HTML, Markdown, plain text? What kind of output will be best parsed? Should I generate all my blog posts into several massive PDFs to upload? Are multiple files better than single files? Why do I keep getting the error about being unable to extract text? It would be great to have some kind of guidance related to the file upload part of the custom GPT interface.


Hi and welcome to the Developer Forum!

GPTs can have 10 files max; I’m not sure of the file size limit at this moment in time.

File formats like Markdown, CSV, or the like are fine. If you have something like XML or JSON and the file is not correctly structured, that will cause file upload errors, so a simple CSV or HTML generated from a page view might be best.
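If you do want to upload structured formats, it’s worth checking they parse cleanly before uploading; a quick sketch (the helper name is just illustrative):

```python
import json
import xml.etree.ElementTree as ET

def is_well_formed(text, fmt):
    """Return True if text parses cleanly as the given format ('json' or 'xml')."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"unknown format: {fmt}")
        return True
    except (json.JSONDecodeError, ET.ParseError):
        return False
```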

Thanks! I tried converting all the source files to Markdown – still got the error. I tried stripping out all the HTML and Markdown formatting except for headings – still got the error.

BTW, is there official documentation anywhere on creating custom GPTs?


How big is the resulting file of the blog posts? You might have to split it.

I’m getting the same error from a single text file of 283 bytes (no other files uploaded), so I don’t think file size is the issue here.


It’s still pretty new, so there’s not too much documentation for ChatGPT Assistants, but use the Assistants API docs as your rough guide, as this is what ChatGPT GPTs use, just with fewer features.

Max file size is 512 MB but it should tell you if that is the issue.

OpenAI Platform.

Try passing a big chunk of it through the GPT chat and telling it to save it as a file, then see how it saves it. Maybe the HTML needs to be escaped in the file.

Surround the file in ``` on either side so it knows it’s a code block.
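Something like this trivial helper would do it (a sketch; the function name is just illustrative) before re-saving the file:

```python
def fence(text, lang=""):
    """Wrap text in triple-backtick markers so it reads as a code block."""
    body = text.rstrip("\n")
    return f"```{lang}\n{body}\n```\n"
```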

I tried various approaches:

  • all posts in one text file is surprisingly only 15 MB, but it leads to an error upon upload.
  • split out posts by year (starting in 2006): some upload fine, others do not.
  • tried converting HTML to Markdown: same error.
  • tried stripping out formatting: also an error.

I had my best luck in uploading a massive PDF (1,000 pages), so maybe that’s what I’ll try again.
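The splitting itself can be sketched generically as chunking on post boundaries under a size cap (the cap value below is just an assumption for illustration):

```python
def split_posts(posts, max_bytes):
    """Group posts (strings) into chunks whose UTF-8 size stays under max_bytes.
    A single post larger than max_bytes gets a chunk of its own."""
    chunks, current, size = [], [], 0
    for post in posts:
        n = len(post.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(post)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```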

I’m a little baffled by the lack of documentation for this feature. Thanks for the tip about using API Assistants docs.

It would be super helpful to have this nailed down; hopefully the ``` code block markers fix it. I am able to upload PDFs and HTML files all day long with no problems, but some files cause issues. It would be great to know the cause/solution/reasoning behind this.

I’m guessing the file is too big, even though it meets the file size requirements. For example, pasting this into Google Docs makes it hang. I’m going to try limiting the file to 2 MB and converting to PDF. Will report back.

Maybe you could try using LangChain to achieve this? Create the code in Python and add Chroma as a vector store. Then each time you need info from those files, LangChain will tokenize your info so you can query it later.

You can create this code (OpenAPI Schema & API call) using this GPT:

→ API Alchemist URL

The PDF approach worked. Here’s what I did:

  • used a Liquid script to generate all the post content as I built my Jekyll site locally. Note: the output was so large it wouldn’t load in a browser window.
  • converted the HTML to PDF using Prince, stripping out sidebars, headers, etc.
  • split the content into 1,000-page chunks using pdftk
  • uploaded the files (about 7 files)

Seems to work. Sometimes an upload would fail the first time, but work on the second upload attempt.


On Twitter someone said, “I learnt that knowledge files for custom GPTs shall be 8000 tokens. It accepts multiple files but recommends to combine info into 8000 tokens. 8000 tokens is roughly about 7000 words.” I’m not sure where that info came from, but this got me thinking. What if OpenAI added some documentation for one of their most talked about features at DevDay? I could be wrong, but I seem to recall that there’s a role of people who do just that – write documentation for features being released. There’s a name for the role. Can’t quite remember. … Just kidding, this is me being sarcastic, as I’m a technical writer. It’s a major flop for any company to release a feature of this scale with no documentation.


I had a similar error using the API: ‘message’: ‘Failed to index file: Error extracting text from file file-k6B5yWaA9ZnDKyoWfPxvIBz4, detail: File contains too may tokens. Max allowed tokens per file is 2000000.’ Seems like your blog posts would be more than 2 million tokens. Hope that helps.
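A crude way to check a file against that 2,000,000-token cap before uploading, using the rough rule of thumb of about 4 characters per token (an assumption; a real tokenizer like tiktoken would be more accurate):

```python
MAX_TOKENS = 2_000_000   # per-file cap from the API error message
CHARS_PER_TOKEN = 4      # rough English-text heuristic, not exact

def estimated_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_one_file(text):
    """True if the rough estimate stays under the per-file cap."""
    return estimated_tokens(text) <= MAX_TOKENS
```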

I don’t think the 8,000-token limit is true. Ada has an 8,000-token limit, but the idea with the new products, GPTs and the Assistants API, is that they chunk the file into ada-compatible chunk sizes and then index the vectors for retrieval; if that were not the case, it would not make sense.


I think “.txt” files are not supported.


I agree. I started with JSON and got a weird error I thought was related to tokenization, so I uploaded a .txt and got the same 2,000,000-token warning. Converted the .txt to a PDF and still got the error. (This was in the playground and API, FYI.)

Edit: I was only able to get this working by manually calculating token length and saving file parts under the limit. I’m working in Node; a sketch is below.

// productCatalog is an array of strings containing product info
const fs = require('fs');

const TOKEN_LIMIT = 1955000; // stay safely under the 2,000,000-token cap

let productsForCurrentFile = [];
let currentTotalTokenLength = 0;
let currentFileIndex = 1;

function flushFile() {
    const filePath = `training/productCatalog_${currentFileIndex}.txt`;
    fs.writeFileSync(filePath, productsForCurrentFile.join(''));
    productsForCurrentFile = [];
    currentTotalTokenLength = 0;
    currentFileIndex++;
}

for (let i = 0; i < productCatalog.length; i++) {
    const product = productCatalog[i];
    // buildProductInformationString: some custom function to produce the
    // information you want to have in the model
    const productString = buildProductInformationString(product);
    // rough estimate: total characters * (1/e) + a small buffer
    const productTokenLengthEstimate = (productString.length * 0.36788) + 2;
    if (currentTotalTokenLength + productTokenLengthEstimate > TOKEN_LIMIT) {
        flushFile(); // write out the current file before it exceeds the limit
    }
    productsForCurrentFile.push(productString);
    currentTotalTokenLength += productTokenLengthEstimate;
}
if (productsForCurrentFile.length > 0) flushFile(); // write the remainder
Just an example; you’d need to customize it for your own data (but that’s not what we’re here talking about).

I experienced the same issue when I uploaded txt files.
So I divided the file into 10+ parts and tested them one by one, and spotted the problematic part. When I excluded this part, there was no problem uploading the file.
But I don’t know why this part is causing the error. No clue.
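That divide-and-test process can be automated as a binary search over chunks, given some `upload_ok(text)` predicate standing in for an actual upload attempt (the predicate is hypothetical):

```python
def find_bad_chunk(chunks, upload_ok):
    """Binary-search a list of text chunks for the index of a chunk that
    fails upload_ok, assuming exactly one bad chunk. Returns None if the
    whole file passes."""
    if upload_ok("".join(chunks)):
        return None
    lo, hi = 0, len(chunks)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if upload_ok("".join(chunks[lo:mid])):
            lo = mid   # left half is fine, bad chunk is in the right half
        else:
            hi = mid   # bad chunk is in the left half
    return lo
```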

You can find the solution at wordpressjournal. Please check that.

It’s too big. I converted the txt to docx and got another error.

I was having the same issue with my .txt file: the GPT could not extract the text from it. What made it work afterwards was using the separator “~” instead of a comma “,”…
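With Python’s csv module, for instance, that’s just a matter of setting the delimiter (a sketch; the helper name is illustrative):

```python
import csv
import io

def to_tilde_separated(rows):
    """Render rows as tilde-separated values instead of comma-separated."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="~")
    writer.writerows(rows)
    return buf.getvalue()
```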

I hope this helps
