"Unable to extract text from ..." error when uploading files for creating custom GPT

I’m trying to upload files to create a custom GPT. I can upload PDFs fine, but the larger challenge is that I have 3,000 blog posts that I want to upload. I used a Python script to combine the source files (it’s a Jekyll site) into a single file and tried uploading it into the GPT interface, but I get an error that says “Unable to extract text from {file name}”.
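For reference, my combining script does something roughly along these lines (a sketch; the standard Jekyll `_posts` layout and the `combine_posts` name are just illustrative):

```python
import glob
import os

def combine_posts(posts_dir, out_path):
    """Concatenate every Markdown post in posts_dir into one file,
    separated by blank lines, in filename (i.e. date) order."""
    paths = sorted(glob.glob(os.path.join(posts_dir, "*.md")))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                out.write(f.read().strip() + "\n\n")
    return len(paths)
```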

I couldn’t find the documentation for creating a custom GPT. What are the file size limitations (max size, max number)? Also, what’s the optimal encoding? HTML, Markdown, plain text? What kind of output will be best parsed? Should I generate all my blog posts into several massive PDFs to upload? Are multiple files better than single files? Why do I keep getting the error about being unable to extract text? It would be great to have some kind of guidance related to the file upload part of the custom GPT interface.


Hi and welcome to the Developer Forum!

GPTs can have 10 files max; I’m not sure of the file size limit at this moment in time.

File formats like Markdown, CSV, or the like are fine. If you have something like XML or JSON and the file is not correctly structured, that will cause file upload errors, so a simple CSV or HTML generated from a page view might be best.
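If you do want to upload structured formats, it’s worth checking they parse cleanly before uploading; a quick sketch (the helper name is just illustrative):

```python
import json
import xml.etree.ElementTree as ET

def is_well_formed(text, fmt):
    """Return True if text parses cleanly as the given format ('json' or 'xml')."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"unknown format: {fmt}")
        return True
    except (json.JSONDecodeError, ET.ParseError):
        return False
```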

Thanks! I tried converting all the source files to Markdown – still got the error. I tried stripping out all the HTML and Markdown formatting except for headings – still got the error.

BTW, is there official documentation anywhere on creating custom GPTs?


How big is the resulting file of the blog posts? You might have to split it.

I’m getting the same error from a single text file of 283 bytes (no other files uploaded), so I don’t think file size is the issue here.


It’s still pretty new, so there’s not too much documentation for ChatGPT Assistants, but use the Assistants API docs as your rough guide, as this is what ChatGPT GPTs use, just with fewer features.

Max file size is 512 MB but it should tell you if that is the issue.

OpenAI Platform.

Try passing a big chunk of it through the GPT chat and telling it to save it as a file, then see how it saves it. Maybe the HTML needs to be escaped in the file.

Surround the file in ``` on either side so it knows it’s a code block.
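Something like this trivial helper would do it (a sketch; the function name is just illustrative) before re-saving the file:

```python
def fence(text, lang=""):
    """Wrap text in triple-backtick markers so it reads as a code block."""
    body = text.rstrip("\n")
    return f"```{lang}\n{body}\n```\n"
```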

I tried various approaches:

  • all posts in one text file is surprisingly only 15 MB, but it leads to an error upon upload.
  • split out posts by year (starting in 2006): some upload fine, others do not.
  • tried converting HTML to Markdown: same error.
  • tried stripping out formatting: also an error.

I had my best luck in uploading a massive PDF (1,000 pages), so maybe that’s what I’ll try again.
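The splitting itself can be sketched generically as chunking on post boundaries under a size cap (the cap value below is just an assumption for illustration):

```python
def split_posts(posts, max_bytes):
    """Group posts (strings) into chunks whose UTF-8 size stays under max_bytes.
    A single post larger than max_bytes gets a chunk of its own."""
    chunks, current, size = [], [], 0
    for post in posts:
        n = len(post.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(post)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```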

I’m a little baffled by the lack of documentation for this feature. Thanks for the tip about using API Assistants docs.

It would be super helpful to have this nailed down; hopefully the ``` code block markers fix it. I am able to upload PDFs and HTML files all day long with no problems, but some files cause issues. It would be great to know the cause/solution/reasoning behind this.

I’m guessing the file is too big, even though it meets the file size requirements. For example, pasting this into Google Docs makes it hang. I’m going to try limiting the file to 2 MB and converting to PDF. Will report back.

Maybe you could try using LangChain to achieve this? Create the code in Python and add Chroma as a vector store. Then each time you need info from those files, LangChain will tokenize your info so you can query it later.

You can create this code (OpenAPI Schema & API call) using this GPT:

→ API Alchemist URL

The PDF approach worked. Here’s what I did:

  • used a Liquid script to generate all the post content as I built my Jekyll site locally. Note: the output was so large it wouldn’t load in a browser window.
  • converted the HTML to PDF using Prince, stripping out sidebars, headers, etc.
  • split the content into 1,000-page chunks using pdftk
  • uploaded the files (about 7 files)

Seems to work. Sometimes an upload would fail the first time, but work on the second upload attempt.


On Twitter someone said, “I learnt that knowledge files for custom GPTs shall be 8000 tokens. It accepts multiple files but recommends to combine info into 8000 tokens. 8000 tokens is roughly about 7000 words.” I’m not sure where that info came from, but this got me thinking. What if OpenAI added some documentation for one of their most talked about features at DevDay? I could be wrong, but I seem to recall that there’s a role of people who do just that – write documentation for features being released. There’s a name for the role. Can’t quite remember. … Just kidding, this is me being sarcastic, as I’m a technical writer. It’s a major flop for any company to release a feature of this scale with no documentation.


I had a similar error using the API: ‘message’: ‘Failed to index file: Error extracting text from file file-k6B5yWaA9ZnDKyoWfPxvIBz4, detail: File contains too may tokens. Max allowed tokens per file is 2000000.’ Seems like your blog posts would be more than 2 million tokens. Hope that helps.
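A crude way to check a file against that 2,000,000-token cap before uploading, using the rough rule of thumb of about 4 characters per token (an assumption; a real tokenizer like tiktoken would be more accurate):

```python
MAX_TOKENS = 2_000_000   # per-file cap from the API error message
CHARS_PER_TOKEN = 4      # rough English-text heuristic, not exact

def estimated_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_one_file(text):
    """True if the rough estimate stays under the per-file cap."""
    return estimated_tokens(text) <= MAX_TOKENS
```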

I don’t think the 8,000-token limit is true. Ada has an 8,000-token limit, but the idea with the new products, GPTs and the Assistants API, is that they chunk the file into ada-compatible chunk sizes and then index the vectors for retrieval; if that were not the case, it would not make sense.


I think “.txt” files are not supported.


I agree. I started with JSON and got a weird error I thought was related to tokenization, so I uploaded a .txt and got the same 2,000,000-token warning. Converted the .txt to a PDF and still got the error. (This was in the playground and API, FYI.)

Edit: I was only able to get this working by manually calculating token length and saving file parts under the limit. I’m working in Node; a sketch is below.

// productCatalog is an array of strings containing product info
const fs = require('fs');

const TOKEN_LIMIT = 1955000; // stay safely under the 2,000,000-token cap

let productsForCurrentFile = [];
let currentTotalTokenLength = 0;
let currentFileIndex = 1;

function flushFile() {
    const filePath = `training/productCatalog_${currentFileIndex}.txt`;
    fs.writeFileSync(filePath, productsForCurrentFile.join(''));
    productsForCurrentFile = [];
    currentTotalTokenLength = 0;
    currentFileIndex++;
}

for (let i = 0; i < productCatalog.length; i++) {
    const product = productCatalog[i];
    // buildProductInformationString: some custom function to produce the
    // information you want to have in the model
    const productString = buildProductInformationString(product);
    // rough estimate: total characters * (1/e) + a small buffer
    const productTokenLengthEstimate = (productString.length * 0.36788) + 2;
    if (currentTotalTokenLength + productTokenLengthEstimate > TOKEN_LIMIT) {
        flushFile(); // write out the current file before it exceeds the limit
    }
    productsForCurrentFile.push(productString);
    currentTotalTokenLength += productTokenLengthEstimate;
}
if (productsForCurrentFile.length > 0) flushFile(); // write the remainder
Just an example; you’d need to customize it for your own data (but that’s not what we’re here talking about).

I experienced the same issue when I uploaded txt files.
So I divided the file into 10+ parts and tested them one by one, and spotted the problematic part. When I excluded this part, there was no problem uploading the file.
But I don’t know why this part is causing the error. No clue.
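That divide-and-test process can be automated as a binary search over chunks, given some `upload_ok(text)` predicate standing in for an actual upload attempt (the predicate is hypothetical):

```python
def find_bad_chunk(chunks, upload_ok):
    """Binary-search a list of text chunks for the index of a chunk that
    fails upload_ok, assuming exactly one bad chunk. Returns None if the
    whole file passes."""
    if upload_ok("".join(chunks)):
        return None
    lo, hi = 0, len(chunks)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if upload_ok("".join(chunks[lo:mid])):
            lo = mid   # left half is fine, bad chunk is in the right half
        else:
            hi = mid   # bad chunk is in the left half
    return lo
```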

You can find the solution at wordpressjournal. Please check that.

It’s too big. I converted the txt to docx and got another error.

I was having the same issue with my .txt file: the GPT could not extract the text from it. What made it work afterwards was using the separator “~” instead of a comma “,”…
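With Python’s csv module, for instance, that’s just a matter of setting the delimiter (a sketch; the helper name is illustrative):

```python
import csv
import io

def to_tilde_separated(rows):
    """Render rows as tilde-separated values instead of comma-separated."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="~")
    writer.writerows(rows)
    return buf.getvalue()
```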

I hope this helps
