Failed to index file File contains too may tokens. Max allowed tokens per file is 2000000

Hi, what am I missing? The docs mention a 512 MB limit, but say nothing about tokens. What is wrong?

Basically I was just trying to create an assistant with the Retrieval tool. I specifically made the file less than 512 MB, tried it, and got this error.


Hi and welcome to the Developer Forum!

Not seen that message before. 2M tokens would equate to approximately 8 megabytes of pure text.
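
For anyone wondering where that figure comes from, here is the back-of-envelope version, assuming the common (but rough) heuristic of ~4 characters per token for English text:

max_tokens = 2_000_000
avg_chars_per_token = 4          # heuristic; varies with language and content
approx_bytes = max_tokens * avg_chars_per_token
print(approx_bytes / 1_000_000)  # ~8.0 MB of plain text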

I've seen that message in messages before.

Hi guys, thanks for fast response.

@_j hmm, thanks, but what does it mean, though? I need to upload huge files for retrieval; all 20 of them will be around 500 MB for sure. Is it a hard limit of the backend engine, or what?

It means that the documentation does not match the capabilities seen in practice.

If intentional, I expect that someone scratched their head hard about the backend costs of chunking and embedding half a gigabyte of data just so someone can ask some questions only informed by 0.001% of that upload.

text-embedding-ada-002 is $0.10 per megatoken without overlaps. Then they say "maximum 10 GB per assistant". Upload a DVD binary rip of "Office Space" as text files. Say hi to a chatbot. Delete. $500 of backend.
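
Rough math behind that figure, assuming binary-ish content packed into text tokenizes at roughly 2 bytes per token (much worse than the ~4 characters per token of ordinary English):

upload_bytes = 10 * 10**9                 # "maximum 10 GB per assistant"
bytes_per_token = 2                       # assumption for binary-as-text
tokens = upload_bytes / bytes_per_token   # ~5 billion tokens
cost = tokens / 1_000_000 * 0.10          # $0.10 per million tokens
print(f"${cost:,.0f}")                    # ~$500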


Yeah, same feelings. Don't get this monetization strategy; it also seems like they are quite weak at writing the API layer...


Same error with a 45 MB file; got around it by breaking it up into 12 MB pieces.
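
In case it helps, here is a minimal sketch of that kind of byte-based split for plain-text files (splits on line boundaries so nothing is cut mid-line; the filenames and the 12 MB target are just placeholders):

# Split a large UTF-8 text file into ~12 MB pieces, breaking on line boundaries.
chunk_bytes = 12 * 1024 * 1024

part, size, buf = 1, 0, []
with open("big_file.txt", encoding="utf-8") as src:      # placeholder filename
    for line in src:
        buf.append(line)
        size += len(line.encode("utf-8"))
        if size >= chunk_bytes:
            with open(f"big_file_part{part}.txt", "w", encoding="utf-8") as out:
                out.writelines(buf)
            part, size, buf = part + 1, 0, []

if buf:  # write whatever is left over
    with open(f"big_file_part{part}.txt", "w", encoding="utf-8") as out:
        out.writelines(buf)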

Getting the same error with 11.4MB json files. Lame.


You people are trying to upload a full Oxford Dictionary or what :sweat_smile:

It would be very helpful if the developers added the file size in tokens (not bytes) to "Files" in the Playground, for example.


As for me, I tried to use this tool to get abstracts of PDF files with pictures. Somehow, even when they are small in terms of character count, they still don't pass through.

If I am not wrong, the Retrieval tool isn't able to view pictures in the uploaded docs; maybe we can expect that when they bring support for Vision to Assistants?

As for now, my Assistant "fails" with a file of
5 564 905 bytes
(a readable PDF with pictures, i.e. the text is characters rather than images, but not in English).

But it works with a more picture-heavy file of
6 904 672 bytes
(the same type of data, but in English).


I have never been able to use Retrieval because none of my files make it past the token limit, even though they're < 10 MB. This is really unfortunate. Any idea how to circumvent this? Can I break my file down into multiple smaller files and expect it to work?


Yeah, you need to break down your files. There is also a max of 20 files per custom assistant. For example, if you have a JSON file with a bunch of tokens, the following code will do it.

import json
import os

# Maximum allowed tokens per file (the limit in the error message)
max_tokens_per_file = 2000000

# Maximum number of files per assistant
max_files = 20

# Your input file
input_file = "MY_FILE.json"

# Load the original JSON file (assumed to contain a list of items)
with open(input_file, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Split the data into chunks based on the maximum allowed tokens.
# The character count of each item is used as a stand-in for its token
# count; for mostly-English text this over-counts (a token averages ~4
# characters), so the resulting chunks stay comfortably under the limit.
chunks = []
current_chunk = []
current_tokens = 0

for item in data:
    item_tokens = len(json.dumps(item))

    if current_tokens + item_tokens > max_tokens_per_file:
        chunks.append(current_chunk)
        current_chunk = []
        current_tokens = 0

    current_chunk.append(item)
    current_tokens += item_tokens

# Add the last chunk
if current_chunk:
    chunks.append(current_chunk)

# Save each chunk as a separate JSON file, up to the 20-file cap
output_directory = 'output_files'
os.makedirs(output_directory, exist_ok=True)

for i, chunk in enumerate(chunks[:max_files]):
    output_file_path = os.path.join(output_directory, f'output_{i + 1}.json')
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        json.dump(chunk, output_file, ensure_ascii=False, indent=2)

files_written = min(len(chunks), max_files)
print(f'{files_written} files created in the {output_directory} directory.')
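
Note that len(json.dumps(item)) counts characters rather than tokens, so the chunks come out smaller than strictly necessary; if you want to pack closer to the 2M-token limit you could count with an actual tokenizer such as tiktoken instead (there is a small example of that further down the thread).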


I'm trying right now to break my PDFs into 500 KB chunks and make the Assistant read them. Haven't finished the project yet.
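
If it helps, here is a minimal sketch of one way to split a PDF with the pypdf library (it splits by page count rather than bytes, so adjust pages_per_part until each piece is small enough; the filenames are placeholders):

from pypdf import PdfReader, PdfWriter

reader = PdfReader("source.pdf")        # placeholder filename
pages_per_part = 50                     # tune until each part is small enough

for start in range(0, len(reader.pages), pages_per_part):
    writer = PdfWriter()
    for i in range(start, min(start + pages_per_part, len(reader.pages))):
        writer.add_page(reader.pages[i])
    with open(f"part_{start // pages_per_part + 1}.pdf", "wb") as out:
        writer.write(out)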

For context, part of the reason for the 2 million token limit is that retrieval performance really begins to degrade past that point, so paying for storage on more tokens than that is not really worth it (though we are working on it).

The best way to do this today is to split the files up into smaller chunks.


The easy way is to use third-party software to break PDFs into txt files. Software for working with ebooks (conversion, etc.) usually works better. You'll lose the pictures.
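
If you would rather stay in Python than install ebook tools, pypdf can do a rough text extraction as well (pictures are dropped either way; the filenames are placeholders):

from pypdf import PdfReader

reader = PdfReader("document.pdf")      # placeholder filename

# Concatenate the extracted text of every page; extract_text() comes back
# empty for image-only pages, so the "or ''" is just defensive.
text = "\n".join(page.extract_text() or "" for page in reader.pages)

with open("document.txt", "w", encoding="utf-8") as out:
    out.write(text)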

Can we close this topic?



There still seems to be an ongoing issue with this.

I have a 5 MB text file that totals about 1.3 million tokens (confirmed with the cl100k_base encoding in tiktoken), but when I try to add this document to my assistant it gives me the error saying the max tokens per file is 2,000,000. My file is nowhere near that large, so any idea why this would be happening?
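
For anyone who wants to run the same check on their own files, a minimal sketch of that token count (assumes a plain UTF-8 text file, tiktoken installed, and a placeholder filename):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("my_file.txt", encoding="utf-8") as f:    # placeholder filename
    text = f.read()

print(len(enc.encode(text)))  # total tokens under the cl100k_base encoding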