Failed to index file File contains too many tokens. Max allowed tokens per file is 2000000

Hi, what am I missing? The docs say there is a 512 MB limit, but nothing about tokens. What is wrong?

Basically, I was just trying to create an assistant with the Retrieval tool. I made sure the file was under 512 MB, tried it, and got this error.

Hi and welcome to the Developer Forum!

I have not seen that message before. 2M tokens would equate to approximately 8 megabytes of pure text.
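That is based on the usual rule of thumb of roughly 4 characters of plain English text per token. If you want to check an actual file before uploading, here is a minimal sketch, assuming tiktoken is installed and the file is plain text (the path is just a placeholder):

import tiktoken

# Count tokens the way recent OpenAI models do (cl100k_base encoding)
enc = tiktoken.get_encoding("cl100k_base")

with open("my_document.txt", "r", encoding="utf-8") as f:
    text = f.read()

num_tokens = len(enc.encode(text))
num_bytes = len(text.encode("utf-8"))
print(f"{num_tokens:,} tokens in {num_bytes:,} bytes "
      f"(~{num_bytes / max(num_tokens, 1):.1f} bytes per token)")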

I've seen that message in messages before.

Hi guys, thanks for the fast responses.

@_j Hmm, thanks, but what does it mean, though? I need to upload huge files for retrieval; all 20 will be around 500 MB each for sure. Is it a hard limit of the backend engine, or what?

It means that the documentation does not match the capabilities seen in practice.

If intentional, I expect that someone scratched their head hard about the backend costs of chunking and embedding half a gigabyte of data just so someone can ask some questions only informed by 0.001% of that upload.

text-embedding-ada-002 is $0.10 per million tokens without overlaps. Then they say "maximum 10 GB per assistant". Upload a DVD binary rip of "Office Space" as text files. Say hi to a chatbot. Delete. $500 of backend.
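Back-of-envelope math behind that figure, under my own assumptions (about 4 bytes of plain text per token, and a guess that chunk overlaps roughly double the embedded volume; the actual pipeline is not public):

# Rough cost sketch for embedding 10 GB of plain text with text-embedding-ada-002
bytes_uploaded = 10 * 1024**3          # "maximum 10 GB per assistant"
tokens = bytes_uploaded / 4            # assumption: ~4 bytes of text per token
price_per_token = 0.0001 / 1000        # $0.10 per million tokens
overlap_factor = 2                     # guess: overlapping chunks double the volume

print(f"~{tokens / 1e6:.0f}M tokens, "
      f"~${tokens * price_per_token:.0f} without overlap, "
      f"~${tokens * price_per_token * overlap_factor:.0f} with overlap")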

Yeah, same feelings. I don't get this monetization strategy; it also seems like they are quite weak at writing the API layer…

Same error with a 45 MB file; I got around it by breaking it up into 12 MB pieces.

Getting the same error with 11.4 MB JSON files. Lame.

Are you people trying to upload the full Oxford Dictionary or what? :sweat_smile:

It would be very helpful if the developers added the size of a file in tokens (not bytes) to "Files" in the Playground, for example.

As for me, I tried to use this tool to get abstracts for PDF files with pictures. Somehow, even when they are small in terms of character count, they still don't go through.

If I'm not wrong, the Retrieval tool isn't able to view pictures in the uploaded docs. Maybe we can expect this when they bring support for Vision to Assistants?

As of now, my Assistant "fails" with a file of
5 564 905 bytes
(a readable PDF with pictures, i.e. the text is characters rather than images, but not in English).

But it works with a more picture-heavy file of
6 904 672 bytes
(the same type of data, but in English).

I have never been able to use Retrieval because none of my files make it past the token limit, even though they're < 10 MB. This is really unfortunate. Any idea how to circumvent this? Can I break my file down into multiple smaller files and expect it to work?

Yeah, you need to break down your files. There is also a maximum of 20 files per custom assistant. For example, if you have a JSON file with a lot of tokens, the following code will do it.

import json
import os

# Maximum allowed tokens per file (the limit from the error message)
max_tokens_per_file = 2000000

# Maximum number of files per assistant
max_files = 20

# Your input file
input_file = "MY_FILE.json"

# Load the original JSON file (assumed to be a list of items)
with open(input_file, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Split the data into chunks that stay under the limit.
# len(json.dumps(item)) is a character count, used here as a rough,
# conservative stand-in for the token count (a token is usually several
# characters, so this overestimates and keeps chunks safely small).
chunks = []
current_chunk = []
current_tokens = 0

for item in data:
    item_tokens = len(json.dumps(item))

    if current_chunk and current_tokens + item_tokens > max_tokens_per_file:
        chunks.append(current_chunk)
        current_chunk = []
        current_tokens = 0

    current_chunk.append(item)
    current_tokens += item_tokens

# Add the last chunk
if current_chunk:
    chunks.append(current_chunk)

# Save each chunk as a separate JSON file (at most max_files of them)
output_directory = 'output_files'
os.makedirs(output_directory, exist_ok=True)

for i, chunk in enumerate(chunks[:max_files]):
    output_file_path = os.path.join(output_directory, f'output_{i + 1}.json')
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        json.dump(chunk, output_file, ensure_ascii=False, indent=2)

num_files = min(len(chunks), max_files)
print(f'{num_files} files created in the {output_directory} directory.')
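If you want real token counts instead of the character-count proxy above, one option (my suggestion, assuming tiktoken is installed) is a small helper like this and using it in place of len(json.dumps(item)) inside the loop:

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models

def count_tokens(item) -> int:
    # Drop-in replacement for the character-count line in the loop above
    return len(enc.encode(json.dumps(item, ensure_ascii=False)))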

I'm currently trying to break my PDFs into 500 KB chunks and have the Assistant read them. I haven't finished the project yet.

For context, part of the reason for the 2 million token limit is that retrieval performance really begins to degrade after that point, so paying for storage on more tokens than that is not really worth it (though we are working on it).

The best way to do this today is to split the files up into smaller chunks.

The easy way is to use third-party software to break PDFs into TXT files. Usually, software for working with ebooks (conversion, etc.) works better. You'll lose the pictures.
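If you'd rather stay in Python, here is a minimal sketch using the pypdf library (my choice of tool; any extractor works, and the file names are placeholders) that pulls out the text and writes it in roughly 500 KB TXT chunks. Pictures are lost either way:

from pypdf import PdfReader

reader = PdfReader("my_document.pdf")   # placeholder input path
chunk_bytes = 500_000                   # ~500 KB of UTF-8 text per output file
buffer, part = "", 1

for page in reader.pages:
    buffer += (page.extract_text() or "") + "\n"
    if len(buffer.encode("utf-8")) >= chunk_bytes:
        with open(f"my_document_part{part}.txt", "w", encoding="utf-8") as out:
            out.write(buffer)
        buffer, part = "", part + 1

# Write whatever text is left over
if buffer:
    with open(f"my_document_part{part}.txt", "w", encoding="utf-8") as out:
        out.write(buffer)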

Can we close this topic?


There still seems to be an ongoing issue with this.

I have a 5 MB text file that totals about 1.3 million tokens (confirmed with the cl100k_base encoding in tiktoken), but when I try to add this document to my assistant it gives me the error saying the max tokens per file is 2,000,000, even though my file is nowhere near that large. Any idea why this would be happening?
