Tokenizer used by the Assistants API for chunking is not correct!

Hello - what tokenizer is used by an Assistants API assistant created with GPT-4o? When I retrieve the file chunks and check the token count, it is always less than the size set in my static chunking strategy.

For example, if I set the static chunking parameter to 1000 tokens and look at the chunks returned by file search, each chunk comes out at fewer than 1000 tokens when I check it on this site: https://platform.openai.com/tokenizer
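
For reference, this is roughly how I set the chunking parameters (a sketch, not my exact code: the IDs are placeholders, the overlap value is just illustrative, and the exact client path may be `client.vector_stores` or `client.beta.vector_stores` depending on SDK version):

```python
from openai import OpenAI

client = OpenAI()

# Attach a file to a vector store with a static chunking strategy of
# 1000-token chunks (placeholder IDs; overlap value is illustrative).
client.beta.vector_stores.files.create(
    vector_store_id="my_vector_store_id",
    file_id="my_file_id",
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 1000,
            "chunk_overlap_tokens": 0,
        },
    },
)
```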

Please HELP !!!

The GPT-4o model uses the o200k_base tokenizer.
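
If you want to count tokens the same way locally, here is a quick sketch with the tiktoken library (also make sure the web tokenizer page is set to GPT-4o, i.e. o200k_base, or the counts won't match):

```python
import tiktoken

# GPT-4o family uses the o200k_base encoding; older GPT-4 models use cl100k_base.
enc = tiktoken.get_encoding("o200k_base")

chunk_text = "paste a retrieved chunk here"
print(len(enc.encode(chunk_text)), "tokens under o200k_base")
```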

Note that the Assistants API may use additional tokens for context if you have stored files that it uses for retrieval.


Any idea what this additional token context is? I am doing Contextual Retrieval as a preprocessing step: I add context to each chunk and pad the combined context and chunk to exactly 1000 tokens. Then, when I do static chunking with the same count and run file search, the returned chunks are all messed up.
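
For context, the padding I'm doing is along these lines (a rough sketch, not my exact code; the filler string and sample text are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def pad_to_token_count(text: str, target: int = 1000, filler: str = " pad") -> str:
    """Append filler text until the whole string encodes to at least `target` tokens.

    Exact targets are awkward with BPE: re-encoding padded text can merge
    tokens, so this pads up to the target and then checks the final count.
    """
    if len(enc.encode(text)) > target:
        raise ValueError("text already exceeds the target token count")
    while len(enc.encode(text)) < target:
        text += filler
    return text

combined = "Context: section 2 of the annual report.\n\n" + "The chunk body goes here."
padded = pad_to_token_count(combined, target=1000)
print(len(enc.encode(padded)))  # 1000, possibly one over if the last filler spanned two tokens
```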

With any luck, the chunking strategy takes some account of document structure or content when deciding where to split. A lack of such quality may be one reason the behavior went undisclosed for so long.

Splitting at actual “tokens” is likely only estimated rather than exact, because you can send the same vector store to either gpt-4-turbo with its cl100k_base tokenizer or gpt-4o with its o200k_base tokenizer – are they going to build a separate data store for each model?
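
A small sketch of why an exact token count can't hold for both models at once (the chunk text is arbitrary):

```python
import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")  # gpt-4 / gpt-4-turbo family
o200k = tiktoken.get_encoding("o200k_base")    # gpt-4o family

chunk = "Example chunk text pulled back from file search..."
# The same stored chunk measures differently under each tokenizer,
# so a "1000 token" chunk cannot be exact for both models at once.
print("cl100k_base tokens:", len(cl100k.encode(chunk)))
print("o200k_base tokens: ", len(o200k.encode(chunk)))
```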

Token-level splitting is also hard because truncating byte sequences can produce invalid Unicode at the start or end of a chunk. At best, sending those invalid bytes to the AI as byte tokens gives it junk it can ignore. And that is on top of words being chopped in half.
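
You can see this by inspecting each token's raw bytes with tiktoken: with less common characters, some tokens carry only part of a character's UTF-8 byte sequence, so cutting the token list at that point yields invalid Unicode. A small sketch (the sample string is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
text = "chunk boundary test 🧩𝔘नमस्ते"
tokens = enc.encode(text)

# Some tokens are partial UTF-8 sequences; truncating the token list
# at such a token leaves invalid bytes at the chunk edge.
for tok in tokens:
    raw = enc.decode_single_token_bytes(tok)
    try:
        raw.decode("utf-8")
        status = "valid UTF-8 on its own"
    except UnicodeDecodeError:
        status = "partial byte sequence (invalid alone)"
    print(tok, raw, status)
```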

So I would not expect truly exact token-count chunks out of this system.
