Tokenizer used by the Assistants API for chunking is not correct!

Hello - what tokenizer is used by an Assistants API assistant created with GPT-4o? When I retrieve the file chunks and check the token count, it is always less than the size set in my static chunking strategy.

For example, if I set the static chunking parameter to 1000 tokens and look at the chunks returned by file search, each chunk comes out at fewer than 1000 tokens when I check it on this site: https://platform.openai.com/tokenizer
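
For reference, this is roughly how I set the chunking parameters (a sketch, not my exact code: the IDs are placeholders, the overlap value is just illustrative, and the exact client path may be `client.vector_stores` or `client.beta.vector_stores` depending on SDK version):

```python
from openai import OpenAI

client = OpenAI()

# Attach a file to a vector store with a static chunking strategy of
# 1000-token chunks (placeholder IDs; overlap value is illustrative).
client.beta.vector_stores.files.create(
    vector_store_id="my_vector_store_id",
    file_id="my_file_id",
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 1000,
            "chunk_overlap_tokens": 0,
        },
    },
)
```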

Please HELP !!!

The GPT-4o model uses the o200k_base tokenizer.
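
If you want to count tokens the same way locally, here is a quick sketch with the tiktoken library (also make sure the web tokenizer page is set to GPT-4o, i.e. o200k_base, or the counts won't match):

```python
import tiktoken

# GPT-4o family uses the o200k_base encoding; older GPT-4 models use cl100k_base.
enc = tiktoken.get_encoding("o200k_base")

chunk_text = "paste a retrieved chunk here"
print(len(enc.encode(chunk_text)), "tokens under o200k_base")
```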

Note that the Assistants API may use additional tokens for context if you have stored files that it uses for retrieval.


Any idea what this additional token context is? I am doing Contextual Retrieval as a preprocessing step: I add context to each chunk and pad the combined context and chunk to exactly 1000 tokens. Then, when I do static chunking with the same count and run file search, the returned chunks are all messed up.
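
For context, the padding I'm doing is along these lines (a rough sketch, not my exact code; the filler string and sample text are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def pad_to_token_count(text: str, target: int = 1000, filler: str = " pad") -> str:
    """Append filler text until the whole string encodes to at least `target` tokens.

    Exact targets are awkward with BPE: re-encoding padded text can merge
    tokens, so this pads up to the target and then checks the final count.
    """
    if len(enc.encode(text)) > target:
        raise ValueError("text already exceeds the target token count")
    while len(enc.encode(text)) < target:
        text += filler
    return text

combined = "Context: section 2 of the annual report.\n\n" + "The chunk body goes here."
padded = pad_to_token_count(combined, target=1000)
print(len(enc.encode(padded)))  # 1000, possibly one over if the last filler spanned two tokens
```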

With any luck, the chunking strategy takes some account of document structure or content when deciding where to split. A lack of such quality may be one reason the behavior went undisclosed for so long.

Splitting at actual “tokens” is likely only estimated rather than exact, because you can send the same vector store to either gpt-4-turbo with its cl100k_base tokenizer or gpt-4o with its o200k_base tokenizer – are they going to build a separate data store for each model?
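
A small sketch of why an exact token count can't hold for both models at once (the chunk text is arbitrary):

```python
import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")  # gpt-4 / gpt-4-turbo family
o200k = tiktoken.get_encoding("o200k_base")    # gpt-4o family

chunk = "Example chunk text pulled back from file search..."
# The same stored chunk measures differently under each tokenizer,
# so a "1000 token" chunk cannot be exact for both models at once.
print("cl100k_base tokens:", len(cl100k.encode(chunk)))
print("o200k_base tokens: ", len(o200k.encode(chunk)))
```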

Token-level splitting is also hard because truncating byte sequences can produce invalid Unicode at the start or end of a chunk. At best, sending those invalid bytes to the AI as byte tokens gives it junk it can ignore. And that is on top of words being chopped in half.
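
You can see this by inspecting each token's raw bytes with tiktoken: with less common characters, some tokens carry only part of a character's UTF-8 byte sequence, so cutting the token list at that point yields invalid Unicode. A small sketch (the sample string is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
text = "chunk boundary test 🧩𝔘नमस्ते"
tokens = enc.encode(text)

# Some tokens are partial UTF-8 sequences; truncating the token list
# at such a token leaves invalid bytes at the chunk edge.
for tok in tokens:
    raw = enc.decode_single_token_bytes(tok)
    try:
        raw.decode("utf-8")
        status = "valid UTF-8 on its own"
    except UnicodeDecodeError:
        status = "partial byte sequence (invalid alone)"
    print(tok, raw, status)
```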

So I would not expect truly exact token-count chunks out of this system.
