Hello - which tokenizer is used by an Assistants API assistant created with GPT-4o? When I retrieve the file chunks and check their token count, it is always less than the size specified in my static chunking strategy.
For example, if I set the static chunking parameter to 1000 tokens and then look at a chunk returned by file search, it comes out at fewer than 1000 tokens when I count it with this site: https://platform.openai.com/tokenizer
Any idea what this additional token overhead is? I am doing Contextual Retrieval during preprocessing: I prepend a generated context to each chunk and pad the pair to exactly 1000 tokens. But when I then run static chunking with the same count and do a file search, the returned chunks are all messed up.
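For reference, this is roughly what my padding step looks like (a minimal sketch, assuming GPT-4o uses the o200k_base encoding via tiktoken; the filler string " ." is just an example):

```python
# Minimal sketch of the preprocessing described above, assuming GPT-4o's
# o200k_base encoding via tiktoken. Pads context + chunk out to exactly
# 1000 tokens in token space; the filler " ." is just an example.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def pad_to_target(context: str, chunk: str, target: int = 1000) -> str:
    tokens = enc.encode(context + "\n" + chunk)
    if len(tokens) > target:
        raise ValueError("context + chunk already exceeds the target size")
    filler = enc.encode(" .")
    while len(tokens) < target:
        tokens.extend(filler)
    # Note: re-encoding the decoded string can give a slightly different
    # count, because BPE can merge across the padding boundary.
    return enc.decode(tokens[:target])
```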
With any luck, the chunking strategy takes some account of document structure or content when deciding where to split. A lack of such quality may have been one reason the output was undisclosed for so long.
Splitting at actual "tokens" is likely not what happens, and the count is only an estimate: the same vector store can be sent to either gpt-4-turbo with a cl100k tokenizer or gpt-4o with an o200k tokenizer, and are they going to build a separate data store for each model?
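You can see the mismatch locally with tiktoken; token counts for the same text typically differ between the two encodings (a quick sketch, the sample sentence is arbitrary):

```python
# Token counts for the same text usually differ between cl100k_base and
# o200k_base, so one shared vector store cannot be split at exact token
# boundaries for both models. Sketch using tiktoken.
import tiktoken

text = "Contextual Retrieval prepends a short summary to every chunk."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
```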
Token-level splitting is also hard because truncating byte sequences can produce invalid Unicode at the start or end of a chunk. At best, invalid bytes sent to the AI as byte tokens are junk it can ignore, and that is on top of words being chopped in half.
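Plain Python shows the boundary problem; cutting a UTF-8 byte string in the middle of a multi-byte character leaves invalid bytes (the sample text is arbitrary):

```python
# Cutting a UTF-8 byte stream at an arbitrary point can land inside a
# multi-byte character, leaving invalid bytes at the boundary.
text = "café 🙂"
raw = text.encode("utf-8")

truncated = raw[:-2]              # cut inside the 4-byte emoji
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8 after truncation:", exc)

# Lenient decoding just swaps the damage for replacement characters.
print(truncated.decode("utf-8", errors="replace"))
```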