What is the tokenizer used for OpenAI text-embedding-3-large?

The tokenizer used for text-embedding-ada-002 was cl100k_base. What is the tokenizer used for the new OpenAI embedding model, text-embedding-3-large?

Also, does anyone have any feedback on its performance so far?


cl100k_base

https://platform.openai.com/docs/guides/embeddings/how-can-i-tell-how-many-tokens-a-string-has-before-i-embed-it
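
For anyone who wants to check token counts locally before embedding, here is a minimal sketch using the `tiktoken` package with the `cl100k_base` encoding named above. The helper names `load_encoding` and `num_tokens` are my own, not from the docs, and `load_encoding` assumes `tiktoken` is installed (`pip install tiktoken`) and can fetch its vocabulary file on first use.

```python
def load_encoding(name="cl100k_base"):
    """Load a tiktoken encoding by name (hypothetical helper; needs tiktoken installed)."""
    import tiktoken  # imported lazily so num_tokens() works with any encoder object
    return tiktoken.get_encoding(name)


def num_tokens(text, encoding):
    """Return how many tokens `text` becomes under `encoding`.

    `encoding` is any object with an `encode(str)` method that returns a
    sequence of tokens, such as the object returned by load_encoding().
    """
    return len(encoding.encode(text))
```

Usage would look roughly like `enc = load_encoding()` followed by `num_tokens("some text to embed", enc)`. Loading the encoding once and reusing it avoids re-reading the vocabulary for every string.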

Performance thread: It looks like 'text-embedding-3' embeddings are truncated/scaled versions from higher dim version

TLDR: :thinking:

I don’t know if there’s a TLDR yet, it’s complicated. They’re certainly different. I do recommend you check out the thread! :laughing:

edit: another eval thread: New OpenAI Announcement! Updated API Models and no more lazy outputs
