I noticed that the OpenAI embeddings API can take an array of integers or an array of arrays of integers as input (cf. https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-input). Are there any assumptions about how this array of integers is generated? Otherwise, how does it make sense to embed a bunch of indices without knowing what they represent?
Hi @gshy2014, I’ve checked the URL and didn’t find anything similar to what you’re describing. I think you’d have to elaborate on what you’re looking for and what exactly it is that you can’t find or understand.
It seems you might have a misconception about how embedding models function. These models process information and generate an embedding vector—essentially an array of numbers—that captures the underlying meaning and relationships of words, numbers, images, and other data types.
Welcome to the community!
I think you’re talking about this
The integers here are actually token IDs. You get these token IDs by using the tokenizer that matches the model (GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models).
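For example, here's roughly what that looks like (a minimal sketch, assuming the `tiktoken` package is installed):

```python
import tiktoken

# cl100k_base is the encoding used by the current embedding models
# (text-embedding-ada-002 and the text-embedding-3 family).
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("The integers here are actually token IDs.")
print(token_ids)              # a plain list of ints
print(enc.decode(token_ids))  # round-trips back to the original string
```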
Most devs just use the tokenizer to estimate costs, but OpenAI uses the tokenizer as the first stage of their models (it's probably a one-hot encoding) before getting to the actual meat of the transformer.
You can theoretically do the tokenization step yourself and pass the results as an array, but as a developer you gain pretty much nothing from doing so.
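If you did want to try it, it would look something like this (a sketch, assuming the `openai` Python SDK with `OPENAI_API_KEY` set in your environment; both requests should return the same vector):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "text-embedding-3-small"

text = "Pre-tokenizing gains you pretty much nothing."

# Usual route: send the raw string and let the API tokenize it server-side.
by_text = client.embeddings.create(model=MODEL, input=text)

# DIY route: tokenize locally with the matching encoding, send the token IDs.
enc = tiktoken.get_encoding("cl100k_base")
by_ids = client.embeddings.create(model=MODEL, input=enc.encode(text))

# The two embeddings should match.
print(by_text.data[0].embedding[:5])
print(by_ids.data[0].embedding[:5])
```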
In theory the JSON payload might end up slightly smaller if you're trying to save bandwidth, since token IDs might compress better than raw text, but realistically the difference is probably negligible.
It's a cool find, but the practical uses are pretty niche.
Thanks for the explanation, that makes sense. Is the use of an integer array (token IDs from tiktoken) as input to the API documented anywhere, or is it only intended for internal use?