I noticed that the OpenAI embeddings API can take an array of integers or an array of arrays of integers as input (cf. https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-input). Are there any assumptions about how this array of integers is generated? Otherwise, how does it make sense to embed a bunch of indices without knowing what they represent?
Hi @gshy2014, I’ve checked the URL and didn’t find anything similar to what you’re describing. I think you’d have to elaborate on what you’re looking for and what exactly it is that you can’t find or understand.
It seems you might have a misconception about how embedding models function. These models process information and generate an embedding vector—essentially an array of numbers—that captures the underlying meaning and relationships of words, numbers, images, and other data types.
Welcome to the community!
I think you’re talking about this
The integers here are actually token IDs. You get these token IDs by using the tokenizer that matches the model (GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models).
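For example, here's roughly what that looks like (a minimal sketch, assuming the `tiktoken` package is installed):

```python
import tiktoken

# cl100k_base is the encoding used by the current embedding models
# (text-embedding-ada-002 and the text-embedding-3 family).
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("The integers here are actually token IDs.")
print(token_ids)              # a plain list of ints
print(enc.decode(token_ids))  # round-trips back to the original string
```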
Most devs just use the tokenizer to estimate costs, but OpenAI uses the tokenizer as the first stage of their models (it's probably a one-hot encoding) before getting to the actual meat of the transformer.
You can theoretically do the tokenization step yourself and pass the results as an array, but as a developer you gain pretty much nothing from doing so.
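If you did want to try it, it would look something like this (a sketch, assuming the `openai` Python SDK with `OPENAI_API_KEY` set in your environment; both requests should return the same vector):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "text-embedding-3-small"

text = "Pre-tokenizing gains you pretty much nothing."

# Usual route: send the raw string and let the API tokenize it server-side.
by_text = client.embeddings.create(model=MODEL, input=text)

# DIY route: tokenize locally with the matching encoding, send the token IDs.
enc = tiktoken.get_encoding("cl100k_base")
by_ids = client.embeddings.create(model=MODEL, input=enc.encode(text))

# The two embeddings should match.
print(by_text.data[0].embedding[:5])
print(by_ids.data[0].embedding[:5])
```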
In theory the JSON payload might end up slightly smaller if you're trying to save bandwidth, since token IDs might compress better than raw text, but realistically the difference is probably negligible.
It's a cool find, but the practical uses are pretty niche.
Thanks for the explanation, that makes sense. Is the use of an integer array (token IDs from tiktoken) as input to the API documented anywhere, or is it only intended for internal use?