Embedding model token limit exceeded while using batch requests

I’m trying to send an array of texts to the OpenAI Embeddings API using the text-embedding-ada-002 model, which should have a token limit of 8191, but the API sometimes tells me I’ve gone over the limit even though I haven’t.

I’m working in Ruby, so I’m using the tiktoken_ruby gem to count tokens before sending the batched request.

I even tried keeping each batch well under the limit, around 5,000 tokens by my count, but the API still tells me I sent 8,500+ tokens.

Am I correct in assuming that the token limit for a batched request applies to the cumulative total of every element in the array, and not to each individual element?

Here is the ruby code I am using to perform the task:

vector_response = openai_client.embeddings(
  parameters: {
    model: "text-embedding-ada-002",
    input: batched_inputs
  }
)

batched_inputs is an array of strings, each containing source code.

and this is the response I get:

{"error"=>{"message"=>"This model's maximum context length is 8191 tokens, however you requested 8589 tokens (8589 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", "type"=>"invalid_request_error", "param"=>nil, "code"=>nil}}

I’m guessing the token counting in tiktoken_ruby is off? I’m counting with the encoding for text-embedding-ada-002, like so:

module TikToken
  extend self

  DEFAULT_MODEL = "text-embedding-ada-002"

  # Token count for a string: the length of the token-to-text hash below.
  def self.count(string, model: DEFAULT_MODEL)
    get_tokens(string, model: model).length
  end

  # Returns a hash mapping each token id to its decoded text.
  def get_tokens(string, model: DEFAULT_MODEL)
    encoding = Tiktoken.encoding_for_model(model)
    tokens = encoding.encode(string)
    tokens.map do |token|
      [token, encoding.decode([token])]
    end.to_h
  end
end
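(For reference, the pre-send check is basically just summing the per-string counts over the batch, roughly like this sketch; batch_within_limit? is a made-up helper name:)

# Rough sketch of the pre-send check: sum the token count of every
# string in the batch and compare against the model's 8191-token limit.
TOKEN_LIMIT = 8191

def batch_within_limit?(batched_inputs)
  batched_inputs.sum { |text| TikToken.count(text) } <= TOKEN_LIMIT
end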

Any help to point me in the right direction would be appreciated.

Every input to the embeddings endpoint is seen as one text. There is no “batch processing” that you don’t loop yourself - a list or dictionary of elements is simply seen as that complete formatted text, returning one vector embedding for the entire input. It will also carry the overhead of the container - the quotes and brackets of the strings, keys, values, etc.

text-embedding-ada-002 uses the cl100k_base tokenizer, the same as GPT-4 and GPT-3.5. It should be more efficient than the prior GPT-3 embeddings models, but you can evaluate the actual submission size independently.

Here’s some embeddings code, somewhat obfuscated by its use of the pandas Python library for data analysis. It counts tokens and drops inputs that are too long (without reporting the errors).
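In Ruby, with the tiktoken_ruby gem you are already using, the same idea is roughly this sketch (inputs and MAX_INPUT_TOKENS are just placeholder names):

# Sketch: count tokens per input and silently drop anything over the limit.
require "tiktoken_ruby"

MAX_INPUT_TOKENS = 8191
encoding = Tiktoken.encoding_for_model("text-embedding-ada-002")

usable_inputs = inputs.select do |text|
  encoding.encode(text).length <= MAX_INPUT_TOKENS
end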

Consider your use case - what will you do with those 8k token chunks that are retrievable by an embeddings semantic match? A bit too big to submit to another AI as knowledge.

Thanks for the prompt reply _j!

Ah, so an array input is just seen as one input, and one embedding comes back from the API?

It seemed like multiple inputs and multiple embedding responses were possible according to the docs.

input
string or array
Required
Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. Each input must not exceed the max input tokens for the model (8191 tokens for text-embedding-ada-002). Example Python code for counting tokens.

In small tests, though, I seem to be getting separate embedding vectors for each input I send.

batched_inputs = ["first", "second", "third"]

vector_response = openai_client.embeddings(
  parameters: {
    model: "text-embedding-ada-002",
    input: batched_inputs
  }
)

puts vector_response["data"].length

Which returns 3

You seem to have found the one place where such behavior is documented. To me, it would actually be undesirable or unexpected.

To embed multiple inputs in a single request, pass an array of strings or array of token arrays. Each input must not exceed the max input tokens for the model

For example, say I want to embed a list of four user questions to a chatbot, already tagged by conversation category, for retrieval of that chat-flow category against unseen chat history - 4-in, 1-out. I naively send four strings in a Python list (aka array) - and I get four embeddings instead of one for the conversation topic? A vector for a question like “really?” doesn’t do me much good.

Since you pay by input token, the only advantage is sending fewer requests. The array container adds more tokens than just sending the multiple strings in multiple requests, even if it’s merely the comma separating them.

You can extract the whole input right before it is submitted and tokenize that.
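Something roughly like this sketch (assuming batched_inputs is your array of strings; tokenizing the client-side JSON may not match exactly what the server counts):

# Sketch: compare the sum of per-string token counts with the token count
# of the serialized array, to see how much the container itself adds.
require "json"
require "tiktoken_ruby"

encoding = Tiktoken.encoding_for_model("text-embedding-ada-002")

per_string_total = batched_inputs.sum { |text| encoding.encode(text).length }
serialized_total = encoding.encode(batched_inputs.to_json).length

puts "sum of individual strings: #{per_string_total}"
puts "serialized array:          #{serialized_total}"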

(here’s another thought to ponder: since the limit you are hitting indicates that all input is loaded into the single context length of the embedding model, is each array member truly seen as independent, returning its own embedding, or would they build on each other, returning the embedding state up to the point of each cumulative input? Either seems possible, each with a scenario where the behavior could be employed…)

I’m submitting potentially hundreds of files for embedding per user request, so I’m batching mainly to speed up processing time.

So, if I’m understanding correctly, an array would have a higher token overhead than just a string? Hmmm

It seems like the token count usage is as expected, for example:

batched_inputs = ["first", "second", "third"]

returns total_tokens as 3 from the response and

batched_inputs = ["first", "second", "third", "third", "fifth"]

returns total_tokens from the response as 5
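(That check is roughly this sketch, reading the usage block out of the response and comparing it with my local count:)

# Sketch: compare the API's reported usage with the local tiktoken count.
local_total = batched_inputs.sum { |text| TikToken.count(text) }
api_total   = vector_response["usage"]["total_tokens"]

puts "local count: #{local_total}, API count: #{api_total}"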

I’m still baffled as to why a list of texts totaling ~5,000 tokens (by my count) gets a token limit error, though… so I’m not sure.

Edit:

After a bit more testing, maybe it has to do with special characters and extra spaces?

I’m making embeddings of programming code, so the text includes \n characters and multiple spaces that represent tabs. Maybe those are counted differently.

batched_inputs = ["\n hi \n chicken  food \n"]
TikToken.count(batched_inputs)

returns 6 tokens

but the vector response from OpenAI reports 7 tokens, so that might be where the discrepancy is coming from… hmm
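(One way I can dig into that is to dump the raw token list for the test string and decode each piece, roughly:)

# Sketch: print each raw token with its decoded text to see exactly how
# the whitespace and newlines are being split up.
encoding = Tiktoken.encoding_for_model("text-embedding-ada-002")
tokens = encoding.encode("\n hi \n chicken  food \n")

puts tokens.length
tokens.each { |t| p encoding.decode([t]) }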

Yup, the multiple spaces were the problem.

I just took my inputs and replaced every run of two or more consecutive spaces with a single space.
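Roughly like this:

# Sketch: collapse runs of two or more spaces into a single space
# before counting tokens and sending the batch.
normalized_inputs = batched_inputs.map { |text| text.gsub(/ {2,}/, " ") }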

Might cause problems later on, but now the code is working.

Thank you @_j for walking through that with me!


Hey, I was reading through this and was curious whether removing the newline (\n) characters or extra spaces (tabs) wound up messing up the training? How did it work for you?