Text-embedding-ada-002 embeddings sometimes return NaN

I have a documentation search application using text-embedding-ada-002 to compute embeddings.
I submit documentation snippets in batches of 500 (each 500-2000 characters in length) and retrieve the embeddings.
This has worked fine for months, but lately I sometimes get [NaN] back as the embedding for a particular snippet. When I try again, the same text generally returns a good embedding.

This behavior is not documented, and it caused a fair bit of trouble in my application until I found the problem and added checking/re-tries to the embedding computation.

  1. Is this behavior documented somewhere? If so, where?
  2. Is this behavior known/expected? If not, where do I report the bug?
  3. What else can I do to avoid this problem, rather than re-trying?

I’m using the Python API, now with the following code:

import json
import time

import openai.embeddings_utils

def retry_embeddings(nugtxt):
    """Compute embeddings for a batch of strings, re-trying on bad or missing results."""
    for i in range(7):
        if i > 0:
            # exponential back-off between attempts
            time.sleep(1.0 + 0.3 * 2.5**i)
        try:
            emb = openai.embeddings_utils.get_embeddings(nugtxt, engine="text-embedding-ada-002")
            if len(emb) == len(nugtxt):
                ok = True
                for em in emb:
                    if len(em) != 1536:
                        print(f"Embedding computed is wrong length: {len(em)}\nJSON: {json.dumps(em)}\n", flush=True)
                        ok = False
                        break
                if ok:
                    return emb
            print(f"len(emb)={len(emb)}; len(nugtxt)={len(nugtxt)}; re-trying", flush=True)
        except Exception as x:
            # not every exception carries user_message, so fall back gracefully
            print(f"Exception in get_embeddings(): {x}\nmessage: {getattr(x, 'user_message', '')}", flush=True)
    raise Exception("retry_embeddings() still fails after retries")

nugtxt is a list of 500 text strings.
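
For what it's worth, the length check above only catches the [nan] case because the bad embedding comes back as a single-element list; an explicit finiteness check is easy to add. A minimal sketch (the helper name is mine, not part of the API):

import math

def embedding_looks_valid(em, expected_dim=1536):
    # True if the embedding has the expected length and contains no NaN/Inf values
    return len(em) == expected_dim and all(math.isfinite(v) for v in em)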

Hi Jon,

  1. Not sure it’s documented specifically, but as with any API call, best practice is to assume communication/server-side issues and validate your return data aggressively.

  2. You are welcome to create a bug report in #bug-reports on the OpenAI Discord server, or this post will get looked at by various members of the OpenAI dev team.

  3. You might get better results with smaller chunks, but that is a total guess and just going off the old “less time in flight” adage.

Thanks for the answer.
I’m batching requests because I want to reduce latency.
Sure, I could batch 100 at a time, but then I’ll pay 5x the network call overhead…

I’ll poke at the Discord bug reports channel.


Hi @jwatte,
The 'engine' parameter was deprecated in favor of 'model', which makes me think this helper may not be the best choice for long-term use.

Here’s how this function is currently defined in the package:

@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embeddings(
    list_of_text: List[str], engine="text-similarity-babbage-001", **kwargs
) -> List[List[float]]:
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."

    # replace newlines, which can negatively affect performance.
    list_of_text = [text.replace("\n", " ") for text in list_of_text]

    data = openai.Embedding.create(input=list_of_text, engine=engine, **kwargs).data
    return [d["embedding"] for d in data]

I would instead recommend using:

emb = openai.Embedding.create(model="text-embedding-ada-002", input=nugtxt)
if len(emb['data']) == len(nugtxt):
    ...
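
To flesh that out slightly (a sketch only; each item in the response carries an "index" field, so sorting by it guards against any reordering):

resp = openai.Embedding.create(model="text-embedding-ada-002", input=nugtxt)
if len(resp["data"]) == len(nugtxt):
    # sort by the returned index so embeddings line up with the input order
    items = sorted(resp["data"], key=lambda d: d["index"])
    embeddings = [d["embedding"] for d in items]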

Also, keep in mind that text-embedding-ada-002 has a token limit of 8191 tokens, so keep the length of your individual items lower than that.
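
If you want to enforce that limit before sending a batch, tiktoken can count tokens per item (a rough sketch; the helper name is just illustrative):

import tiktoken

def oversized_items(texts, model="text-embedding-ada-002", limit=8191):
    # return the indices of items whose token count exceeds the per-item limit
    enc = tiktoken.encoding_for_model(model)
    return [i for i, t in enumerate(texts) if len(enc.encode(t)) > limit]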

Here’s the latest documentation on embeddings.


Renaming the argument seems like a good idea for when I update the code.
I doubt that this argument would cause the occasional NaN, though – if it didn’t work, it wouldn’t work at all!

The token limit is per item, while a batch can contain many items. This is why I keep each text string to 500-2000 characters. (Well, this, and the fact that I get better match results when chunking to that size.)


Thanks for the info.

Does it happen with a specific batch of strings or randomly with any batch?


It is entirely random, and not very frequent.
Re-trying the same (full) batch when it happens generally works the second time.
Sometimes, it will happen to two strings within the same batch.
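
One thing I'm considering, rather than re-sending the whole batch, is re-requesting only the failed positions. A rough sketch (hypothetical helper, using the openai.Embedding.create call suggested above):

def fill_in_failed(texts, embeddings, expected_dim=1536):
    # re-request embeddings only for positions that came back bad (e.g. [nan])
    bad = [i for i, em in enumerate(embeddings) if len(em) != expected_dim]
    if bad:
        resp = openai.Embedding.create(model="text-embedding-ada-002",
                                       input=[texts[i] for i in bad])
        for i, item in zip(bad, sorted(resp["data"], key=lambda d: d["index"])):
            embeddings[i] = item["embedding"]
    return embeddings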


I’m having the same issue; it seems to appear randomly, but only during large requests.
I can’t include links here for some reason, but there are other people with the same issue if you search for “ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape” on Stack Overflow and on LangChain’s GitHub issue tracker.
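
That error is consistent with a stray [nan] embedding: one length-1 list mixed into a batch of 1536-length vectors makes the array ragged when it gets stacked. A minimal reproduction of that assumption (NumPy 1.24 or newer raises here):

import numpy as np

good = [0.1] * 1536        # a normal embedding
bad = [float("nan")]       # the [nan] placeholder some rows come back as

# ValueError: setting an array element with a sequence.
# The requested array has an inhomogeneous shape ...
np.array([good, bad])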


Also seeing this. My experience is that there tend to be 4 or 5 [nan] embeddings out of every batch of 2048. And, as others note, the failures are random. I.e., for the exact same inputs, you will see different [nan] embeddings on each run; the failures do not correlate with input.

(If I had to make a wild guess: each embedding is probably handled by a separate worker, and when one job experiences a scheduling delay, the fanout/scheduling time exceeds a total runtime deadline for processing the batch. Statistically, larger batches would be more sensitive to this kind of failure mode.)
