jwatte
1
I have a documentation search application using text-embedding-ada-002 to compute embeddings.
I submit documentation snippets in batches of 500 (each 500-2000 characters in length) and retrieve the embeddings.
This has worked fine for months, but lately I sometimes get [NaN] back as the embedding for a particular snippet. Trying again will generally return a good embedding for the same text.
This behavior is not documented, and it caused a fair bit of trouble in my application until I found the problem and added checks and retries to the embedding computation.
- Is this behavior documented somewhere? If so, where?
- Is this behavior known/expected? If not, where do I report the bug?
- What else can I do to avoid this problem, rather than re-trying?
I’m using the Python API, now with the following code:
import json
import time

import openai
import openai.embeddings_utils


def retry_embeddings(nugtxt):
    for i in range(7):
        if i > 0:
            # exponential backoff between attempts
            time.sleep(1.0 + 0.3 * 2.5 ** i)
        try:
            emb = openai.embeddings_utils.get_embeddings(nugtxt, engine="text-embedding-ada-002")
            if len(emb) == len(nugtxt):
                ok = True
                for em in emb:
                    if len(em) != 1536:
                        print(f"Embedding computed is wrong length: {len(em)}\nJSON: {json.dumps(em)}\n", flush=True)
                        ok = False
                        break
                if ok:
                    return emb
            print(f"len(emb)={len(emb)}; len(nugtxt)={len(nugtxt)}; re-trying", flush=True)
        except Exception as x:
            # not every exception carries a user_message attribute, so guard the access
            print(f"Exception in get_embeddings(): {x}\nmessage: {getattr(x, 'user_message', '')}", flush=True)
    raise Exception("retry_embeddings() still fails after retries")
nugtxt is a list of 500 text strings.
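For reference, here is a minimal sketch (not the exact check above) of the kind of validation the loop relies on: it flags any returned embedding that has the wrong length or contains NaN. The validate_embeddings helper name is just for illustration.

import math

def validate_embeddings(embeddings, expected_dim=1536):
    # Return the indices of embeddings that are malformed:
    # wrong dimensionality (e.g. a stray [nan] of length 1) or containing NaN values.
    bad = []
    for idx, em in enumerate(embeddings):
        if len(em) != expected_dim or any(math.isnan(v) for v in em):
            bad.append(idx)
    return bad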
jwatte
3
Thanks for the answer.
I’m batching requests because I want to reduce latency.
Sure, I could batch 100 at a time, but then I’ll pay 5x the network call overhead…
I’ll poke at the Discord bug reports channel.
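For context, here is a rough sketch of what chunking into smaller batches would look like, assuming the retry_embeddings() helper above; the chunk_size default and the embeddings_in_chunks name are only illustrative.

def embeddings_in_chunks(texts, chunk_size=100):
    # Each chunk is a separate API round trip, so smaller chunks trade
    # extra network overhead for smaller batches per request.
    all_embeddings = []
    for start in range(0, len(texts), chunk_size):
        all_embeddings.extend(retry_embeddings(texts[start:start + chunk_size]))
    return all_embeddings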
1 Like
sps
4
Hi @jwatte,
The 'engine' parameter was deprecated in favor of 'model', which makes me think this helper function may not be the best choice for long-term use.
Here’s how this function is currently defined in the package:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embeddings(
    list_of_text: List[str], engine="text-similarity-babbage-001", **kwargs
) -> List[List[float]]:
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."

    # replace newlines, which can negatively affect performance.
    list_of_text = [text.replace("\n", " ") for text in list_of_text]

    data = openai.Embedding.create(input=list_of_text, engine=engine, **kwargs).data
    return [d["embedding"] for d in data]
I would instead recommend using:
emb = openai.Embedding.create(model="text-embedding-ada-002", input=nugtxt)
if len(emb['data']) == len(nugtxt):
    ...
Also, keep in mind that text-embedding-ada-002 has a token limit of 8191 tokens, so keep the length of your individual items lower than that.
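If it helps, here is a small sketch of how the per-item token count could be checked with tiktoken (assuming tiktoken is installed; the over_limit helper name is just for illustration):

import tiktoken

# text-embedding-ada-002 uses the cl100k_base encoding
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def over_limit(texts, max_tokens=8191):
    # Return the indices of items that exceed the per-item token limit.
    return [i for i, t in enumerate(texts) if len(enc.encode(t)) > max_tokens]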
Here’s the latest documentation on embeddings.
3 Likes
jwatte
5
Renaming the argument seems like a good idea for when I update the code.
I doubt that argument would cause the occasional NaN, though; if it didn't work, it wouldn't work at all!
The token limit is per item, and the batch then contains a number of items. This is why I keep each text string to 500-2000 characters. (Well, this, and the fact that I get better match results when chunking to that size.)
1 Like
sps
6
Thanks for the info.
Does it happen with a specific batch of strings or randomly with any batch?
1 Like
jwatte
7
It is entirely random, and not very frequent.
Re-trying the same (full) batch when it happens generally works the second time.
Sometimes it will happen to two strings within the same batch.
1 Like
I’m having the same issue; it seems to appear randomly, but only during large requests.
I can’t include links here for some reason, but there are other people with the same issue if you search for “ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape” on Stack Overflow and on LangChain’s GitHub issue tracker.
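For context, here is a minimal sketch of how that error arises with recent numpy versions: one stray [nan] embedding makes the list ragged, so it can no longer be stacked into a single array. The filtering shown is only for illustration; in practice you would re-request the failed items.

import math
import numpy as np

embeddings = [[0.1] * 1536, [float("nan")], [0.2] * 1536]  # one failed item

# np.array(embeddings) raises:
#   ValueError: setting an array element with a sequence.
#   The requested array has an inhomogeneous shape ...

good = [e for e in embeddings if len(e) == 1536 and not any(math.isnan(v) for v in e)]
matrix = np.array(good)  # shape (2, 1536)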
2 Likes
emcd
9
Also seeing this. My experience is that there tend to be 4 or 5 [nan] embeddings out of every batch of 2048. And, as others note, the failures are random. I.e., for the exact same inputs, you will see different [nan] embeddings on each run; the failures do not correlate with input.
(If I had to make a wild guess, probably each embedding is being handled by a separate process and the fanout/scheduling time is exceeding a total runtime deadline for processing the batch in some cases, when an embedding job experiences a scheduling delay. Statistically, larger batches would be more sensitive to this kind of failure mode.)
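One possible mitigation along those lines: rather than resubmitting the whole batch, re-request only the items that came back malformed. A rough sketch, assuming the v0.x openai.Embedding.create API discussed above (the embed_with_repair name and max_passes parameter are mine):

import math
import openai

def embed_with_repair(texts, model="text-embedding-ada-002", max_passes=5):
    # Embed the whole batch, then re-request only the items that failed.
    results = [None] * len(texts)
    pending = list(range(len(texts)))
    for _ in range(max_passes):
        if not pending:
            break
        resp = openai.Embedding.create(model=model, input=[texts[i] for i in pending])
        still_bad = []
        for slot, item in zip(pending, resp["data"]):
            emb = item["embedding"]
            if len(emb) == 1536 and not any(math.isnan(v) for v in emb):
                results[slot] = emb
            else:
                still_bad.append(slot)
        pending = still_bad
    if pending:
        raise RuntimeError(f"still missing embeddings for {len(pending)} items after {max_passes} passes")
    return results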
2 Likes