I recently noticed that retrieving query embeddings via the API has become extremely slow compared to what it used to be.
I’m used to millisecond responses; now a call can take 10, 20, or 40 seconds, or even more than a minute.
I ran a quick bit of code, sending a list with a single item (a short phrase) for ten trials against each model; a sketch of the loop follows the table:
| Model | Min (ms) | Max (ms) | Avg (ms) |
|---|---|---|---|
| text-embedding-3-large | 560.55 | 1099.22 | 733.91 |
| text-embedding-3-small | 565.86 | 1332.11 | 832.81 |
| text-embedding-ada-002 | 514.77 | 843.24 | 624.51 |
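For replication, a minimal sketch of such a timing loop, assuming the openai Python SDK (v1+) with OPENAI_API_KEY set in the environment:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = [
    "text-embedding-3-large",
    "text-embedding-3-small",
    "text-embedding-ada-002",
]

for model in MODELS:
    timings = []
    for _ in range(10):  # ten trials per model
        start = time.perf_counter()
        client.embeddings.create(model=model, input=["a short phrase"])
        timings.append((time.perf_counter() - start) * 1000)  # elapsed ms
    print(f"{model}: min {min(timings):.2f}  max {max(timings):.2f}  "
          f"avg {sum(timings) / len(timings):.2f} (ms)")
```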
There seem to be no cases of a big delay.
I imagine that sending more text adds computational expense for token encoding and attention, but I didn’t test huge texts or long lists of inputs in a single call.
Perhaps you can characterize what you are sending as input in a single API call, both for better replication and to avoid the “bad” case. A timeout with retries is also your friend.
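For the timeout-and-retry approach, here is a minimal sketch with a short client-side timeout and manual exponential backoff; the 10-second timeout, three attempts, and model name are illustrative values, not recommendations:

```python
import time
from openai import OpenAI, APIConnectionError, APITimeoutError

# Fail fast on a stalled call instead of hanging for minutes;
# max_retries=0 so the loop below controls retries, not the SDK.
client = OpenAI(timeout=10.0, max_retries=0)

def embed_with_retry(texts, model="text-embedding-3-small", attempts=3):
    for attempt in range(attempts):
        try:
            return client.embeddings.create(model=model, input=texts)
        except (APITimeoutError, APIConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(2 ** attempt)  # back off: 1 s, then 2 s
```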
Our solution is affected in the same way. Since yesterday, the response times of embedding models have increased significantly.
I don’t have systematic benchmarks from before yesterday, but based on observation, embeddings that used to take under one second can now take up to 30 seconds!
I tested using different VPNs, via cURL, and through the Python API, and observed the same effect. The embedding API has become unusable.
We are also facing the same issue. The embedding API is taking an inconsistent amount of time: it used to respond within 2-3 seconds, but now the response time varies from 3 seconds to 1-2 minutes.
We can flag the issue to be passed along through OpenAI’s channel here if it seems to be something that would take manual resolution and “fixing”, either for the platform or for a large section of developers.
Being able to replicate the issue and the conditions that cause it makes a successful investigation and fix more likely. The tier of the account’s organization may again be relevant.
It’s important to note in your report whether you are using the OpenAI SDK’s default retry mechanism, which internally times out a non-responsive API call and silently makes several retries of its own; it won’t report an error unless you set max_retries=0 on the client.
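For example, a sketch of turning the built-in retries off so a slow or dropped request shows up as an error in your logs rather than as one mysteriously long call (in the current Python SDK the client parameter is max_retries, defaulting to 2):

```python
from openai import OpenAI, APITimeoutError

client = OpenAI(max_retries=0)  # default is 2 silent retries

try:
    client.embeddings.create(
        model="text-embedding-3-small",
        input=["latency probe"],
    )
except APITimeoutError as e:
    # With retries disabled, a stalled call is reported instead of
    # being retried behind the scenes and counted as one slow call.
    print(f"request timed out: {e}")
```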
I upped the amount sent to embeddings, using the complete works of William Shakespeare as an input source. To mirror typical usage, I sent a list of 5 strings per API call, each string 4000 characters, similar to the chunk size used for search retrieval by Assistants or for RAG injection, with unique chunks per model trial. This went direct from me to OpenAI on non-shared resources.
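Roughly, the batching looked like this (a sketch; the shakespeare.txt filename is an assumption, any large plain-text source works):

```python
import time
from openai import OpenAI

client = OpenAI()

# Slice a large text into 4000-character strings to mirror
# retrieval-sized chunks, then send them five at a time.
text = open("shakespeare.txt", encoding="utf-8").read()  # hypothetical filename
chunks = [text[i:i + 4000] for i in range(0, len(text), 4000)]

for trial in range(10):
    batch = chunks[trial * 5:trial * 5 + 5]  # unique chunks per trial
    start = time.perf_counter()
    client.embeddings.create(model="text-embedding-3-large", input=batch)
    print(f"trial {trial}: {(time.perf_counter() - start) * 1000:.1f} ms")
```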
Performance improved briefly for one day, but unfortunately the latency issues have returned and worsened significantly.
We are currently using the Python SDK (openai==1.61.1) with the embedding model text-embedding-3-large. However, even with the smaller embedding models, latency remains a major concern. I’ve also tested with direct cURL requests, but the latency problem persists.
For identical texts, latency was under 1 second just a week ago. Currently, latency varies significantly, averaging over 15 seconds, which is unacceptable for chatbot applications. Given that this situation isn’t reflected on the status page (https://status.openai.com/), using the embeddings API with such instability is practically impossible in a production environment.
I’m connecting to the API from Poland.
Please, help.
I’m having a similar latency issue in Germany. I think it’s hard to generalise from latency tests run at a single point in time: if you’re lucky you’ll get results quickly; if not, you’ll wait longer. The real question is: how stable and robust is the API over long periods of time?