Extremely long request times - Completions API (gpt-4o)

  • Top 5% of requests are taking >30 s
  • About our setup:
    • gpt-4o-2024-08-06
    • Node.js, Chat Completions API
    • We make multiple chat completion requests in parallel (sketched below)
    • We use function calling
    • The requests themselves complete successfully
    • Usage is well below our OpenAI rate limits
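
Roughly, our request pattern looks like this. This is a simplified sketch with a placeholder tool schema and placeholder inputs, not our production code:

import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Placeholder tool definition, mirroring the createSummary call in the
// example response further down.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "createSummary",
      parameters: {
        type: "object",
        properties: { title: { type: "string" } },
        required: ["title"],
      },
    },
  },
];

async function summarize(text: string) {
  return client.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    messages: [{ role: "user", content: `Summarize: ${text}` }],
    tools,
  });
}

// Several chat completion requests fired in parallel, as described above.
const inputs = ["scene 1 text…", "scene 2 text…"]; // placeholders
const results = await Promise.all(inputs.map(summarize));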

Usage
We are Tier 5 and well within the rate limits:

Model    RPM     TPM         Batch Queue Limit
gpt-4o   10,000  30,000,000  5,000,000,000
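
For anyone who wants to verify the same thing on their own traffic: every API response carries x-ratelimit-* headers, and I believe the Node SDK’s .withResponse() helper exposes the raw response so you can read them. A minimal sketch (model and prompt are placeholders):

import OpenAI from "openai";

const client = new OpenAI();

// .withResponse() returns the parsed body together with the raw HTTP
// response, so the rate-limit headers can be inspected per request.
const { data, response } = await client.chat.completions
  .create({ model: "gpt-4o", messages: [{ role: "user", content: "ping" }] })
  .withResponse();

console.log("remaining requests:", response.headers.get("x-ratelimit-remaining-requests"));
console.log("remaining tokens:", response.headers.get("x-ratelimit-remaining-tokens"));
console.log("reply:", data.choices[0].message.content);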

Example response (>100s request time):

{
  "id": "chatcmpl-AaOSwrmyRWJmN59j9vg8tyb0SJwZR",
  "object": "chat.completion",
  "created": 1733237218,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_ZVL3JKwqrB39HFU4cjr1IsmU",
            "type": "function",
            "function": {
              "name": "createSummary",
              "arguments": "{\"title\":\"The Feuding Families of Verona\",\"summaries\":[{\"subheader\":\"The Montague-Capulet Feud\",\"bulletPoints\":[\"The longstanding feud between the Montagues and Capulets disrupts the peace in Verona.\"]},{\"subheader\":\"The Street Brawl\",\"bulletPoints\":[\"A street fight breaks out between the servants of the Montague and Capulet households.\"]},{\"subheader\":\"The Prince's Decree\",\"bulletPoints\":[\"Prince Escalus declares that further disturbances will be punished by death.\"]},{\"subheader\":\"Romeo's Melancholy\",\"bulletPoints\":[\"Romeo is introduced as a lovesick young man, pining for Rosaline.\"]},{\"subheader\":\"Benvolio's Advice\",\"bulletPoints\":[\"Benvolio advises Romeo to forget Rosaline and look at other women.\"]}]}"
            }
          }
        ],
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 671,
    "completion_tokens": 168,
    "total_tokens": 839,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "system_fingerprint": "fp_7f6be3efb0"
}

Welcome to the community forums! Good to have you.

Just for clarification,

  1. you’re not using streaming, right?
  2. with latency, you’re not referring to time-to-first-token, but rather just request → response time, is that correct?

This might be pretty concerning, because it might imply that a lot of people using streaming are getting billed for what they would consider timeout errors :thinking:


Hi, thanks for the response!

  1. We are not streaming.
  2. Yes, request → response time. We are using a proxy monitoring platform, and supposedly the time-to-first-token is 0 ms, but I’d take that with a grain of salt (a way to measure it directly is sketched below). Worth saying these issues were happening before we started using the proxy.
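
For what it’s worth, here is a minimal way to measure time-to-first-token directly with streaming, instead of trusting the proxy’s numbers (model and prompt are just placeholders):

import OpenAI from "openai";

const client = new OpenAI();

const start = Date.now();
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is 2+2?" }],
  stream: true,
});

let ttft: number | null = null;
for await (const chunk of stream) {
  // Record the moment the first content token arrives.
  if (ttft === null && chunk.choices[0]?.delta?.content) {
    ttft = Date.now() - start;
  }
}
console.log(`TTFT: ${ttft} ms, total: ${Date.now() - start} ms`);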

Any advice on how we could go about debugging this? Since it seems random, we considered killing the connection after a certain time limit and retrying (sketched below), but that would still leave us with long response times.
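
If we went that route, I believe the official Node SDK already supports it via client-level timeout and retry options, something like the following (the thresholds are illustrative):

import OpenAI from "openai";

// Abort any request still open after 30 s, and retry aborted or failed
// requests up to twice.
const client = new OpenAI({
  timeout: 30 * 1000,
  maxRetries: 2,
});

// The timeout can also be overridden per request:
const completion = await client.chat.completions.create(
  { model: "gpt-4o", messages: [{ role: "user", content: "ping" }] },
  { timeout: 15 * 1000 },
);

That would cap the damage, but as noted, it still wouldn’t fix the underlying latency.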

May I ask which platform you are using to monitor this? I have also been observing much higher GPT latencies recently and am trying to get to the bottom of it. In fact, I use the Python client library, and I notice some calls don’t even return after two minutes!

Same problems here with Assistants, starting today. For example, retrieving information from existing threads can take several seconds to tens of seconds, and similarly when creating threads, etc.

I am facing the exact same problem here: the latency times are suddenly significantly longer. I have an optimised prompt that used to take 1 second, and now it suddenly takes 30 seconds. I am using both gpt-4o and gpt-4o-mini and see the same excessive latency.

Same here… requests took more than 31.28 s when they normally take less than 5 s. It is slow on both streaming and non-streaming. Our prod environment cannot process customer requests. Outages.

It’s solved for me now, thanks! I am watching the OpenAI status page.

Here’s a minimal reproducible example of the abnormally high latency times. I see similar times when using gpt-4o-mini, and also when I converted the code below to LangChain.

from dotenv import load_dotenv
import time

from openai import OpenAI

# Load OPENAI_API_KEY from a local .env file
load_dotenv()

messages = [{"role": "user", "content": "What is 2+2?"}]

start_time = time.time()

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
session = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
end_time = time.time()

print(f"time taken: {end_time - start_time} seconds")
response_message = session.choices[0].message.content
print("Response:", response_message)

# time taken: 31.05855894088745 seconds
# Response: 2 + 2 equals 4.

Where are you located? Are you using a proxy? Are you making the call from a data center?

Let me share my case.

I have two API keys:
one is very long: ‘sk-proj-***********************************************************************************************************************************’
the other is short: ‘sk-***********************************’

They show different speeds:
the long key is fast,
the short key is slow.

How…

import { ChatOpenAI } from "langchain/chat_models/openai";
import { StringOutputParser } from "langchain/schema/output_parser";

let key = "sk-....";

const parser = new StringOutputParser();
const model = new ChatOpenAI({
  modelName: "gpt-4o",
  temperature: 0.6,
  openAIApiKey: key,
});

// systemPrompt and humanPrompt are defined elsewhere in my code
const res = await model.pipe(parser).invoke([systemPrompt, humanPrompt]);
return res;


It’s solved now!

We are using AWS Lambdas to make the requests (Ireland-based). The proxy uses Cloudflare Workers, so it should be quick, but the issue happens even without the proxy.

“long key is faster.”

Tried re-rolling keys and no luck.