Intermittent Latency Spikes with the Chat Completions API (gpt-4o-mini) in a FastAPI Application

Hello OpenAI community,

I’m encountering intermittent latency issues with the OpenAI Chat Completions API. I’m using the gpt-4o-mini model in an asynchronous FastAPI application, and while some responses come back quickly, others take significantly longer even though my prompts are typically short.

Here’s a summary of the issue:

  • API: Using the gpt-4o-mini model with the AsyncOpenAI client.
  • Symptoms: Response times range from a few seconds up to 18 seconds, even for short prompts.
  • Environment: Running on FastAPI. Network and system resources are stable.
  • Usage: Short prompts, mostly direct questions and short messages.
  • Observations: Latency variations occur seemingly at random, with no correlation to prompt complexity (see the timing sketch after this list for how I’m measuring).
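
For context, here is roughly how I’m measuring the numbers above. This is a simplified sketch rather than my exact code; the `timed_chat_completion` wrapper and the logger name are just illustrative:

```python
import time
import logging

logger = logging.getLogger("openai-latency")  # illustrative logger name

async def timed_chat_completion(client, messages, model="gpt-4o-mini"):
    # Time only the API call itself, so the measurement isn't skewed
    # by anything else FastAPI does in the request handler.
    start = time.monotonic()
    completion = await client.chat.completions.create(
        model=model,
        messages=messages,
    )
    elapsed = time.monotonic() - start
    logger.info("chat.completions.create took %.2fs (model=%s)", elapsed, model)
    return completion.choices[0].message.content
```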

Below is the code I’m using to call the API:

```python
from openai import AsyncOpenAI
from config.config import Config
from typing import List, Dict

class OpenAIClient:
    def __init__(self, api_key: str = Config.API_KEY_OPENAI):
        # Reuse a single AsyncOpenAI client for all requests.
        self.client = AsyncOpenAI(api_key=api_key)

    async def create_chat_completion(
        self,
        messages: List[Dict[str, str]],
        model: str = "gpt-4o-mini",
    ) -> str:
        try:
            completion = await self.client.chat.completions.create(
                model=model,
                messages=messages,
            )
            return completion.choices[0].message.content
        except Exception as e:
            print(f"Error in creating chat completion: {str(e)}")
            raise

    async def close(self):
        # Release the underlying HTTP connections (e.g. on FastAPI shutdown).
        await self.client.close()
```
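
I’m also wondering whether explicit timeouts and retries would at least bound the worst case, instead of letting a request sit for 18 seconds. A minimal sketch using the client-level `timeout` and `max_retries` options that the openai-python SDK exposes (the values are placeholders, not recommendations):

```python
import httpx
from openai import AsyncOpenAI
from config.config import Config

client = AsyncOpenAI(
    api_key=Config.API_KEY_OPENAI,
    # Fail fast rather than hanging: cap the whole request at 20s
    # and the TCP connect at 5s. Placeholder values.
    timeout=httpx.Timeout(20.0, connect=5.0),
    # The SDK retries connection errors and certain 429/5xx responses
    # with exponential backoff.
    max_retries=2,
)
```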

Has anyone else encountered similar response-time fluctuations, especially with asynchronous calls? Could server load or region-related factors be a cause, and would switching to a different OpenAI model potentially help? I deploy my service on an AWS EC2 instance in Singapore (ap-southeast-1).
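
In case it affects the answer: one mitigation I’m considering is streaming, so users see the first tokens sooner even when total generation time is unchanged. A rough sketch of a method I might add to the OpenAIClient class above (not production code, and the chunk handling is simplified):

```python
    async def stream_chat_completion(
        self,
        messages: List[Dict[str, str]],
        model: str = "gpt-4o-mini",
    ) -> str:
        # Streaming doesn't reduce total generation time, but the first
        # tokens arrive much sooner, which hides most of a latency spike.
        stream = await self.client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True,
        )
        parts: List[str] = []
        async for chunk in stream:
            # Some chunks carry no content delta, so guard before appending.
            if chunk.choices and chunk.choices[0].delta.content:
                parts.append(chunk.choices[0].delta.content)
        return "".join(parts)
```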