How to reduce response latency in Azure OpenAI GPT-3.5/GPT-4 API or find a better-performing model?

I’m using the Azure OpenAI API with the following models:

  • GPT-3.5 Turbo → ~900 ms average response time
  • GPT-4 → ~1.3 s average response time

My goal is to get faster responses (ideally sub-500ms) for real-time use cases like code fixing.

I’m already using optimized parameters:

{
  "messages": [
    {
      "role": "user",
      "content": "Fix code without explanation.ts\nexport function extractCodeOnly(input: string): string[] {\n const regex = /(?:\w+)?\s*([\s\S]?)\s/g;\n const result: string[] = [];\n\n const match: RegExpExecArray | null;\n while ((match2 = regex.exec(input)) !== null) {\n result.push(match[1].trim());\n }\n\n return xyz;\n}\n"
    }
  ],
  "temperature": 0.2,
  "max_tokens": 100,
  "top_p": 1,
  "presence_penalty": 0,
  "frequency_penalty": 0
}

Despite tuning these settings, the response time is still higher than I’d like. I’ve experimented with stream and logit_bias but haven’t seen significant improvements.
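For what it's worth, streaming won't shorten total generation time, but it does cut *perceived* latency, since the first tokens arrive long before the full completion. Here is a minimal sketch for measuring time-to-first-chunk over any async stream of text deltas (both `fakeStream` and `firstChunkLatencyMs` are hypothetical helpers, not part of any SDK; with the real Azure OpenAI chat-completions stream you would feed in each chunk's delta text instead):

```typescript
// Hypothetical stand-in for a streamed completion: yields text deltas one
// at a time, the way chunks arrive from a chat-completions stream.
async function* fakeStream(chunks: string[]): AsyncGenerator<string> {
  for (const c of chunks) yield c;
}

// Time how long the *first* chunk takes to arrive, then collect the rest.
// firstMs is what the user actually perceives as "response latency".
async function firstChunkLatencyMs(
  stream: AsyncIterable<string>
): Promise<{ firstMs: number; text: string }> {
  const start = Date.now();
  let firstMs = -1; // stays -1 if the stream yields nothing
  let text = "";
  for await (const chunk of stream) {
    if (firstMs < 0) firstMs = Date.now() - start;
    text += chunk;
  }
  return { firstMs, text };
}
```

In practice the time-to-first-chunk is often a fraction of the full completion time, so for a UI like inline code fixing it can matter more than the end-to-end number you're currently measuring.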

Questions:

  • Are there additional settings or strategies to improve response latency?
  • Is there a better-performing model available (via Azure or elsewhere) that offers faster response times for small code-related tasks?
  • Would switching to another model like Claude, LLaMA, or Mistral through a different provider offer a performance gain?

Any insights on tuning, model choices, or deployment tips would be greatly appreciated!

These models are quite outdated; is there a particular reason you're using them?

The corresponding newer models for non-reasoning tasks would be gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano, or gpt-4o and gpt-4o-mini for the previous generation.

New models are usually faster.
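The cleanest way to verify that is to benchmark the candidate deployments yourself on your actual prompt. A rough sketch (the `averageLatencyMs` helper is hypothetical, and `callModel` is a placeholder for your real Azure OpenAI request against a given deployment):

```typescript
// Average wall-clock latency of an async call over a few runs, so candidate
// deployments (e.g. gpt-4o-mini vs gpt-4.1-nano) can be compared on the
// same prompt. `callModel` is a placeholder for the actual API request.
async function averageLatencyMs(
  callModel: () => Promise<string>,
  runs: number = 5
): Promise<number> {
  let total = 0;
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await callModel(); // discard the completion; only the timing matters here
    total += Date.now() - start;
  }
  return total / runs;
}
```

Run it once per deployment with the same prompt and `max_tokens`, and keep in mind that latency varies with region and load, so a handful of runs averaged together is more trustworthy than a single measurement.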