I’m using the Azure OpenAI API with the following models:
- GPT-3.5 Turbo → ~900 ms average response time
- GPT-4 → ~1.3 s average response time
My goal is to get faster responses (ideally sub-500ms) for real-time use cases like code fixing.
I’m already using optimized parameters:
````json
{
  "messages": [
    {
      "role": "user",
      "content": "Fix code without explanation.\n```ts\nexport function extractCodeOnly(input: string): string[] {\n  const regex = /```(?:\\w+)?\\s*([\\s\\S]*?)\\s*```/g;\n  const result: string[] = [];\n\n  const match: RegExpExecArray | null;\n  while ((match2 = regex.exec(input)) !== null) {\n    result.push(match[1].trim());\n  }\n\n  return xyz;\n}\n```"
    }
  ],
  "temperature": 0.2,
  "max_tokens": 100,
  "top_p": 1,
  "presence_penalty": 0,
  "frequency_penalty": 0
}
````
Despite tuning these settings, response times are still higher than I'd like. I've also experimented with `stream` and `logit_bias`, but neither produced a significant improvement.
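For context, my streaming experiment measured time-to-first-token rather than total latency, since streaming doesn't shorten generation, it only makes the first output arrive sooner. This is a rough sketch of how I parse the `data: ...` server-sent-event lines the chat-completions endpoint emits when `"stream": true` is set:

```typescript
// Extracts the incremental text from one SSE line of a streamed
// chat-completion response, or returns null for non-content lines
// (comments, keep-alives, and the terminal "data: [DONE]" marker).
function parseSseLine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length);
  if (payload === "[DONE]") return null;
  const chunk = JSON.parse(payload);
  // Streamed chunks carry incremental text under choices[0].delta.content.
  return chunk.choices?.[0]?.delta?.content ?? null;
}

// Reads decoded response chunks and returns the ms elapsed until the
// first visible token appears -- the latency users actually perceive.
async function timeToFirstToken(chunks: AsyncIterable<string>): Promise<number> {
  const start = Date.now();
  for await (const chunk of chunks) {
    for (const line of chunk.split("\n")) {
      if (parseSseLine(line) !== null) return Date.now() - start;
    }
  }
  return Date.now() - start; // stream ended without content
}
```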
Questions:
- Are there additional settings or strategies to improve response latency?
- Is there a better-performing model available (via Azure or elsewhere) that offers faster response times for small code-related tasks?
- Would switching to another model like Claude, LLaMA, or Mistral through a different provider offer a performance gain?
Any insights on tuning, model choices, or deployment tips would be greatly appreciated!