Need an API That Combines Audio Transcription and Translation

Currently, to implement live translation using OpenAI's APIs, I must first transcribe audio with a transcription model, then send the transcribed text to a chat model for translation. This requires two sequential API calls with two separate outputs, which slows down the overall response time.

If the transcription output could be handled internally and passed directly to the translation step, without being returned separately, it could significantly reduce the delay in real-time translation.

I would greatly appreciate an API that integrates these steps into a single request.
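
For reference, here is a minimal sketch of the two-step pipeline described above, so the latency problem is concrete. The file name speech.wav, the whisper-1 and gpt-4o model choices, and the translation prompt are illustrative assumptions, not part of the original post:

import base64
from openai import OpenAI

client = OpenAI()

# Step 1 (first round trip): transcribe the audio with a dedicated
# transcription model. Model and file name are assumptions.
with open("speech.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

# Step 2 (second round trip): send the transcript to a chat model
# for translation. The target language here is an assumption.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate the user's text into English."},
        {"role": "user", "content": transcription.text},
    ],
)

print(completion.choices[0].message.content)

The delay comes from waiting for the full transcript before the second request can even start.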

You can use gpt-4o-audio-preview or gpt-4o-mini-audio-preview (if you have access to them). These models accept both audio and text in the request and can return both in the response.

The response includes both audio and a transcription of that audio if you enable both in modalities.

This example shows how to enable the text and audio modalities:
import base64
from openai import OpenAI

client = OpenAI()

# Request both text and audio output; the audio parameter selects
# the voice and the container format of the returned audio.
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

print(completion.choices[0])

# The audio is returned base64-encoded; decode it and write it to disk.
# A transcript of the spoken answer is also available at
# completion.choices[0].message.audio.transcript.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)
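
For the live-translation use case in the original question, these models also accept audio as input, so transcription and translation can happen inside a single request. Below is a minimal sketch under stated assumptions: the file name speech.wav and the English-translation instruction are illustrative, and modalities=["text"] requests a text-only answer so no output audio is generated:

import base64
from openai import OpenAI

client = OpenAI()

# Read and base64-encode the source audio (file name is an assumption).
with open("speech.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# One request: the model hears the audio and answers with the
# translation directly; no intermediate transcript is returned.
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[
        {
            "role": "system",
            "content": "Translate the speech in the audio into English.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                }
            ],
        },
    ],
)

print(completion.choices[0].message.content)

This collapses the two round trips from the original pipeline into one, which is exactly the latency saving the question asks for.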

More about audio models:
https://platform.openai.com/docs/guides/audio?api-mode=chat