Hi everyone,
I’m using the TTS API, specifically the gpt-4o-mini-tts model, to generate high-quality audio. I’ve noticed a significant quality difference between the audio generated via the API and what I get from web services using OpenAI (like openai.fm) or the Playground itself.
The audio from the Playground is crystal-clear and full. The audio I generate via the API, even when requesting pcm format and saving it as a .wav file, has a slight but noticeable distortion/artifact, almost a metallic “crackle,” especially on sibilants and high tones.
My workflow is as follows:

- I make a request to the API with response_format="pcm" to get the highest-quality raw data.
- I receive the raw PCM data stream.
- I save this data into a .wav file using the correct parameters (24000 Hz sample rate, 16-bit, mono), which I confirmed by analyzing the files from the Playground.
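For reference, this is roughly how I checked those parameters on the Playground files; a minimal sketch using only the standard-library wave module (inspect_wav and the filename are just illustrative):

```python
import wave

def inspect_wav(path):
    """Return the header parameters of a WAV file (illustrative helper)."""
    with wave.open(path, "rb") as w:
        return {
            "sample_rate": w.getframerate(),         # expecting 24000
            "sample_width_bytes": w.getsampwidth(),  # 2 bytes -> 16-bit
            "channels": w.getnchannels(),            # 1 -> mono
            "frames": w.getnframes(),
        }
```

Running inspect_wav("playground_output.wav") on a Playground download is what confirmed 24000 Hz / 16-bit / mono for me.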
This is the core Python snippet I use for generation and saving:
```python
from openai import OpenAI
import numpy as np
import soundfile as sf

# Client initialization
client = OpenAI(api_key="YOUR_API_KEY")

# API call: raw 16-bit PCM, 24 kHz, mono
response_pcm = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="onyx",
    input="This is a high-quality audio test.",
    response_format="pcm",
)

# Read the raw bytes and save them as a .wav file
pcm_data = response_pcm.read()
audio_array = np.frombuffer(pcm_data, dtype=np.int16)
sf.write(
    "api_output.wav",
    audio_array,
    samplerate=24000,
    subtype="PCM_16",  # explicit, though this is the default for int16 data
)
```
My question is:
Are there any undocumented parameters or specific techniques for the client.audio.speech.create API call that could influence the final audio quality, beyond just the format selection? For example, parameters related to sample rate, bit depth, dithering, or anything else?
My goal is to replicate the same clean, crystal-clear audio quality via the API that is achievable through the Playground.
Any suggestions or insights would be greatly appreciated. Thanks!