Need Help Improving Whisper API Accuracy for Short Words and Pronunciation Tasks

Hi everyone,

I’m using the Whisper API (model: whisper-1) for a pronunciation evaluation project where users record short words, and the API transcribes the audio. While the API works well for many cases, I’m experiencing accuracy issues, especially with short words like:

“Whistle” → transcribed as “We’ll see”
“Castle” → transcribed as “Casky” or “ASCII”
“Chocolate” → transcribed as “Talk later”

Interestingly, after 2-3 attempts with the same word, it starts transcribing correctly and the pronunciation is scored as 100% accurate. However, I’d like to get correct results consistently on the first attempt.

Current Setup:
I’m sending requests with the following parameters (PHP):

    $response = $client->post($apiUrl, [
        'headers' => [
            'Authorization' => 'Bearer ' . $apiKey,
        ],
        'multipart' => [
            ['name' => 'file', 'contents' => fopen($audioFile, 'r'), 'filename' => $_FILES['audio']['name']],
            ['name' => 'model', 'contents' => 'whisper-1'],
            ['name' => 'language', 'contents' => 'en'], // Enforce English
            ['name' => 'temperature', 'contents' => '0.0'],
            ['name' => 'prompt', 'contents' => $_POST['expectedText'] ?? ''],
            ['name' => 'response_format', 'contents' => 'json'],
        ],
    ]);

Issues Faced:

Short words often get misinterpreted (e.g., “whistle” → “We’ll see”).
Multiple attempts are often needed before the transcription is correct.
Accuracy needs improvement for precise pronunciation-based tasks.

What I’ve Tried:

Lowering the temperature to 0.0 for deterministic outputs.
Adding an expected text as a prompt to guide the model.
Cleaning up audio using FFmpeg (sample rate: 16 kHz, format: WAV).
Implementing a retry mechanism to resubmit audio up to 3 times if the output doesn’t match.
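
A minimal sketch of that retry-and-compare step might look like the following. The names normalizeTranscript and transcribeWithRetry, and the $transcribe callable, are hypothetical placeholders for illustration; $transcribe is assumed to wrap the API request shown earlier and return the transcribed text. The comparison ignores case and punctuation so that "Whistle." and "whistle" count as a match.

```php
<?php
// Normalize a transcript for comparison: lowercase, strip punctuation,
// collapse whitespace. Apostrophes are kept so "we'll" stays one word.
function normalizeTranscript(string $text): string
{
    $text = strtolower($text);
    // Remove everything except ASCII letters, digits, apostrophes and whitespace.
    $text = preg_replace("/[^a-z0-9'\\s]+/", '', $text);
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\\s+/', ' ', $text));
}

// Resubmit the same audio up to $maxAttempts times until the normalized
// transcript equals the normalized expected text.
// $transcribe is a hypothetical callable: fn(string $path): string
function transcribeWithRetry(
    callable $transcribe,
    string $audioPath,
    string $expected,
    int $maxAttempts = 3
): array {
    $target = normalizeTranscript($expected);
    $heard = '';
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $heard = normalizeTranscript($transcribe($audioPath));
        if ($heard === $target) {
            return ['matched' => true, 'attempts' => $attempt, 'text' => $heard];
        }
    }
    // No attempt matched; report the last transcript that came back.
    return ['matched' => false, 'attempts' => $maxAttempts, 'text' => $heard];
}
```

Returning the attempt count alongside the result makes it easy to log how often the first attempt fails for each word.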

Questions:

How can I improve Whisper’s accuracy for short words and specific pronunciation tasks?
Are there other parameters or techniques I should use in the API request?
Would Whisper fine-tuning or alternative models be more suitable for my use case?

Any advice or insights from the community would be greatly appreciated!

Thanks in advance!