Reproducing gpt-4o-transcribe FLEURS results

Hi, I’m trying to reproduce the reported gpt-4o-transcribe results on the FLEURS dataset using the transcriptions endpoint.

While the results are very good, I haven’t been able to match the 2.46% WER on the English subset that was reported in the blog post. I wonder if text normalization might be the culprit. I’m using the normalization described in the Whisper paper, as implemented in the whisper-normalizer package.
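
For reference, here is roughly the normalization step I apply before scoring (a minimal sketch, assuming the whisper-normalizer package that mirrors the normalizers from the Whisper repo: EnglishTextNormalizer for English, BasicTextNormalizer for the other languages):

# Minimal sketch of the normalization step, assuming the whisper-normalizer
# package that mirrors the normalizers described in the Whisper paper.
from whisper_normalizer.english import EnglishTextNormalizer
from whisper_normalizer.basic import BasicTextNormalizer

english_normalizer = EnglishTextNormalizer()  # used for the en_us subset
basic_normalizer = BasicTextNormalizer()      # used for the non-English subsets

print(english_normalizer("Mr. Smith paid $1,000 on the 3rd of May."))
print(basic_normalizer("Bonjour, c'est un test !"))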

More info:

  • I am using the google/fleurs Hugging Face dataset
  • With the above normalization, I can reproduce the reported whisper-v2 results via the API
  • The scores are better when a language hint is provided, but still do not match
  • temperature=0.0
  • I have also tried to reproduce the French and Spanish results, and have so far been unable to; I am consistently finding WERs 0.4-0.6% higher than reported (the per-subset language hints I pass are sketched after this list)
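
For the non-English runs I pass a per-subset language hint to the transcriptions call. This is the mapping I’m assuming (config names from google/fleurs, ISO-639-1 codes for the language parameter), so correct me if a different hint is expected:

# Hypothetical mapping from google/fleurs config names to the ISO-639-1 codes
# passed as the `language` hint on the transcriptions endpoint.
FLEURS_LANG_HINTS = {
    "en_us": "en",
    "fr_fr": "fr",
    "es_419": "es",
}

def language_hint(fleurs_config: str) -> str:
    return FLEURS_LANG_HINTS[fleurs_config]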

Was text normalization changed for gpt-4o-transcribe, or am I missing something else?


Hi!
Do you mind sharing your script for evaluating the datasets?
I’ve been trying to evaluate on the Hugging Face datasets too, but I’m getting high WER with gpt-4o-transcribe and gpt-4o-mini-transcribe, while I get good results with whisper-1.

Here is a simplified version of my code that you can use to evaluate:

import os
import tempfile
import concurrent.futures
from tqdm import tqdm
from datasets import load_dataset
import openai
import evaluate
import soundfile as sf
from whisper_normalizer.english import EnglishTextNormalizer

api_key = os.environ.get("OPENAI_API_KEY")
client = openai.OpenAI(api_key=api_key)

def process(sample):
    # Save audio, transcribe, and return result
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    sf.write(tmp, sample["audio"]["array"], sample["audio"]["sampling_rate"])
    result = {"ref": sample["transcription"], "hyp": None}
    try:
        with open(tmp, "rb") as audio:
            result["hyp"] = client.audio.transcriptions.create(
                file=audio, model="gpt-4o-transcribe", language="en", temperature=0).text
    except Exception as e: 
        print(f"Error: {e}")
    finally: 
        if os.path.exists(tmp): 
            os.remove(tmp)
    return result

# Load dataset and process samples
dataset = load_dataset("google/fleurs", "en_us", split="test")
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
    futures = [ex.submit(process, sample) for sample in dataset]
    for f in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
        results.append(f.result())

# Calculate and display WER, skipping samples whose transcription request failed
normalizer = EnglishTextNormalizer()
scored = [r for r in results if r["hyp"] is not None]
refs = [normalizer(r["ref"]) for r in scored]
hyps = [normalizer(r["hyp"]) for r in scored]
wer = evaluate.load("wer").compute(predictions=hyps, references=refs) * 100
print(f"WER: {wer:.2f}%")

Swapping in whisper-1 for gpt-4o-transcribe reproduces the reported Whisper results. Using this script, I get 2.93% WER for gpt-4o-transcribe and 4.00% for whisper-1.

[Screenshot: WER results, 2025-03-27]

This is my WER result on the FLEURS dataset (test set). On other datasets (multi-speaker meetings and noisy background scenarios), however, Whisper-1 performs much better than GPT-4o-Transcribe and GPT-4o-Mini-Transcribe.


This is what I’ve got on the en subset of the FLEURS dataset. However, when benchmarking other datasets such as TED-LIUM or AMI, I am getting really poor results. Have you tried other datasets?

I also found very poor results on AMI (IHM subset). I didn’t evaluate the entire subset, but found WERs above 40% for both gpt-4o-transcribe and gpt-4o-mini-transcribe with English hints. I wanted to figure out whether I was making a mistake with FLEURS before putting too much stock in those findings.
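
For what it’s worth, this is roughly how I pointed the same evaluation at AMI. The dataset id ("edinburghcstr/ami", config "ihm") and the "text" reference column are assumptions about the Hugging Face mirror I used, so adjust them for your setup:

# Hedged sketch: swap the FLEURS dataset for an AMI IHM mirror and reuse the
# process() function from the script above. Dataset id and column names are
# assumptions, not something confirmed in this thread.
from datasets import load_dataset

ami = load_dataset("edinburghcstr/ami", "ihm", split="test")

def to_eval_sample(sample):
    # process() expects "audio" and "transcription" keys; this mirror stores
    # the reference text under "text".
    return {"audio": sample["audio"], "transcription": sample["text"]}

ami_samples = [to_eval_sample(s) for s in ami.select(range(200))]  # quick partial check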

Do you think the difference in our results may be attributable to text normalization differences?
