Hi, I’m trying to reproduce the reported gpt-4o-transcribe results on the FLEURS dataset using the transcriptions endpoint.
While the results are very good, I haven’t been able to match the 2.46% WER on the English subset that was reported in the blog post. I wonder if text normalization might be the culprit. I’m using the normalization described in the Whisper paper and implemented in the whisper-normalizer package.
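Concretely, my scoring step is roughly the following (a minimal sketch; the score helper is just for illustration):
import evaluate
from whisper_normalizer.english import EnglishTextNormalizer

def score(refs, hyps):
    # refs: FLEURS reference transcriptions, hyps: API transcriptions (lists of str)
    norm = EnglishTextNormalizer()  # Whisper-paper English text normalization
    return evaluate.load("wer").compute(
        predictions=[norm(h) for h in hyps],
        references=[norm(r) for r in refs]) * 100  # WER as a percentage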
More info:
I am using the google/fleurs Hugging Face dataset
With the above normalization, I can reproduce the reported whisper-v2 results via the API
The scores are better when a language hint is provided, but still do not match the reported numbers
temperature=0.0
I have also tried reproducing the French and Spanish results, and have so far been unable to do so; I am consistently finding WERs 0.4-0.6 percentage points higher than reported
Was text normalization changed for gpt-4o-transcribe, or am I missing something else?
Hi!
Do you mind sharing your script for evaluating the datasets?
I’ve been trying to evaluate on Hugging Face datasets too, but I’m getting a high WER with gpt-4o-transcribe and gpt-4o-mini-transcribe, while whisper-1 gives good results.
Here is a simplified version of my code that you can use to evaluate:
import os
import tempfile
import concurrent.futures
from tqdm import tqdm
from datasets import load_dataset
import openai
import evaluate
import soundfile as sf
from whisper_normalizer.english import EnglishTextNormalizer
api_key = os.environ.get("OPENAI_API_KEY")
client = openai.OpenAI(api_key=api_key)
def process(sample):
    # Save audio, transcribe, and return result
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    sf.write(tmp, sample["audio"]["array"], sample["audio"]["sampling_rate"])
    result = {"ref": sample["transcription"], "hyp": None}
    try:
        with open(tmp, "rb") as audio:
            result["hyp"] = client.audio.transcriptions.create(
                file=audio, model="gpt-4o-transcribe", language="en", temperature=0).text
    except Exception as e:
        print(f"Error: {e}")
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return result

# Load dataset and process samples
dataset = load_dataset("google/fleurs", "en_us", split="test")
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
    futures = [ex.submit(process, sample) for sample in dataset]
    for f in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
        results.append(f.result())
# Calculate and display WER (skip any samples that failed to transcribe)
results = [r for r in results if r["hyp"] is not None]
normalizer = EnglishTextNormalizer()
refs = [normalizer(r["ref"]) for r in results]
hyps = [normalizer(r["hyp"]) for r in results]
wer = evaluate.load("wer").compute(predictions=hyps, references=refs) * 100
print(f"WER: {wer:.2f}%")
Swapping in whisper-1 for gpt-4o-transcribe reproduces the reported Whisper results. Using this script, I get 2.93% WER for gpt-4o-transcribe and 4.00% for whisper-1.
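Concretely, the only change is the model argument in the transcription call (a minimal standalone sketch; sample.wav is just a placeholder path):
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
with open("sample.wav", "rb") as audio:
    text = client.audio.transcriptions.create(
        file=audio, model="whisper-1", language="en", temperature=0).text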
This is my WER result on the FLEURS dataset (test set), but on other datasets (multi-speaker meetings and noisy background scenarios), whisper-1 performs much better than gpt-4o-transcribe and gpt-4o-mini-transcribe.
This is what I’ve got on the en_us subset of the FLEURS dataset. However, when benchmarking other datasets such as TED-LIUM or AMI, I am getting really poor results. Have you tried other datasets?
I also found very poor results on AMI (IHM subset). I didn’t evaluate the entire subset, but I found WERs above 40% for both gpt-4o-transcribe and gpt-4o-mini-transcribe with an English language hint. I wanted to figure out whether I was making a mistake with FLEURS before putting too much stock in those findings.
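For reference, I adapted the FLEURS script above along these lines; the dataset id, config name, and reference column are from memory, so treat them as assumptions and double-check:
from datasets import load_dataset

# Assumptions (please verify): the Hugging Face mirror is "edinburghcstr/ami"
# with the "ihm" config, and the reference transcript lives in the "text" column.
ami = load_dataset("edinburghcstr/ami", "ihm", split="test")
ami = ami.rename_column("text", "transcription")  # so the process() above works unchanged
After that, the same process()/WER code runs unchanged on whatever slice I score.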
Do you think the difference in our results may be attributable to text normalization differences?