I would switch. It would solve your first problem, and since you wouldn’t even need AI for cleanup, it also solves your second. The design is simpler without the AI cleanup step.
I used AWS Transcribe for a long time, and the word error rates were atrocious.
SWITCH!
If that still doesn’t work, even after prompting the model for better transcriptions, then you can try the AI cleanup.
Does the UK prime minister speak English better than the Japanese prime minister speaks Japanese?
Subjectively, it’s hard to compare as an individual unless you are native-level fluent and literate in both.
In the paper “Robust Speech Recognition via Large-Scale Weak Supervision”, Whisper large-v2 scores a slightly lower word error rate for Japanese than for English on CommonVoice9, but a higher one on Fleurs.
It might also depend on whether you are sampling Japanese gangster movies and need accurate kanji, or trying to understand Scottish English.
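If you want to check this on your own audio rather than rely on published numbers, here is a minimal sketch of how word error rate is usually computed: edit distance between the reference and the transcript, divided by the number of reference words. The function names are my own, not from any library, and it skips the text normalization that published benchmarks apply, so the numbers will not be directly comparable to the paper's. Since Japanese is written without spaces, splitting on whitespace does not give you words there, so I have included a character-level variant as one possible workaround.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] is the distance between the reference prefix processed so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated words / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Same idea at character level, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Run both services over the same handful of clips with hand-checked reference transcripts, and compare the scores. A small sample from your actual domain will tell you more than a benchmark average.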