Comparison of Language & Translation Capabilities Across Models?

With the ever-growing number of language models available, it’s becoming quite a challenge to keep track of their strengths. The new model comparison page is definitely helpful, but I noticed it’s missing one key aspect: language and translation performance.

As someone who regularly uses these models for translation tasks, I’m particularly interested in how well each model handles different languages and literary styles. For instance, GPT-4.5 has consistently delivered excellent results in translation work — but with its discontinuation, it’s unclear which model now stands out in this area.

Has anyone come across a benchmark or comparison that evaluates the language coverage and translation quality of the current models? Something that includes side-by-side results or insights into multilingual training depth would be super helpful.

Thanks in advance for any pointers!

Hi Tobias!

This is a very interesting and somewhat sensitive topic.
I’ve been translating for 26 years (I’m a Hungarian–German specialist translator), and I’ve been using OpenAI’s LLMs through the Playground since version 3.0. In the beginning, I had to make a lot of corrections, but it still proved helpful.

Version 4 did quite well, though for more serious texts (medical discharge summaries, complex contracts, etc.) I still had to edit quite a bit. What I noticed, and I don't think it's just my imagination, is that as soon as a new model comes out (e.g., 4.5), the previous one (4o in this case) somehow isn't as good anymore. I tested this out of curiosity using the exact same text, with the same prompt and temperature settings.
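For anyone who wants to repeat that kind of check, here is a minimal sketch of a same-text, same-prompt, same-temperature comparison. Everything in it is an assumption for illustration: `compare_models` and `fake_translate` are hypothetical names, the model names and sentences are made up, and the stub stands in for whatever real API call you use. The similarity score from Python's `difflib` is only a crude character-overlap proxy against a reference translation, not a real quality metric.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

# Hypothetical signature: (model, prompt, source_text, temperature) -> translation
Translator = Callable[[str, str, str, float], str]


def compare_models(
    translate: Translator,
    models: List[str],
    prompt: str,
    source_text: str,
    reference: str,
    temperature: float = 0.0,
) -> Dict[str, float]:
    """Run the same prompt, text, and temperature through each model and
    score every output against a human reference translation."""
    scores = {}
    for model in models:
        output = translate(model, prompt, source_text, temperature)
        # ratio() is 1.0 for identical strings and lower as they diverge.
        scores[model] = SequenceMatcher(None, output, reference).ratio()
    return scores


def fake_translate(model: str, prompt: str, source_text: str, temperature: float) -> str:
    """Stub translator with canned outputs; replace with a real API call."""
    canned = {
        "model-a": "Der Patient wurde entlassen.",
        "model-b": "Patient ist weg.",
    }
    return canned[model]


scores = compare_models(
    fake_translate,
    models=["model-a", "model-b"],
    prompt="Translate the following Hungarian text into German.",
    source_text="A beteget elbocsátották.",
    reference="Der Patient wurde entlassen.",
)
for model, score in scores.items():
    print(f"{model}: {score:.2f}")
```

Keeping the prompt, source text, and temperature fixed means that any difference in the scores comes from the model alone. For a fair check over time, the same run would have to be repeated on different dates and the outputs saved.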

In my experience, most of the translation quality (I'd say around 80%) depends on the prompt. I have a prompt for legal texts that's over half a page long (and I still need to make corrections). I should add that since Hungarian is not a widely used language, the output can't be expected to be perfect, but it's interesting that each model seems to "get dumber" (at least in my opinion) whenever a new one is released…

For now, I've gone back to using the "4o latest" model, which is somehow more consistent and handles longer texts a bit better.