Comparison of Language & Translation Capabilities Across Models?

tobias9v · April 17, 2025, 7:28am

With the ever-growing number of language models available, it’s becoming quite a challenge to keep track of their strengths. The new model comparison page is definitely helpful, but I noticed it’s missing one key aspect: language and translation performance.

As someone who regularly uses these models for translation tasks, I’m particularly interested in how well each model handles different languages and literary styles. For instance, GPT-4.5 has consistently delivered excellent results in translation work — but with its discontinuation, it’s unclear which model now stands out in this area.

Has anyone come across a benchmark or comparison that evaluates the language coverage and translation quality of the current models? Something that includes side-by-side results or insights into multilingual training depth would be super helpful.

Thanks in advance for any pointers!

DavidOpenAl · April 30, 2025, 8:34am

Hi Tobias!

This is a very interesting and somewhat sensitive topic.
I’ve been translating for 26 years (I’m a Hungarian–German specialist translator), and I’ve been using OpenAI’s LLMs through the Playground since version 3.0. In the beginning, I had to make a lot of corrections, but it still proved helpful.

Version 4 did quite well, though for more serious texts (medical discharge summaries, complex contracts, etc.) I still had to edit quite a bit. What I’ve noticed, and I don’t think it’s just my imagination, is that as soon as a new model comes out (e.g., 4.5), the previous one (4o in this case) somehow isn’t as good anymore. I tested this out of curiosity using the exact same text, with the same prompt and temperature settings.

In my experience, the prompt accounts for most of the translation quality (I’d say around 80%). I have a prompt for legal texts that’s over half a page long, and I still need to make corrections. I should add that since Hungarian is not a widely used language, the output can’t be expected to be perfect, but it’s interesting that each model seems to “get dumber” (at least in my opinion) whenever a new one is released…

For now, I’ve gone back to using the “4o latest” model, which seems more consistent and handles longer texts a bit better.
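
For anyone who wants to repeat this kind of check, here is a minimal sketch of the controlled comparison I described: the same text, the same prompt, and the same temperature sent to each model, with the outputs collected side by side. Everything named here is an assumption for illustration: `compare_models`, the `translate` callable (a stand-in for whatever API client you use), the model names, and the sample sentences. The character-level similarity score from `difflib` is only a rough proxy and not a proper translation-quality metric.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

# Hypothetical signature: (model, prompt, text, temperature) -> translation.
TranslateFn = Callable[[str, str, str, float], str]


def compare_models(
    translate: TranslateFn,
    models: List[str],
    prompt: str,
    text: str,
    reference: str,
    temperature: float = 0.0,
) -> Dict[str, dict]:
    """Send identical prompt, text, and temperature to every model."""
    results = {}
    for model in models:
        output = translate(model, prompt, text, temperature)
        # Rough character-level similarity to a human reference translation.
        score = SequenceMatcher(None, output, reference).ratio()
        results[model] = {"output": output, "similarity": round(score, 3)}
    return results


def fake_translate(model: str, prompt: str, text: str, temperature: float) -> str:
    """Stand-in for a real API call; returns canned strings for the demo."""
    canned = {
        "model-a": "Der Patient wurde entlassen.",
        "model-b": "Der Patient ist entlassen worden.",
    }
    return canned[model]


if __name__ == "__main__":
    report = compare_models(
        fake_translate,
        models=["model-a", "model-b"],
        prompt="Translate into German.",
        text="The patient was discharged.",
        reference="Der Patient wurde entlassen.",
    )
    for model, row in report.items():
        print(model, row["similarity"], row["output"])
```

To use it with a real service, replace `fake_translate` with a function that calls your API client and returns the translated string. Keeping the model call behind a single callable makes it easy to rerun the identical inputs later and see whether a given model’s output has changed.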

