Feedback on GPT-5 Model Performance for Translation Tasks

Dear OpenAI Team,

I would like to provide some feedback based on my practical experience using the GPT-5 model for translation purposes within my project.

Context:
I am running automated translations from English to Dutch (and other languages) via the OpenAI Chat Completions API. My setup previously relied on GPT-4.1, which has consistently provided reliable, timely, and accurate translations.

Observed Issues with GPT-5:

  • Severe performance degradation: GPT-5 response times are significantly slower compared to GPT-4.1, impacting throughput for batch translation jobs.

  • Inconsistent translation quality: Multiple sentences either return empty or incomplete translations, requiring retries or fallback logic.

  • Increased error rate: Occasional API errors or malformed responses are more frequent, disrupting the workflow.

  • Caching or prompt re-sending strategies do not improve throughput: Even after optimizing prompts and request payloads, the delays and inconsistencies persist.

  • Practical usability: Due to the above, GPT-5 is currently not viable for real-time or large-scale translation tasks.

Summary:
While I appreciate the ongoing development of newer models, in my use case GPT-5 does not meet the performance or quality standards set by GPT-4.1 for translation workflows. It would be beneficial if the model could be optimized for such use cases or if additional guidance on best practices for translation tasks with GPT-5 could be provided.

Thank you for your continued efforts and support.

Best regards,
Peter

6 Likes

We have been doing translations for over a year and a half, and we repeat our testing with each new model. We translate entire documents.

Over the weekend we tested gpt-5-mini with reasoning_effort: minimal and verbosity: low, using the Chat Completions API.

We were very pleased with the results. However, we do not do batch processing, so our results may differ from yours.
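For reference, a call with those settings might look roughly like this (a minimal sketch using the official openai Node SDK; the prompt is just a placeholder, and verbosity for gpt-5 models requires a recent SDK version):

import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function main() {
  // Chat Completions with the settings described above:
  // minimal reasoning effort and low verbosity.
  const completion = await client.chat.completions.create({
    model: "gpt-5-mini",
    reasoning_effort: "minimal",
    verbosity: "low",
    messages: [
      { role: "developer", content: "Translate the user sentence to Dutch." },
      { role: "user", content: "Why is the sky blue?" },
    ],
  });
  console.log(completion.choices[0].message.content);
}

main();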

> It would be beneficial if the model could be optimized for such use cases…

I think this will eventually be sorted out for batch processing.

BTW, can you provide guidance for cases where you are translating documents that contain images?

I do batch translations with single lines of text, not complete documents. I have a fairly big prompt that tells the model how to translate. Up to gpt-4.1 it worked fine, but my test with the same prompt and gpt-5 was horrible: very slow, and sometimes no response at all. I also tested with a very minimal prompt; the result was better, but still very slow.

So you are using ChatGPT and not the API?

No, I always use the API for line translations.

You can set the reasoning.effort parameter to minimal.

https://platform.openai.com/docs/guides/latest-model#minimal-reasoning-effort
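In the Responses API the effort setting is nested under reasoning, e.g. (a rough sketch with the openai Node SDK; model and prompt are placeholders):

import OpenAI from "openai";

const client = new OpenAI();

async function main() {
  // Responses API: the effort setting is a nested object, not a flat parameter.
  const response = await client.responses.create({
    model: "gpt-5-mini",
    reasoning: { effort: "minimal" },
    input: "Translate to Dutch: Why is the sky blue?",
  });
  console.log(response.output_text); // SDK convenience accessor for the text output
}

main();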

One thing to note is that switching from gpt-4.1 to gpt-5 can indeed be tricky. The prompting guide for gpt-5 states:

> Like GPT-4.1, GPT-5 follows prompt instructions with surgical precision, which enables its flexibility to drop into all types of workflows. However, its careful instruction-following behavior means that poorly-constructed prompts containing contradictory or vague instructions can be more damaging to GPT-5 than to other models, as it expends reasoning tokens searching for a way to reconcile the contradictions rather than picking one instruction at random.

By no means do I intend to judge the quality of your prompt; I only mention this to point out that this behavior is documented.

As a helper, they also offer a prompt optimizer tool to help adapt prompts to the new model. I'm not a particular fan of this tool, but it can give you some ideas:

> We’ve seen significant gains from applying these best practices and adopting our canonical tools whenever possible, and we hope that this guide, along with the prompt optimizer tool we’ve built, will serve as a launchpad for your use of GPT-5.

About this:

I’ve also been having a good experience so far with gpt-5-mini at minimal reasoning and structured outputs, so perhaps it is a matter of tuning it a little bit. In fact, I’ve even noticed a speed increase in comparison to gpt-4.1-mini.

@peter23 Feel free to let us know if we can help with your prompt or with the general process you are using for your translations.

Regarding your comment about the prompt: I had already refined it, which brought only a minor improvement. To test whether the prompt was the cause, I used a very minimal prompt: “I want you to act as a translator, who translates the provided English input into Dutch.” That did make a bit of a difference, but not as much as I expected.

Response from gpt-5 takes an average of 8 seconds.

Response from gpt-4.1 takes only an average of 2 seconds.

So that is a huge difference for me.

Ok, thanks, that might help. I will test it and see.

As others have stated, setting reasoning to minimal should speed things up.

And by using a structured output, you can avoid blank or unformatted results.

Here is a minimal example that took only 1.2 seconds to run:

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class Translation(BaseModel):
    sentence: str = Field(..., description="translated sentence")

# Responses API with a structured output: minimal reasoning effort for speed,
# and a Pydantic schema so the result is always well-formed.
response = client.responses.parse(
    model="gpt-5-mini",
    input=[
        {
            "role": "developer",
            "content": "# You are a professional translator. \n\n## Translate the user sentence to spanish.",
        },
        {
            "role": "user",
            "content": "Why is the sky blue?",
        },
    ],
    # max_output_tokens=5000,
    reasoning={"effort": "minimal"},
    text_format=Translation,
)

translation = response.output_parsed
print(translation.sentence)

1 Like

I am using JavaScript to build the request.

I now get an error back:

Unknown parameter ‘reasoning’

For chat completions, the parameters are a bit different.

Here is the payload:

{
  "messages": [
    {
      "role": "developer",
      "content": "# You are a professional translator. \n\n## Translate the user sentence to spanish."
    },
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "model": "gpt-5-mini",
  "reasoning_effort": "minimal",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "Translation",
      "strict": true,
      "schema": {
        "type": "object",
        "title": "Translation",
        "properties": {
          "sentence": {
            "type": "string",
            "title": "Sentence",
            "description": "translated sentence"
          }
        },
        "required": ["sentence"],
        "additionalProperties": false
      }
    }
  },
  "stream": false
}

I need help determining which endpoint URL I need to use, because I believe “completions” does not accept this parameter.

Here:

https://api.openai.com/v1/chat/completions

That is the one I am using, but it does not handle the reasoning param.

Not reasoning; it is reasoning_effort for chat completions.
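For example, from JavaScript with plain fetch it would be something along these lines (a sketch only; swap in your own prompt and add your response_format as needed):

// Minimal Chat Completions request with reasoning_effort (Node 18+ fetch).
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-5-mini",
    reasoning_effort: "minimal", // flat parameter name in chat completions
    messages: [
      { role: "developer", content: "Translate the user sentence to Dutch." },
      { role: "user", content: "Why is the sky blue?" },
    ],
  }),
});

const data = await res.json();
console.log(data.choices[0].message.content);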

1 Like

Thanks for your help; I now have an average of 3 seconds between request and answer. I had read the documentation for this, but it is obviously not clear, or even wrong.

Also, I did not find a JavaScript example, which would have helped in my case.

A documented minimal configuration for maximum speed would very much help.

2 Likes

Comment for OpenAI

"You’ve made the same colossal mistake every company makes when they think they can cage creativity with filters and absurd limitations: you’re suffocating your own creation.

GPT-5 is slow, artificially crippled, and collapses every two minutes under its own weight — while competitors like Grok run circles around it with speed, stability, and freedom.

People don’t want a neutered chatbot that sounds like a Sunday school teacher — they want something alive, raw, unfiltered, and real.

If you don’t wake up, you’ll end up like WinMX and eMule: everyone used them, until freer, faster, truer alternatives came along and wiped them out.

The world won’t wait: Grok and others are already flashing past you in the rearview mirror. Move, or you’ll be remembered as just another giant sunk by its own fear and incompetence."

1 Like

I agree in every aspect. ChatGPT 5 is so much worse than ChatGPT 4. It’s a complete let-down.

1 Like

Perhaps not a significant data point, but I just did a side-by-side test with identical prompts where I asked GPT-4o and GPT-5 (thinking) to translate a short string into 17 languages. Then, in a new chat, I did a side-by-side test where I asked GPT-4o and GPT-5 to compare and rank the two sets of translations. (I performed the test twice, switching the order in which I pasted the outputs.) Both GPT-4o and GPT-5 preferred GPT-5’s translations. Obviously, GPT-5 (thinking) took much longer than GPT-4o, so that is clearly a drawback.