I would like to provide some feedback based on my practical experience using the GPT-5 model for translation purposes within my project.
Context:
I am running automated translations from English to Dutch (and other languages) via the OpenAI Chat Completions API. My setup previously relied on GPT-4.1, which has consistently provided reliable, timely, and accurate translations.
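For reference, the per-line requests in this kind of setup can be sketched as follows. This is a simplified illustration, not my actual code: the real prompt is much larger, and the helper name is mine for this example only.

```python
# Sketch: build a Chat Completions payload for a single-line EN->NL
# translation. The system prompt here is illustrative, not my real one.
def build_translation_request(text: str, target_language: str = "Dutch") -> dict:
    return {
        "model": "gpt-4.1",
        "messages": [
            {
                "role": "system",
                "content": (
                    f"Translate the user's English input into {target_language}. "
                    "Return only the translation, with no commentary."
                ),
            },
            {"role": "user", "content": text},
        ],
    }
```

The payload is then sent with `client.chat.completions.create(**payload)` in the batch loop.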
Observed Issues with GPT-5:
Severe performance degradation: GPT-5 response times are significantly slower compared to GPT-4.1, impacting throughput for batch translation jobs.
Inconsistent translation quality: Multiple sentences either return empty or incomplete translations, requiring retries or fallback logic.
Increased error rate: Occasional API errors or malformed responses are more frequent, disrupting the workflow.
Caching or prompt re-sending strategies do not improve throughput: Even after optimizing prompts and request payloads, the delays and inconsistencies persist.
Practical usability: Due to the above, GPT-5 is currently not viable for real-time or large-scale translation tasks.
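The retry and fallback logic mentioned above boils down to something like the following generic sketch (function names are hypothetical, not my actual implementation): retry the primary model a few times, and if the result is still empty, fall back to a secondary model.

```python
# Hedged sketch of retry-with-fallback for empty translations.
# primary/fallback stand in for any functions that call the API,
# e.g. one wrapping gpt-5 and one wrapping gpt-4.1.
from typing import Callable, Optional

def translate_with_fallback(
    text: str,
    primary: Callable[[str], Optional[str]],
    fallback: Callable[[str], Optional[str]],
    retries: int = 2,
) -> Optional[str]:
    for _ in range(retries + 1):
        result = primary(text)
        if result and result.strip():
            return result
    # Primary model kept returning empty output: use the fallback model.
    return fallback(text)
```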
Summary:
While I appreciate the ongoing development of newer models, in my use case GPT-5 does not meet the performance or quality standards set by GPT-4.1 for translation workflows. It would be beneficial if the model could be optimized for such use cases or if additional guidance on best practices for translation tasks with GPT-5 could be provided.
I do batch translations of single lines of text, not complete documents. I have a fairly large prompt telling the model how to translate. Up to gpt-4.1 this worked fine, but my test with the same prompt and gpt-5 was terrible: very slow, and sometimes no response at all. I have also tested with a very minimal prompt; the result was better, but still very slow.
One thing to note is that switching from gpt-4.1 to gpt-5 can indeed be tricky. The prompting guide for gpt-5 states:
Like GPT-4.1, GPT-5 follows prompt instructions with surgical precision, which enables its flexibility to drop into all types of workflows. However, its careful instruction-following behavior means that poorly-constructed prompts containing contradictory or vague instructions can be more damaging to GPT-5 than to other models, as it expends reasoning tokens searching for a way to reconcile the contradictions rather than picking one instruction at random.
By no means do I intend to judge the quality of your prompt; the purpose of mentioning this is just to bring to light that there is documented behavior for this situation.
As a helper, they also offer the prompt optimizer tool to help adapt prompts to the new model. I'm not a particular fan of this tool, but it can give you some ideas:
We've seen significant gains from applying these best practices and adopting our canonical tools whenever possible, and we hope that this guide, along with the prompt optimizer tool we've built, will serve as a launchpad for your use of GPT-5.
About this:
I've also been having a good experience so far with gpt-5-mini at minimal reasoning and structured outputs, so perhaps it is a matter of tuning it a little bit. In fact, I've even noticed a speed increase in comparison to gpt-4.1-mini.
@peter23 Feel free to let us know if we can help with your prompt or the general idea of the process you are using in your translations.
Regarding your statement about the prompt: it had already been improved, and that brought only a minor gain. To test whether the prompt was the cause, I used a very minimal prompt: "I want you to act as a translator who translates the provided English input into Dutch". That did make a bit of a difference, but not as much as I expected.
A response from gpt-5 takes an average of 8 seconds, while a response from gpt-4.1 takes an average of only 2 seconds.
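For transparency, averages like these come from simple wall-clock timing over repeated requests, along these lines (a generic sketch, not my exact benchmark code):

```python
# Measure the mean wall-clock latency of a request function over
# several runs. request_fn stands in for any API call being timed.
import time
from typing import Callable

def average_latency(request_fn: Callable[[], object], runs: int = 5) -> float:
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        request_fn()
        total += time.perf_counter() - start
    return total / runs
```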
As others have stated, reasoning at minimal should speed things up.
And by using a structured output, you can avoid blank or unformatted results.
Here is a minimal example that took only 1.2 s to run (I've included the client setup so it is self-contained; it assumes OPENAI_API_KEY is set in the environment):

from pydantic import BaseModel, Field
from openai import OpenAI

client = OpenAI()

class Translation(BaseModel):
    sentence: str = Field(..., description="translated sentence")

response = client.responses.with_raw_response.parse(
    model="gpt-5-mini",
    input=[
        {
            "role": "developer",
            "content": "# You are a professional translator.\n\n## Translate the user sentence to Spanish."
        },
        {
            "role": "user",
            "content": "Why is the sky blue?"
        }
    ],
    # max_output_tokens=5000,
    reasoning={"effort": "minimal"},
    text_format=Translation,
)

translation = response.parse().output_parsed
print(translation.sentence)
Thanks for your help; I now average 3 seconds between request and answer. I have read the documentation on this, but it is evidently unclear, or even wrong in places.
Also, I did not find a JavaScript example, which would have helped in my case.
A minimal configuration for maximum speed would help very much.
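Based on the suggestions in this thread, my current working guess at such a minimal low-latency configuration looks like this. The parameter values are illustrative, and I assume it is sent via the Responses API, e.g. `client.responses.parse(**FAST_TRANSLATION_CONFIG, text_format=Translation)`:

```python
# Sketch of a minimal-latency setup for single-line translation:
# smallest suitable model, minimal reasoning effort, short instructions,
# and a capped output length. Values are illustrative, not authoritative.
FAST_TRANSLATION_CONFIG = {
    "model": "gpt-5-mini",               # smaller model: lower latency
    "reasoning": {"effort": "minimal"},  # skip extended reasoning tokens
    "max_output_tokens": 200,            # cap for single-line output (illustrative)
    "input": [
        {"role": "developer", "content": "Translate the user's English input into Dutch."},
        {"role": "user", "content": "Good morning"},
    ],
}
```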
"You've made the same colossal mistake every company makes when they think they can cage creativity with filters and absurd limitations: you're suffocating your own creation.
GPT-5 is slow, artificially crippled, and collapses every two minutes under its own weight, while competitors like Grok run circles around it with speed, stability, and freedom.
People don't want a neutered chatbot that sounds like a Sunday school teacher; they want something alive, raw, unfiltered, and real.
If you don't wake up, you'll end up like WinMX and eMule: everyone used them, until freer, faster, truer alternatives came along and wiped them out.
The world won't wait: Grok and others are already flashing past you in the rearview mirror. Move, or you'll be remembered as just another giant sunk by its own fear and incompetence."
Perhaps not a significant data point, but I just did a side-by-side test with identical prompts where I asked GPT-4o and GPT-5 (thinking) to translate a short string into 17 languages. Then, in a new chat, I did a side-by-side test where I asked GPT-4o and GPT-5 to compare and rank the two sets of translations. (I performed the test twice, switching the order in which I pasted the outputs.) Both GPT-4o and GPT-5 preferred GPT-5's translations. Obviously GPT-5 (thinking) took much longer than GPT-4o, so that's clearly a drawback.