I’ve noticed a significant downgrade. I’ve built a translation tool. When translating the same text from English to Dutch, I’ve observed some strange errors that GPT4-0314 doesn’t make at all. GPT4 (0613 and now the default model) appears to be more on par with 3.5-turbo, which is a massive downgrade.
We also notice it with more advanced coding questions. The results are much less usable, there are more errors, and it seems that GPT4 doesn’t understand at the level the 0314 model does.
This is not some minor downgrade, sometimes it misinterprets the entire context of a question. Hopefully things will improve, for now I’ll just use the 0314 model, which is unfortunate, since I did like the speed-improvements.
Can you post some examples of the prompts and results that are not up to your expectations? If anything is translated, could you provide a reason why the new one is bad, plus the old correct version for the non-Dutch speakers? On the code side, just the code will be fine. Thank you.
For example, I translated a blog, where this English sentence:
Please check my new Podcast!
Was translated into:
Controleer alstublieft mijn nieuwe Podcast!
The entire context is wrong. “Controleer” would be used if something needs to be checked / inspected.
GPT4-0314 translated it to:
Bekijk alsjeblieft mijn nieuwe podcast!
This is a much better interpretation, although not perfect (it means ‘view my podcast’). After a second try with GPT4-0314 it changed it to “Luister naar mijn podcast”, which nailed it.
As for coding, it’s difficult to give a concrete example, since I send a lot of huge codeblocks…
What will happen, though, is that the GPT4-0613 model and newer seem to take ‘shortcuts’. They will give you only a portion of the code, it’s less complete, and they seem to forget a lot of things which the old GPT4 simply doesn’t forget. I tried it side-by-side, and often switch back to 0314 because I get frustrated. The 0314 model gives good results almost all the time.
A colleague of mine also had some issues while asking for some advanced changes in an HTML table regarding an RFM lifecycle model (numbers needed to be applied to certain scenarios). The new GPT4 just said: “This is too advanced for me.” And on the second try it just gave up, or gave some wrong answers. The GPT4-0314 model almost completed the task without errors.
The translation info is certainly interesting, thanks for that.
On the coding part: the new model obeys the system message quite a bit more now. If you are performing coding tasks, I find it beneficial to create a coding persona and place that in the system message, along the lines of: “You are an expert computer programmer who specialises in {lang}; you always produce code to the highest standards and use all industry best practices; when asked for code you produce entire blocks of complete, accurate…” etc.
This will set the model up to produce code from an expert’s perspective and not that of a “helpful assistant”.
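A minimal sketch of what building such a request payload might look like in Python (the persona wording, the helper name, and the parameter choices are all just illustrations, not an official recipe):

```python
# Sketch: front-load a coding persona into the system message of a
# Chat Completions request. The payload dict mirrors the API shape;
# you would pass it to your client of choice.

def build_coding_request(user_prompt: str, lang: str = "Python") -> dict:
    """Build a chat payload whose system message sets an expert coding persona."""
    system_persona = (
        f"You are an expert computer programmer who specialises in {lang}. "
        "You always produce code to the highest standards and use all "
        "industry best practices. When asked for code, you produce entire "
        "blocks of complete, accurate code rather than partial snippets."
    )
    return {
        "model": "gpt-4-0613",  # the model being discussed in this thread
        "messages": [
            {"role": "system", "content": system_persona},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.2,  # lower temperature for more deterministic code
    }

request = build_coding_request("Refactor this function to be pure.")
print(request["messages"][0]["role"])  # system
```

The point is simply that the persona lives in the `system` role, ahead of the user’s actual question, so it shapes every reply in the conversation.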
Depending on additional context it could be understood as,
Please check [out] my new podcast!
Please check my new podcast [for errors]!
And while the first is more likely to be understood in a colloquial sense, the second is closer to a plain reading of the text.
There may also be a bit of randomness at play here. I’d be curious to see if you were to run this prompt 25 times, how often it goes each way.
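One way to put a number on that randomness is to run the same prompt repeatedly and tally the outputs. Here is a sketch of the tallying side; the sampler is a stub with made-up probabilities standing in for the actual API call, since the real translation frequencies are exactly what you would be measuring:

```python
import random
from collections import Counter

def translate_once(rng: random.Random) -> str:
    """Stub for a real API call: returns one of the two observed Dutch
    translations. The 80/20 weights are illustrative only."""
    return rng.choices(
        ["Controleer alstublieft mijn nieuwe Podcast!",
         "Bekijk alsjeblieft mijn nieuwe podcast!"],
        weights=[0.8, 0.2],
    )[0]

def tally_translations(n: int, seed: int = 0) -> Counter:
    """Run the (stubbed) translation n times and count each variant."""
    rng = random.Random(seed)
    return Counter(translate_once(rng) for _ in range(n))

counts = tally_translations(25)
```

Swap the stub for a real completion call and the same `Counter` tells you how often the model goes each way over 25 runs.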
Edit: I went ahead and did 10 runs with your prompt and with corrected English.
As you can see, the default model assumes you want someone to view your new podcast only 20% of the time in this sample. Whereas when using proper English, the default model has zero difficulty constructing a proper translation.
Incidentally, it seems even the default gpt-3.5-turbo has no issue doing the translation when phrased properly in English (though I don’t read Dutch and can’t independently verify this).
If this sort of thing is a real concern for you, your best bet is to do a set of evaluation prompts and save them (right click save as) and then try them again at a later point.
I did this a few months ago. I have yet to see failure, and I reword the prompts to make sure it’s not a caching thing.
And I have no problem calling out OpenAI on anything and everything, so you can be assured I’m not sucking up.
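For anyone wanting to do the same, here is a minimal sketch of such an evaluation set saved to disk so the identical prompts can be re-run months later (the file name and record fields are my own invention, not a standard format):

```python
import json
from pathlib import Path

# Hypothetical eval file: each record pairs a prompt with the answer
# you judged good on the day you saved it.
EVAL_PATH = Path("gpt4_evals.json")

def save_evals(evals: list) -> None:
    """Persist the evaluation prompts as pretty-printed JSON."""
    EVAL_PATH.write_text(json.dumps(evals, indent=2))

def load_evals() -> list:
    """Reload the saved prompts for a later re-run against the model."""
    return json.loads(EVAL_PATH.read_text())

evals = [
    {"prompt": "Translate to Dutch: Please check out my new podcast!",
     "reference": "Luister naar mijn nieuwe podcast!"},
]
save_evals(evals)
reloaded = load_evals()
```

Re-running the stored prompts against a new model snapshot and diffing against your saved references gives you evidence either way, instead of a gut feeling.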
For the above: in English we don’t often say ‘please’ when marketing, though we might say it if we were asking someone to do some editing/verification. The “!” after it is weird, though.
I agree with others, it’s not a good example of failure.