Davinci-text-003 worse than gpt-3.5-turbo in non-English language?


I am using the playground since a couple of days. I want to generate German short descriptions about German cities.

First attempt: “completion mode” with davinci-text-003.

A prompt with all settings set to default, would look like this:

"Schreibe eine Kurzbeschreibung über die Stadt “Hamburg”. (Write a short description about the city “Hamburg”.)

I was surprised that the answers vary a lot in regard to quality. There is an almost 30% chance that I get a result being unusable. It then responds clearly wrong information, such as “Hamburg is the biggest city in Germany”, which is actually Berlin and Hamburg 2nd largest. Given that Hamburg has a pop of 1.8M and Berlin 3.6M, this is a pretty big mistake.

For another smaller city that I know, it mixes up the cities’ amenities with those of a city nearby.

Asking it the same questions in English, I get better results. Hamburg is not the biggest city anymore, but it still tells me a church in “Kempten” is located in “Kaufbeuren”, which is wrong.

Second attempt: “chat-mode” with gpt-3.5-turbo.

None of such mistakes happen. The responses are totally usable and all information is correct.

I am wondering if gpt-3.5-turbo is another improved model compared to davinci-text-003.

Kind regards,


Hi @steluhh,

When you evaluate the performance of models, there are several things to consider here: First, how good is the synthesis of the answers with respect to your question, i.e. how flexible is the response to questions, grammar, syntax, etc.? On the other hand, how much factual data is available.

Especially with your question the difference becomes evident. You can imagine it as if you ask a not so smart person a question, but he has a lexicon available to answer you and a smart person with a worse lexicon. If you ask a question about the short description, the smarter person will probably give you a worse answer, but if you ask both people to write you a poem or create something new, the smarter person would probably have the better answer.

So to answer your question, it depends on the circumstances and training data. For pure “lexicon queries” I would use Chat GPT in your place. If you want to have code written then rather Davinci.

I hope this clarifies things for you, otherwise please feel free to ask :slight_smile:

1 Like