So, I’m using gpt-3.5-turbo to answer questions from a large English-language dataset of very technical information. When I ask a question in English, I get a straightforward, accurate answer in a 2-3 line completion.
However, when I ask exactly the same question in French, I get a much longer answer, complete with explanations and examples! The completion is 100% accurate but much more verbose.
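To put a number on "much more verbose", here is a minimal sketch that compares two completions by word and sentence count. The function names `verbosity` and `length_ratio` are made up for illustration, and the counting is deliberately rough (for example, an elided French article such as "l’altitude" counts as two words), so treat the result as an indication rather than a precise measure.

```python
import re


def verbosity(text: str) -> dict:
    """Return rough size measures (word and sentence counts) for one completion."""
    # \w+ matches runs of letters/digits, including accented characters.
    words = re.findall(r"\w+", text)
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    return {"words": len(words), "sentences": len(sentences)}


def length_ratio(short_text: str, long_text: str) -> float:
    """How many times longer long_text is than short_text, by word count."""
    return verbosity(long_text)["words"] / max(verbosity(short_text)["words"], 1)
```

Pasting the English completion in as `short_text` and the French one as `long_text` gives a single ratio that can be tracked across many questions, which is more convincing than one anecdotal pair.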
Lots of human feedback training in English = lots of crippling of output length.
There was actually an early OpenAI study suggesting that human reinforcement done by outsourced second-language English speakers, working quickly through tasks, favors simple answers and answers that repeat the input back without insight. The cause was the quality of those reviewers: they were not looking deeply into technical meanings or at what would actually be useful to the specialist asking.
Thus we now have a gpt-3.5-turbo that rephrases exactly what I asked as bullet points, without adding anything useful.
GPT not only has multiple variations of truth, it also gives different responses and behaves differently depending on the language you address it in. Davinci is a perfect model to test this on.
More often than not, in English it will declare that it is Christian.
This!
At times, several pages of instructions are handed to clickworkers who have very good reason to assume they will be denied their $20 an hour if they actually take the time to think through each evaluation, because that is too slow.
Then there are cases when a foreign language project is just about to miss the deadline. Suddenly exclusion criteria will be adapted and the costs “explode” while the quality of work increases, or at least changes.
It’s an incentive problem that can be found in many, many models that use “high quality, human evaluated data”. Mini-GPT4, for example, has quite a few flaws in its training data (the image descriptions provided to the model are incorrect, claiming visible objects that are not actually there), and as a consequence the model supposedly “hallucinates” objects into its image descriptions that are not in the images provided by the user.
I like Mini-GPT4 a lot, but I was somewhat annoyed to learn that the model has to be retrained before its potential performance can be evaluated in practice.
From all of the above, one would have expected the English completion to be more comprehensive than the French one. French has a much smaller vocabulary and a much smaller user base.
Here’s the example I used:
Q. What is the difference between pressure altitude and density altitude?
A. The difference between pressure altitude and density altitude is that pressure altitude is the actual altitude of the aircraft in current conditions, while density altitude is the ISA altitude equivalent to the pressure altitude, taking into account the temperature deviation. Density altitude is important because it affects the aerodynamic and engine performance of the aircraft, as it will perform as if it is at a higher altitude than it actually is.
Q. quelle est la différence entre l’altitude pression et l’altitude densité
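A. La différence entre l’altitude pression et l’altitude densité est que l’altitude pression est l’altitude réelle de l’avion dans les conditions actuelles, tandis que l’altitude densité est l’altitude équivalente à l’altitude pression selon les conditions ISA. L’altitude densité est très significative car les performances aérodynamiques et moteur dépendent directement de la densité de l’air. Par exemple, si la déviation ISA est de +20C et l’altitude pression est de 5000 pieds, l’altitude densité serait de 5000 pieds + (+20 x 120 pieds) = 7 400 pieds. Dans cet exemple, les performances aérodynamiques et moteur de l’avion volant à 5000 pieds seraient en fait équivalentes à un avion volant à 7400 pieds dans des conditions ISA. Il faudra plus de temps pour le décollage, il montera.
[English translation: The difference between pressure altitude and density altitude is that pressure altitude is the actual altitude of the aircraft in current conditions, while density altitude is the altitude equivalent to the pressure altitude under ISA conditions. Density altitude is very significant because aerodynamic and engine performance depend directly on air density. For example, if the ISA deviation is +20C and the pressure altitude is 5000 feet, the density altitude would be 5000 feet + (+20 x 120 feet) = 7,400 feet. In this example, the aerodynamic and engine performance of the aircraft flying at 5000 feet would in fact be equivalent to that of an aircraft flying at 7400 feet in ISA conditions. Takeoff will take longer, it will climb.]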
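The worked example in the French completion (pressure altitude plus 120 ft per °C of ISA deviation) can be checked with a small sketch. The 120 ft per °C factor comes from the completion itself and is a rule of thumb, not an exact formula. The ISA values used to derive a deviation from an outside air temperature (15 °C at sea level, roughly 2 °C lost per 1000 ft) are standard approximations that I am adding as assumptions, and the function names are made up for illustration.

```python
# Rule-of-thumb factor taken from the quoted completion.
FT_PER_DEG_C = 120.0
# Assumed ISA approximations (not from the thread): 15 °C at sea level,
# about 2 °C colder per 1000 ft of altitude.
ISA_SEA_LEVEL_C = 15.0
ISA_LAPSE_C_PER_1000_FT = 2.0


def isa_temperature_c(pressure_altitude_ft: float) -> float:
    """Approximate ISA standard temperature at a given pressure altitude."""
    return ISA_SEA_LEVEL_C - ISA_LAPSE_C_PER_1000_FT * pressure_altitude_ft / 1000.0


def density_altitude_ft(pressure_altitude_ft: float, isa_deviation_c: float) -> float:
    """Rule of thumb: add 120 ft for every °C the air is warmer than ISA."""
    return pressure_altitude_ft + FT_PER_DEG_C * isa_deviation_c


def density_altitude_from_oat_ft(pressure_altitude_ft: float, oat_c: float) -> float:
    """Same rule of thumb, starting from the outside air temperature (OAT)."""
    deviation = oat_c - isa_temperature_c(pressure_altitude_ft)
    return density_altitude_ft(pressure_altitude_ft, deviation)


# The example from the completion: 5000 ft pressure altitude, ISA +20 °C.
print(density_altitude_ft(5000, 20))  # 7400.0
```

So the arithmetic in the longer French answer holds up: 5000 ft at ISA +20 °C gives 7400 ft under this approximation.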