BIG MISTAKE: the results from GPT-3.5 and GPT-4 do not coincide at all

I tried to compare two results from a translated text. You will see the difference (left: GPT-3.5, right: GPT-4).

They are by no means identical. Apart from the fact that neither translation is very good, they do not even coincide.

This is a big mistake, that they do not coincide. I mean, if I ask GPT-3.5 and GPT-4 how much 5+5 is, is it correct for one to give me 10 and the other 15?

This is a very big mistake. The results must coincide in both versions. Otherwise you do not know which of them is true.

I understand that the Pro version is better, but it should be better in the plugins, extensions, and integrations. The results should not be allowed to differ.

This is the text I translated:


Among the conditions in which acupuncture proves useful, we list:

* musculoskeletal and joint pain (low back pain, sciatica, torticollis, [osteoarthritis](https://www.cdt-babes.ro/articole/artroza.php) in its various forms - knee arthrosis, hip arthrosis, shoulder arthrosis, spondylosis, pain in the back, neck, hips, arms)
* minor structural changes of the spine (incipient [herniated disc](https://www.cdt-babes.ro/articole/hernie-de-disc.php))
* in the digestive sphere: bloating, heartburn, hyperacidity, abdominal pain, constipation, [diarrhea](https://www.cdt-babes.ro/articole/gastroenterita_diaree_estivala.php)
* in the sphere of the nervous system: [anxiety](https://www.cdt-babes.ro/articole/anxietatea.php), [insomnia](https://www.cdt-babes.ro/articole/igiena-somnului.php), neurovegetative syndrome
* endocrine disorders: menstrual pain, irregular periods, infertility

In chronic diseases, acupuncture helps by relieving symptoms, making flare-ups less frequent, and supporting functional rehabilitation.

Removed ‘Bug’ category.

This isn’t a bug; it’s not a significant error on the part of the Large Language Model (LLM).

The LLM selects different paths as it constructs the completion. If you’re expecting deterministic results, that’s not the purpose of an LLM. Its role is to transform sequences of tokens into another sequence of tokens. In your case, the input tokens likely constituted an instruction prompt, and the output is the generated completion.
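As a toy illustration of why two runs can legitimately diverge (a sketch only — the probabilities and tokens below are invented, not real model output), each next token is drawn from a probability distribution, so sampling can pick different words while greedy decoding is always the same:

```python
import random

# Invented next-token distribution a model might assign mid-sentence.
next_token_probs = {"pain": 0.55, "ache": 0.30, "discomfort": 0.15}

def sample_token(probs, rng):
    """Sample one token in proportion to its probability (how LLMs usually decode)."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

def greedy_token(probs):
    """Always pick the single most likely token (deterministic)."""
    return max(probs, key=probs.get)

# Two sampled runs with different random states may legitimately differ...
run_a = sample_token(next_token_probs, random.Random(1))
run_b = sample_token(next_token_probs, random.Random(7))

# ...while the greedy choice is identical on every run.
print(run_a, run_b, greedy_token(next_token_probs))
```

This is why identical prompts to GPT-3.5 and GPT-4 (or even the same model twice) need not produce identical completions.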

If much of this is new to you, consider seeking resources to learn about neural networks, LLMs, instruction models, training, and related concepts.

To learn more about LLMs, watch videos by Andrej Karpathy.

Proficiency with these technologies requires acquiring substantial knowledge.


I’m not good at neural networks. But I know the system must be improved, because if it does not provide reliable data, I and other people will not be able to use it — there is a risk.

For example, for translations: ChatGPT could easily be improved by comparing texts that have already been translated. Let’s say the Internet Archive has the same book translated into several foreign languages.

If ChatGPT hashes all those books, it can identify the same book across several languages, and then match the sentences according to the keywords in each phrase. The published books were translated correctly by professionals, so ChatGPT should be able to learn by comparing translations and improve quickly.
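The alignment idea could be sketched roughly like this (a toy illustration of the poster's proposal, not anything ChatGPT actually does; the keyword heuristic and all names here are invented for the example):

```python
# Toy sketch: pair sentences of a translated book with the original by
# overlap of crude "keywords" - proper nouns and numbers, which tend to
# survive translation unchanged. Purely illustrative.

def keywords(sentence):
    """Extract capitalized words and numbers as translation-stable keywords."""
    return {w.strip(".,") for w in sentence.split()
            if w[:1].isupper() or w[:1].isdigit()}

def align(source_sentences, target_sentences):
    """Pair each source sentence with the target sentence sharing most keywords."""
    pairs = []
    for src in source_sentences:
        best = max(target_sentences,
                   key=lambda tgt: len(keywords(src) & keywords(tgt)))
        pairs.append((src, best))
    return pairs

english = ["Maria visited Paris in 1990.", "The weather was mild."]
romanian = ["Vremea a fost blanda.", "Maria a vizitat Parisul in 1990."]

pairs = align(english, romanian)
print(pairs)
```

A real sentence aligner would use much stronger signals (sentence length ratios, bilingual dictionaries, embeddings), but the principle of pairing parallel texts is the same.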

So far ChatGPT’s translation rates about a 7 out of 10, while Google Translate rates an 8.5.

The most important thing is to be able to translate any phrase into any language, no matter how hard.

At the student level, paraphrases and loose compositions are fine. But for important things you need accuracy, not probabilities.

There are half a dozen AI models from OpenAI that can translate.

And if you are paying 15x more per sentence, getting the exact same quality from each would be counter to all expectations.

If you use the OpenAI API to make programmatic calls to the AI models, you can lower the sampling parameters made available there, such as top_p, so that when there are many possibilities such as “spine” or “spinal column” or even “vertebral…” they aren’t randomly chosen on each run the way they would be for creative writing. You can have the most probabilistically likely token chosen instead.
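To see why capping top_p stabilizes the output, here is a toy nucleus-sampling sketch (the candidate words and probabilities are made up for illustration; the real tokenizer and distributions differ):

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, total = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {t: p / total for t, p in kept.items()}

# Competing translations for the same source word (invented numbers):
candidates = {"spine": 0.50, "spinal column": 0.30, "vertebral": 0.20}

# With a tiny top_p, only the single most likely token survives,
# so every run deterministically picks "spine".
print(nucleus_filter(candidates, top_p=0.1))

# With top_p = 1.0, all three candidates remain in play,
# and sampling can legitimately vary between runs.
print(nucleus_filter(candidates, top_p=1.0))
```

In an API call this corresponds to passing a small `top_p` (and/or `temperature=0`) so repeated translations come out the same.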

Within ChatGPT, whether 3.5 or the “Plus” plan that enables 4.0, the generated text tends to use more creative word choices. Multiple runs of the same translation task on the same model, with exactly the same chat, won’t be the same each time.

How can I use the OpenAI API to make programmatic calls to the AI models, so as to get very good translations?

(Of course, I asked ChatGPT, but I got a very sophisticated answer.)

Some easy steps? Can you tell me?

It’s not a secret; it only needs some existing programming and computer skills, or the willingness to learn.

https://platform.openai.com/docs/quickstart?context=python
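Following that quickstart, a minimal translation script might look like the sketch below (a sketch under assumptions: it uses the current `openai` Python package's chat-completions interface, the `model` choice and prompt wording are just examples, and `OPENAI_API_KEY` must be set in your environment):

```python
import os

def build_messages(text, target_language="English"):
    """Build a chat request asking for a faithful, literal translation."""
    return [
        {"role": "system",
         "content": f"You are a professional translator. Translate the user's "
                    f"text into {target_language} accurately and literally."},
        {"role": "user", "content": text},
    ]

def translate(text, target_language="English"):
    # Imported lazily so the request-building code above can be read and
    # tested without the SDK installed (pip install openai).
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",                 # or another model available to you
        messages=build_messages(text, target_language),
        temperature=0,                 # suppress random word choices
        top_p=1,
    )
    return response.choices[0].message.content

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    print(translate("Vremea a fost blanda."))
```

Setting `temperature=0` asks for the most likely token at each step, which is what makes repeated runs of the same translation come out (nearly) identical.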