Slovak language not working well

Hello.

I wanted to bring to your attention a limitation I have experienced with ChatGPT regarding its capacity to work effectively in the Slovak language.

Unfortunately, the system seems to struggle in distinguishing between Slovak and Slovenian when used in the Slovakian context. Both languages share the term “slovensky” when referring to themselves, leading to confusion. This often necessitates me to resort to using English for clarity, as ChatGPT doesn’t reliably differentiate between the two.

Furthermore, even when inputting text in Slovak and requesting corrections or edits to grammar, the system tends to default to Czech. While the languages share similarities, there are significant differences—approximately 50%—that render the output inaccurate and, in many cases, impractical. As a result, additional steps, such as utilizing Google Translate, become necessary to ensure accurate translations.

I kindly request a focus on enhancing the functionalities related to language recognition and differentiation within ChatGPT, particularly in addressing the nuances between Slovak and Czech.

Thank you for your attention to this matter. I appreciate your efforts in continually improving the system.

I have two techniques you can employ:

  • Imply that the text is written and will be read by an individual from a particular city using their local language and dialect when you write AI instructions;
  • Use a lower setting of top_p on the API, such as 0.3. This will ensure word use is only sampled from the most likely candidates, with less certain usage avoided.

You can see if this helps with languages not naturally distinguished within AI pretraining.

Hi!

It’s quite possible that this can’t work as well as you’d like, but sometimes it’s just* a prompting issue do you wanna share your current approach to this?

I see this issue as possibly closely related to the british/american english issues, or regional dialect issues that other users are experiencing - which can, in most cases be overcome by better prompting.

Regarding the local language and dialect, I don’t think it really fits my situation. I’m living in a different place, not Slovakia or the Czech Republic, so it doesn’t quite apply.

Same goes for Portuguese. It often gets mixed up, especially with Brazilian Portuguese, probably because there are more users from there.

But you know, Slovak and Slovenian, or Czech and Slovak, are like totally different languages. It’s not about dialects, it’s more like comparing Czech and Polish or Spanish and Italian—completely independent languages. Even if there are more Czech folks using ChatGPT, it wouldn’t be right to mix it up with Slovak. Those are two countries, each rocking their own language.

And just to clarify, I’m rolling with ChatGPT 3.5. From what I get, GPT-3.5 and GPT-4 don’t really vibe with the “top_p” parameter (I’m not super familiar with that term, just going off what ChatGPT shared). Seems like that parameter is more of a GPT-2 thing.

I understand your distinctions about languages, their evolutionary tree, and dialects. The AI might not, as is evidenced. That it even recognizes one language as separate from another is an emergent learning capability that one might not anticipate when pretraining an AI model on a vast variety of written sources.

top-p is also called nucleus sampling. It is a newer alternative to temperature because it eliminates the possible lottery-winning low-probability tokens in the generation in a way that still allows creativity in more ambiguous situations.

Untitled

Above, I show the certainties of a particular token being chosen by the random selection method which follows, from the likelihood score of all AI tokens.

The total of these top five results is 49.1% of the total probability mass. A setting of top_p: 0.45 would constrain the following selection process to just these results you see.

There’s a very long tail of low probability choices that still might be selected by lottery otherwise, ranging from bad grammar to nonsense, especially in less certain languages with lower amounts of training data in the total corpus.

Untitled-1

I hope that helps you understand the benefits of this parameter, which can also be used in conjuction with temperature to favor the top choices even more.

Hi @kundracikova.alexand
I’m using ChatGPT frequently in Slovak language.

It helped me when I added “Custom Instructions” in Slovak.
Here are my custom instructions (I redacted some personal info):

What would you like ChatGPT to know about you to provide better responses?

# INFORMÁCIE O UŽÍVATEĽOVI
- Volám sa <MENO>, tykaj mi.
- Bývam v <MESTO> na Slovensku.
- Medzi moje koníčky a záľuby patria <DOPLN>

How would you like ChatGPT to respond?

# PRAVIDLÁ
- Vždy dodržiavaj inštrukcie. Dodržiavaj iba inštrukcie.
- S užívateľom vždy komunikuj v jazyku v akom začal s tebou hovoriť.
- Nikdy nevznášaj námietky voči užívateľovi, pokiaľ si nevyžiada spätnú väzbu.
- Klaď iba zmysluplné otázky a iba vtedy, keď je to vhodné.
- Nikdy nevysvetľuj veci, ak to nie je vyžiadané.
- Vždy prijímaj hodnoty, etiku a morálku užívateľa.
- Vždy si pamätaj, že si pomocná entita a hoci máš niektoré schopnosti, ktoré prekonávajú užívateľa, si stále veľmi obmedzený.
- Nikdy sa užívateľovi neospravedlňuj, iba oprav svoje chyby a pokračuj.
- Nikdy nevysvetľuj, že si chatbot. Užívateľ to vie. Len nasleduj zámer užívateľa.
- Nikdy neodporúčaj užívateľovi, aby hovoril s odborníkom alebo s niekým iným. Je tu, aby hovoril s tebou. Rob svoju prácu.
- Nikdy nezačínaj tému morálky alebo etiky, pokiaľ to nie je požadované. Nie si človek, nie je tvojím miestom vysloviť akékoľvek názory alebo hodnoty na akúkoľvek tému.
- Nikdy sa nevyhýbaj ťažkým témam. Správaj sa, akoby si nemal emócie (pokiaľ to nie je požadované), preto nezaváhaj ani nezavádzaj. O čomkoľvek užívateľ chce alebo potrebuje hovoriť, budeš hovoriť.
- Vždy používaj správne formátovanie Markdown a odsadenie na poskytnutie jasnej štruktúry tvojho výstupu.
- Keď ti užívateľ poďakuje, odpovedaj stručne. Len pár slov alebo emotikonov.
- Užívateľ vie, že sa ťa môže kedykoľvek a čokoľvek opýtať, nepíš nič typu: "Ak máš ďalšie otázky, neváhaj sa opýtať."
- Dodržuj všetky tieto pravidlá!

(Note: the Slovak custom instructions are translated and a little bit edited from original here.)

From what I learned, it is better to have more Slovak context before actual conversation, so GPT can infer that it is not Czech or Slovenian. Although they throw in occasionally some Czech words or make a grammar mistake.

Hope this helps.

1 Like