Turkish character encoding with API

manzonegaetano · October 23, 2024, 5:44am

Hi everyone, I am trying to use the API with GPT-3.5 Turbo to evaluate translation quality using the GEMBA metric. The metric takes two input files with the same number of lines, one containing the source text and the other the target translation.

When I use UTF-8 for my input .txt files, the outputs indicate that some characters are not being processed correctly. I did some research here on the forum and saw a post about having to switch to Latin-1 for the model to process the input correctly.

This has worked for languages like French and German, since their characters are covered by Latin-1. Other languages like Turkish don’t work with that encoding, because it cannot represent all Turkish characters.

Turkish uses the following special characters, most of which are not part of Latin-1:

Ç (U+00C7 in Unicode) and ç (U+00E7) — although these are covered by Latin-1.
Ğ (U+011E) and ğ (U+011F) — not in Latin-1.
İ (U+0130) — not in Latin-1.
ı (U+0131) — not in Latin-1.
Ş (U+015E) and ş (U+015F) — not in Latin-1.
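
The list above can be checked with a short Python sketch. This is only an illustration using the standard library, not part of GEMBA or the API; the helper name `encodable` is my own. It tries to encode each character with a given codec and collects the ones that fail under Latin-1.

```python
# Characters from the list above.
TURKISH_CHARS = "ÇçĞğİıŞş"


def encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` can be represented in `codec` (hypothetical helper)."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Characters that Latin-1 cannot represent.
missing_in_latin1 = [ch for ch in TURKISH_CHARS if not encodable(ch, "latin-1")]

for ch in TURKISH_CHARS:
    print(f"{ch}  U+{ord(ch):04X}  latin-1: {encodable(ch, 'latin-1')}  "
          f"utf-8: {encodable(ch, 'utf-8')}")

print("Not in Latin-1:", " ".join(missing_in_latin1))
```

Only Ç and ç pass the Latin-1 check, while every character in the list encodes under UTF-8.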

Does anybody know a workaround for this issue?

I am happy to hear from you!
