Gpt-4-1106-preview messes up function call parameter encoding

It seems the solution would rather be to turn away non-English-speaking AI models.

The amount of time required to come up with a solution, some time next year, likely after this model has been discarded, indicates to me a significant problem with the training.

A comparative example of a small but expensive line of code: TinyLlama, who are training their own 1.1B model on 3T tokens over three months (training a small model on more data than Llama 70B), had a shuffle-algorithm problem where parts of the training corpus were being repeated at the expense of other parts being omitted. They discarded 500 million tokens of pretraining, but due to cost and time couldn’t go back further and redo it.

I’m struggling to find a single DevDay announcement that turned out “good”…

1 Like

Any news on this?

I would love to have a fix before “~ january”. Not to mention all the queries we are sending into the void and the money we are wasting on this.

1 Like

Any developments on this matter? Despite the recent update to the new models, the issue seems to persist. Any insights or updates would be highly appreciated. Thank you.

As a temporary solution we are calling the model again to fix the encoding issue, without specifying the json_format; we define it as an “encoding issues corrector assistant” and pass the JSON as a string. An improvement to this approach would be to pass only the wrongly encoded sentences, to save tokens.

We have also tried lowering temperature and top_p, currently using 0.3 and 0.2 respectively, and specifying in the prompt to use only characters present in the Spanish alphabet.
With this we are getting better responses (our use case is to extract literal fragments from a big text), meaning that they match the original text, but then the model never uses characters outside the ASCII range.

In this case, the fix step is better, as fixing orthographic issues seems to be easier than fixing encodings, where sometimes 2 corrupt chars were mapped to 1 correct char or vice versa. It may be that the task “fix orthographic issues” is more present in the training set, but I am just speculating here.
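A minimal sketch of that second pass, assuming the v1 openai Python SDK; the model name, sampling settings, and system-prompt wording below are just the description above turned into code, not an exact copy of what we run:

from openai import OpenAI

client = OpenAI()

def fix_encoding(broken_json_str: str) -> str:
    # Second pass: ask the model to repair mojibake, deliberately
    # WITHOUT response_format={"type": "json_object"}.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        temperature=0.3,
        top_p=0.2,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an encoding issues corrector assistant. "
                    "The user will send a JSON document as a string. "
                    "Fix any wrongly encoded characters and return only "
                    "the corrected JSON, with no notes or comments."
                ),
            },
            {"role": "user", "content": broken_json_str},
        ],
    )
    return response.choices[0].message.content

To save tokens, the user message could carry only the wrongly encoded sentences instead of the whole document.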

2 Likes

In addition to Unicode escape sequences (\uXXXX), we also sometimes get percent-encoding. Example:

Hus%C3%B8ysund

%C3%B8 = ø
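For reference, Python’s standard library decodes that form directly; a quick illustration, separate from anything posted in this thread:

import urllib.parse

# percent-escapes are decoded as UTF-8 bytes by default
print(urllib.parse.unquote("Hus%C3%B8ysund"))  # -> Husøysund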

Another example (screenshot of garbled output omitted); that output appeared even after I tried:

Answer format(JSON utf-8, with support for Ìüø)

Hello!!!

We are getting answers with correct encoding from time to time, not reliably, but I would say 60% of the answers are correct now.

Has anything changed with this model? (gpt-4-1106-preview)

Anecdotally we’re also seeing some improvement in gpt-3.5-turbo-1106. We still sometimes get the \\\\uXXXX and %XX style encoding/escaping (at higher frequency for certain languages like Turkish), but these we can programmatically clean up. Importantly, though, we’re no longer seeing the hopelessly, completely corrupted characters.

…Anecdotally.

Though for some languages like Turkish, we still get neat stuff like this:

"u00c7ocuu011funuzu00a0hem eu011flenceli vakit geu00e7irmesi u00e7in eitu015fli bir u015fekilde eitu015fleu015ftirmenize yardu0131mcu0131 olabilir. Ancak kuu00e7uu011fu00fcn u00e7ocuu011flaru0131nu0131n yutma riski olabilir. "

:man_shrugging:

In case it helps anyone, here’s our current “fixer” function to clean up output from gpt-3.5-turbo-1106. This is by no means perfect. And notably, not part of the “standard library”.

import html
import re
import urllib.parse

def fix_bunk_encoding(raw_function_call_arguments_output: str) -> str:
    # fix double escaped unicode (\\u0000 -> \u0000)
    cleaned_string = re.sub(
        r"\\\\u([0-9a-fA-F]{4})", r"\\u\1", raw_function_call_arguments_output
    )

    # fix unicode that was not escaped (u0000 -> \u0000)
    cleaned_string = re.sub(
        r"(?<!\\)u[0-9a-fA-F]{4}", lambda match: "\\" + match.group(0), cleaned_string
    )

    # unescape % encoded characters (e.g. %20 -> " ")
    cleaned_string = urllib.parse.unquote(cleaned_string)

    # unescape html entities (e.g. &uuml; -> ü)
    cleaned_string = html.unescape(cleaned_string)

    return cleaned_string
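
For illustration, a hypothetical run on an arguments string resembling the Turkish sample above (the input is made up):

import json

# made-up arguments string with unescaped uXXXX-style sequences
broken = '{"text": "u00c7ocuu011funuz iu00e7in"}'
fixed = fix_bunk_encoding(broken)
# the regexes restore proper \uXXXX escapes, so json.loads can decode them
print(json.loads(fixed)["text"])  # -> Çocuğunuz için
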
6 Likes

Maybe it’s helpful: in Spanish, using gpt-3.5-turbo-1106, we introduced one Unicode escape inside one message in the messages list, like this:
"content": "\u00bfCómo sé si me están intentando manipular?"
and the amount of bad characters in the response has been reduced significantly
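
One way to reproduce that if you build the request body yourself: json.dumps with its default ensure_ascii=True turns every non-ASCII character into a \uXXXX escape in the payload (a sketch; whether the OpenAI SDK serializes the same way is an assumption):

import json

messages = [
    {"role": "user", "content": "¿Cómo sé si me están intentando manipular?"},
]
# ensure_ascii=True (the default) emits \u00bf, \u00f3, ... in the body
body = json.dumps({"model": "gpt-3.5-turbo-1106", "messages": messages})
print(body)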

2 Likes

we are getting significant improvements (badly encoded answers dropped to ~20%) by structuring the prompt as below, using temperature = 0.3, as our task is to extract literal fragments from a big text:

***Context***
- You are an assistant...
***Objectives***
1- ...
2- ...
***Restrictions***
- Do not add any note or comment to the JSON output.
- Use only characters present in the input.
- All the content in your response must be in Spanish.
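
A rough sketch of wiring that up, assuming the v1 openai Python SDK; the prompt text is abbreviated from the template above and the variable names are made up:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """***Context***
- You are an assistant that extracts literal fragments from a text.
***Objectives***
1- Return the requested fragments as a JSON list.
***Restrictions***
- Do not add any note or comment to the JSON output.
- Use only characters present in the input.
- All the content in your response must be in Spanish.
"""

source_text = "...el texto de entrada..."  # the big input text

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    temperature=0.3,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": source_text},
    ],
)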

Hope it helps

4 Likes

It seems the solution would rather be to turn away non-English-speaking AI models.

Will OpenAI turn non-English-speaking AI models away?

Maybe a workaround would be to use GPT 4.1 and then feed the output into 3.5 to make it guess/correct the missing/broken characters. I have not tried it yet, though.

We are also dependent on the 4.1 JSON API. It is strange that the bug is not present in Python…

Is there any schedule on it getting fixed?

1 Like

I just started attempting to produce automated translations using the API with function calling, and similarly I am facing strange issues with international characters; ü, ä and Ü often become {\r}, {\"a} and {\"o}, and if not, they are often simply left out of the output, and only some of the time correctly represented.
This happens even if I bring it up explicitly in the system prompt.

I’m guessing maybe the function calling fine-tuning was done with English only? Because it doesn’t struggle like this when simply generating localized copywriting…

We are experiencing the same issues when using the endpoint:

https://api.openai.com/v1/chat/completions

Surprisingly, we don’t have an issue when we use the models hosted by azure:

https://xxx.openai.azure.com/openai/deployments/yyy/chat/completions?api-version=2023-12-01-preview

2 Likes

I’ve implemented a fix that runs the output of the function call through 3.5 turbo to get it to output the correct characters. It works flawlessly, is not that expensive, and adds very little extra time to the function, at least for my use case. Hope that helps. REALLY annoying!

Getting the same issue with model gpt-4-1106-preview when calling OpenAI functions via LangChain. Unfortunately, their JsonOutputFunctionsParser doesn’t seem to be capable of handling the incorrectly encoded characters well, which results in a lot of ��� in the output JSON :confused:

Do you guys still have this issue? Any known fixes?

This still seems to be considerably broken. At least in Spanish it’s pretty unusable and unpredictable: not just the incorrectly encoded characters, but also tons and tons of newlines eating up tokens like crazy for no reason.
Has anyone else been able to confirm that Azure endpoints don’t have this issue? If so we might have to migrate over to Azure.

2 Likes

It’s a model-related issue. I think it’s unlikely that Microsoft would have a fix while OpenAI users would be stuck.

You can go back to the 0613 versions and not solicit (nor receive) parallel tool calls (with their smaller context length and better overall quality). Imagine it’s October.

The arguments dumping a whole bunch of newlines (either escaped or as the character code) can be caused by the forced json-mode of functions. You can specify “valid JSON” all over the function description and system message to reduce the chance of this, but the behavior is still brought out strongly by high-UTF-8 or Unicode characters, even in later JSON substrings after English text.
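
For illustration, a hypothetical function definition peppered with those “valid JSON” hints (the function name and schema here are made up):

functions = [
    {
        "name": "save_answer",
        "description": (
            "Store the answer. Arguments must be valid JSON with "
            "properly encoded UTF-8 characters and no runs of newlines."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "Answer text, as valid JSON string content.",
                }
            },
            "required": ["answer"],
        },
    }
]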

Attempts to improve garbage function outputs:

1 Like

Hello,

Something to collapse the repeated \n\n\n… runs would be useful here.
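
Something like this, perhaps; a guess that treats the runs as escaped \n text inside the raw arguments string:

import re

raw_arguments = '{"text": "hola\\n\\n\\n\\n\\n\\nmundo"}'
# collapse runs of three or more escaped newlines down to a single pair
cleaned = re.sub(r"(\\n){3,}", r"\\n\\n", raw_arguments)
print(cleaned)  # -> {"text": "hola\n\nmundo"}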

Kind regards,