Gpt-4-1106-preview messes up function call parameter encoding

It seems the solution would rather be to turn away non-English-speaking AI models.

The amount of time required to come up with a solution, some time next year, likely after this model has been discarded, indicates to me a significant problem with the training.

A comparative example of a small but expensive line of code: TinyLlama, who are training their own 1.1B model on 3T tokens over three months (training a small model on more data than Llama 70B), had a shuffle-algorithm problem where parts of the training corpus were being repeated at the expense of other parts being omitted. They discarded 500 million tokens of pretraining, but due to cost and time couldn’t go back further and redo it.

I’m struggling to find a single DevDay announcement that turned out “good”…

1 Like

Any news on this?

I would love to have a fix before “~ january”. Not to mention all the queries we are sending into the void and the money we are wasting on this.

1 Like

Any developments on this matter? Despite the recent update to the new models, the issue seems to persist. Any insights or updates would be highly appreciated. Thank you.

As a temporary solution we are calling the model again to fix the encoding issue, without specifying the json_format; we define it as an “encoding issues corrector assistant” and pass the JSON as a string. An improvement to this approach would be to pass only the wrongly encoded sentences, to save tokens.

We have also tried lowering temperature and top_p, currently using 0.3 and 0.2 respectively, and specifying in the prompt to use only characters present in the Spanish alphabet.
With this we are getting better responses (our use case is to extract literal fragments from a big text), meaning that they match the original text, but then the model never uses characters outside the ASCII range.

In this case, the fix step is better, as fixing orthographic issues seems to be easier than fixing encodings, where sometimes 2 corrupt chars were mapped to 1 correct char or vice versa. It may be that the task “fix orthographic issues” is more present in the training set, but I am just speculating here.
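A minimal sketch of that second pass, assuming the v1 openai Python SDK; the model name, sampling settings, and system-prompt wording below are just the description above turned into code, not an exact copy of what we run:

from openai import OpenAI

client = OpenAI()

def fix_encoding(broken_json_str: str) -> str:
    # Second pass: ask the model to repair mojibake, deliberately
    # WITHOUT response_format={"type": "json_object"}.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        temperature=0.3,
        top_p=0.2,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an encoding issues corrector assistant. "
                    "The user will send a JSON document as a string. "
                    "Fix any wrongly encoded characters and return only "
                    "the corrected JSON, with no notes or comments."
                ),
            },
            {"role": "user", "content": broken_json_str},
        ],
    )
    return response.choices[0].message.content

To save tokens, the user message could carry only the wrongly encoded sentences instead of the whole document.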

2 Likes

In addition to Unicode escape sequences (\uXXXX), we also sometimes get percent-encoding. Example:

Hus%C3%B8ysund

%C3%B8 = ø
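For reference, Python’s standard library decodes that form directly; a quick illustration, separate from anything posted in this thread:

import urllib.parse

# percent-escapes are decoded as UTF-8 bytes by default
print(urllib.parse.unquote("Hus%C3%B8ysund"))  # -> Husøysund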

Another example (screenshot of garbled output omitted); that output appeared even after I tried:

Answer format(JSON utf-8, with support for Ìüø)

Hello!!!

We are getting answers with correct encoding from time to time, not reliably, but I would say 60% of the answers are correct now.

Has anything changed with this model? (gpt-4-1106-preview)

Anecdotally we’re also seeing some improvement in gpt-3.5-turbo-1106. We still sometimes get the \\\\uXXXX and %XX style encoding/escaping (at higher frequency for certain languages like Turkish), but these we can programmatically clean up. Importantly, though, we’re no longer seeing the hopelessly, completely corrupted characters.

…Anecdotally.

Though for some languages like Turkish, we still get neat stuff like this:

"u00c7ocuu011funuzu00a0hem eu011flenceli vakit geu00e7irmesi u00e7in eitu015fli bir u015fekilde eitu015fleu015ftirmenize yardu0131mcu0131 olabilir. Ancak kuu00e7uu011fu00fcn u00e7ocuu011flaru0131nu0131n yutma riski olabilir. "

:man_shrugging:

In case it helps anyone, here’s our current “fixer” function to clean up output from gpt-3.5-turbo-1106. This is by no means perfect. And notably, not part of the “standard library”.

import html
import re
import urllib.parse

def fix_bunk_encoding(raw_function_call_arguments_output: str) -> str:
    # fix double escaped unicode (\\u0000 -> \u0000)
    cleaned_string = re.sub(
        r"\\\\u([0-9a-fA-F]{4})", r"\\u\1", raw_function_call_arguments_output
    )

    # fix unicode that was not escaped (u0000 -> \u0000)
    cleaned_string = re.sub(
        r"(?<!\\)u[0-9a-fA-F]{4}", lambda match: "\\" + match.group(0), cleaned_string
    )

    # unescape % encoded characters (e.g. %20 -> " ")
    cleaned_string = urllib.parse.unquote(cleaned_string)

    # unescape html entities (e.g. &uuml; -> ü)
    cleaned_string = html.unescape(cleaned_string)

    return cleaned_string
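
For illustration, a hypothetical run on an arguments string resembling the Turkish sample above (the input is made up):

import json

# made-up arguments string with unescaped uXXXX-style sequences
broken = '{"text": "u00c7ocuu011funuz iu00e7in"}'
fixed = fix_bunk_encoding(broken)
# the regexes restore proper \uXXXX escapes, so json.loads can decode them
print(json.loads(fixed)["text"])  # -> Çocuğunuz için
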
6 Likes

Maybe it’s helpful: in Spanish, using gpt-3.5-turbo-1106, we introduced one Unicode escape inside one message in the messages list, like this:
"content": "\u00bfCómo sé si me están intentando manipular?"
and the amount of bad characters in the response has been reduced significantly
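
One way to reproduce that if you build the request body yourself: json.dumps with its default ensure_ascii=True turns every non-ASCII character into a \uXXXX escape in the payload (a sketch; whether the OpenAI SDK serializes the same way is an assumption):

import json

messages = [
    {"role": "user", "content": "¿Cómo sé si me están intentando manipular?"},
]
# ensure_ascii=True (the default) emits \u00bf, \u00f3, ... in the body
body = json.dumps({"model": "gpt-3.5-turbo-1106", "messages": messages})
print(body)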

2 Likes

we are getting significant improvements (badly encoded answers dropped to ~20%) by structuring the prompt as below, using temperature = 0.3, as our task is to extract literal fragments from a big text:

***Context***
- You are an assistant...
***Objectives***
1- ...
2- ...
***Restrictions***
- Do not add any note or comment to the JSON output.
- Use only characters present in the input.
- All the content in your response must be in Spanish.
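
A rough sketch of wiring that up, assuming the v1 openai Python SDK; the prompt text is abbreviated from the template above and the variable names are made up:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """***Context***
- You are an assistant that extracts literal fragments from a text.
***Objectives***
1- Return the requested fragments as a JSON list.
***Restrictions***
- Do not add any note or comment to the JSON output.
- Use only characters present in the input.
- All the content in your response must be in Spanish.
"""

source_text = "...el texto de entrada..."  # the big input text

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    temperature=0.3,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": source_text},
    ],
)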

Hope it helps

4 Likes

It seems the solution would rather be to turn away non-English-speaking AI models.

Will OpenAI turn non-English-speaking AI models away?

Maybe a workaround would be to use GPT 4.1 and then feed the output into 3.5 to make it guess/correct the missing/broken characters. I have not tried it yet, though.

We are also dependent on the 4.1 JSON API. It is strange that the bug is not present in Python…

Is there any schedule on it getting fixed?

1 Like

I just started attempting to produce automated translations using the API with function calling, and similarly I am facing strange issues with international characters; ü, ä and Ü often become {\r}, {\"a} and {\"o}, and if not, they are often simply left out of the output, and only some of the time correctly represented.
This happens even if I bring it up explicitly in the system prompt.

I’m guessing maybe the function calling fine-tuning was done with English only? Because it doesn’t struggle like this when simply generating localized copywriting…

We are experiencing the same issues when using the endpoint:

https://api.openai.com/v1/chat/completions

Surprisingly, we don’t have an issue when we use the models hosted by azure:

https://xxx.openai.azure.com/openai/deployments/yyy/chat/completions?api-version=2023-12-01-preview

2 Likes

I’ve implemented a fix that runs the output of the function call through 3.5 turbo to get it to output the correct characters. It works flawlessly, is not that expensive, and adds very little extra time to the function, at least for my use case. Hope that helps. REALLY annoying!

Getting the same issue with model gpt-4-1106-preview when calling OpenAI functions via LangChain. Unfortunately, their JsonOutputFunctionsParser doesn’t seem to be capable of handling the incorrectly encoded characters well, which results in a lot of ��� in the output JSON :confused:

Do you guys still have this issue? Any known fixes?

This still seems to be considerably broken. At least in Spanish it’s pretty unusable and unpredictable: not just the incorrectly encoded characters, but also tons and tons of newlines eating up tokens like crazy for no reason.
Has anyone else been able to confirm that Azure endpoints don’t have this issue? If so we might have to migrate over to Azure.

2 Likes

It’s a model-related issue. I think it’s unlikely that Microsoft would have a fix while OpenAI users would be stuck.

You can go back to the 0613 versions and not solicit (nor receive) parallel tool calls (with their smaller context length and better overall quality). Imagine it’s October.

The arguments dumping a whole bunch of newlines (either escaped or as the character code) can be caused by the forced json-mode of functions. You can specify “valid JSON” all over the function description and system message to reduce the chance of this, but the behavior is still brought out strongly by high-UTF-8 or Unicode characters, even in later JSON substrings after English text.
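
For illustration, a hypothetical function definition peppered with those “valid JSON” hints (the function name and schema here are made up):

functions = [
    {
        "name": "save_answer",
        "description": (
            "Store the answer. Arguments must be valid JSON with "
            "properly encoded UTF-8 characters and no runs of newlines."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "Answer text, as valid JSON string content.",
                }
            },
            "required": ["answer"],
        },
    }
]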

Attempts to improve garbage function outputs:

1 Like

Hello,

Something to collapse the repeated \n\n\n… runs would be useful here.
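
Something like this, perhaps; a guess that treats the runs as escaped \n text inside the raw arguments string:

import re

raw_arguments = '{"text": "hola\\n\\n\\n\\n\\n\\nmundo"}'
# collapse runs of three or more escaped newlines down to a single pair
cleaned = re.sub(r"(\\n){3,}", r"\\n\\n", raw_arguments)
print(cleaned)  # -> {"text": "hola\n\nmundo"}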

Kind regards,