Azure GPT-4-Turbo JSON mode response generation breaks after 1024 tokens

Overview

Continuing this post.

I’m running data extraction tasks on documents and trying to take advantage of the 128k context window that gpt-4-turbo offers, as well as the JSON mode setting. I’m hitting a bug where generation breaks at token length 1024. As far as I know, the output length limit should be 4096. The symptoms resemble an issue the docs mention, but the existing solutions and recommendations are insufficient and don’t explain this specific observation.

Arbitrary Code Note

While writing this post, I put together code to reproduce this error on demand and got an “error message” I had never seen during my data extraction tasks.

{
    "status": "failed",
    "reason": "The current configuration of the AI model does not support generating a response that exceeds 1024 tokens. Counting to 1500 in JSON would go beyond this token limit and is therefore not possible within a single response."
}

This is NOT a real error message from the endpoint. It’s the GPT-generated response from the chat completion. This confuses me even more, but it leads me to believe there’s some config setting that’s disabled or not working as intended, or, on the off chance, that OpenAI as a whole forgot to enable the 4096 output token length.

Describe the bug

Infinite Stream of Blank Characters

  • When using JSON mode, always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don’t include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don’t forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.

My prompts do include the JSON mode instruction, and even then, I observe that when the response generation reaches token length 1024, it’ll begin generating an infinite stream of blank characters until content_filter is triggered. Below, I’ve stripped the blank space characters to indicate the token length.

Premature Stop and Malformed JSON Response

  • The JSON in the message the model returns may be partial (i.e. cut off) if finish_reason is length, which indicates the generation exceeded max_tokens or the conversation exceeded the token limit. To guard against this, check finish_reason before parsing the response.

The returned response ended with the length stop condition. The response was 1058 tokens long; however, on closer inspection the JSON was malformed (beyond being merely incomplete), and the malformation began exactly where the token count reached 1024!
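For reference, a minimal sketch of that finish_reason guard (non-streaming call for brevity; it reuses the client and initial_messages from the reproduction code below):

import json

# Sketch of the finish_reason check described in the docs excerpt above.
response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The generation was cut off, so the JSON is likely partial or malformed.
    raise RuntimeError("Truncated response (finish_reason=length); not parsing.")

parsed = json.loads(choice.message.content)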

Code to Reproduce

Just create some messages and make a chat completion that will likely generate a response with token length > 1024. Here’s some code to get started:

import os

import tiktoken
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)

initial_messages = [
    {
        "role": "system",
        "content": "You are an AI Assistant. You will follow the user instruction. You will write a JSON response.",
    },
    {
        "role": "user",
        "content": "I want you to use GPT to generate the counting to 1500. Do not create code. Respond in JSON only. I do not care about reasonableness. You will count to 1500. You will write each number individually, and your output length max is 4096 tokens.",
    },
]
  
response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)
response_from_stream = ""

# Track consecutive whitespace-only output so the blank-character bug can be
# detected without waiting for the content_filter stop.
consecutive_blank_space_count = 0
blank_space_threshold = 125
for chat_completion_chunk in response_stream:
    if (
        hasattr(chat_completion_chunk, "choices")
        and chat_completion_chunk.choices
    ):
        if (
            hasattr(chat_completion_chunk.choices[0], "delta")
            and chat_completion_chunk.choices[0].delta
            and hasattr(
                chat_completion_chunk.choices[0].delta, "content"
            )
        ):
            content = chat_completion_chunk.choices[0].delta.content
            if content:
                response_from_stream += content

                if not content.isspace():
                    consecutive_blank_space_count = 0
                else:
                    consecutive_blank_space_count += len(content)
                    if (
                        consecutive_blank_space_count
                        > blank_space_threshold
                    ):
                        raise Exception(
                            "Encountered blank space bug."
                            + "\n"
                            + "Response from Stream Token Length: "
                            + str(
                                len(
                                    tiktoken.get_encoding("cl100k_base").encode(
                                        response_from_stream
                                    )
                                )
                            )
                            + "\n"
                            + "Response from Stream: "
                            + response_from_stream
                        )

Troubleshooting

Confirm max_tokens and model deployment

I can confirm this runs, and increasing max_tokens beyond 4096 causes a model error, as expected.

response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)
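And the negative check, sketched below (this assumes the openai>=1.x Python client, where the service’s 400 response surfaces as openai.BadRequestError):

import openai

# Asking for more output tokens than the model supports should be rejected
# by the service before any generation happens.
try:
    client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=initial_messages,
        max_tokens=4097,  # one past the documented 4096 output limit
    )
except openai.BadRequestError as error:
    print("Rejected as expected:", error)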

Test random settings and prompts

I tried different prompts, temperature values, etc., and included more and fewer “JSON” instructions. The model never generated more than 1024 tokens.
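Roughly what the sweep looked like (the prompt variants and temperatures below are illustrative, not the full set I tried; client and tiktoken come from the reproduction code above):

prompt_variants = [
    "Respond in JSON only. Count from 1 to 1500, one number per array element.",
    "You must output valid JSON. Produce a JSON array counting from 1 to 1500.",
]

for temperature in (0.0, 0.5, 1.0):
    for prompt in prompt_variants:
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": "You are an AI Assistant. Respond in JSON."},
                {"role": "user", "content": prompt},
            ],
            max_tokens=4096,
            temperature=temperature,
        )
        content = response.choices[0].message.content or ""
        # Strip trailing whitespace so the count reflects real content.
        tokens = len(tiktoken.get_encoding("cl100k_base").encode(content.rstrip()))
        print(f"temperature={temperature}: {tokens} tokens")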

JSON mode on vs. off

Generation is not capped at the observed 1024-token limit when JSON mode is OFF. This appears to be a bug specific to having JSON mode on.
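The comparison was essentially the same request with response_format toggled (sketch below; token counting is again cl100k_base via tiktoken):

for json_mode in (True, False):
    kwargs = dict(
        model="gpt-4-turbo",
        messages=initial_messages,
        max_tokens=4096,
    )
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}
    response = client.chat.completions.create(**kwargs)
    content = response.choices[0].message.content or ""
    tokens = len(tiktoken.get_encoding("cl100k_base").encode(content.rstrip()))
    print(f"JSON mode {'ON ' if json_mode else 'OFF'}: {tokens} tokens")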

PS: Whoever created the setting that new users can’t upload more than one image or attach two links, I don’t like you.


Running into odd issues with JSON as well. @khoinguyen, does your test replicate on the 0125 release too?

I can confirm the error persists with GPT-4 0125-preview on Azure for my problematic tasks that attempt to generate a JSON response longer than 1024 tokens. I haven’t run the test I shared in the post, but I imagine the error will persist there too.


I am running into this issue as well. Any solutions?

I think I have a solution. If you are stringing together ‘user’ and ‘assistant’ messages and you instructed JSON output in your system prompt, the model forgets the instruction after about four message exchange pairs (even with gpt-4-turbo). So you need to remind it to output JSON in your most recent ‘user’ message.

Just use the pattern:

user_prompt_msg = {
    'role': 'user',
    'content': f"""(Unspoken NOTE: Don't forget to respond in structured JSON using the schema defined above!) {query}.""",
}

api_messages_list = [sys_prompt_msg] + central_messages_list + [user_prompt_msg]
...
model='gpt-3.5-turbo-0125',
messages=api_messages_list,
response_format={"type": "json_object"},
stream=True,
...

This seems to do the trick. Obviously, you will need to implement a mechanism that limits the number of message objects in central_messages_list to avoid hitting token limits.
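For that limiting mechanism, something along these lines should work (a rough sketch; trim_history and the cl100k_base tokenizer choice are mine, not anything the API requires):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def trim_history(messages, max_history_tokens=3000):
    """Drop the oldest user/assistant pair until the history fits the budget."""
    trimmed = list(messages)
    while trimmed and sum(len(encoding.encode(m["content"])) for m in trimmed) > max_history_tokens:
        trimmed = trimmed[2:]  # drop the oldest user/assistant pair
    return trimmed

central_messages_list = trim_history(central_messages_list)
api_messages_list = [sys_prompt_msg] + central_messages_list + [user_prompt_msg]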

But since I started using this pattern, it has always streamed valid JSON, because the reminder is in the most recent message.

Let me know if this works for you!