JSON mode with GPT-4 Turbo stops after 1050 tokens

Dear community and OpenAI Team :slight_smile:

I’m testing JSON mode in our app, and it reliably stops generating after 1050 generated tokens. Also, before stopping, the model emits many whitespace characters at the end.

As I read on the forum, the output is capped at 4095 tokens. Why does it stop after 1050 tokens in JSON mode?

Also, are there any timelines on when the new Turbo models will be production-ready? We have a use case where the larger input context is extremely important!



Are you using 3.5 or 4 with json mode?

Are you adding (!) in the prompt that the model should output json?

It’s gpt-4-turbo.

Are you adding (!) in the prompt that the model should output json?

Why do I need to add (!) to the prompt?

My bad, the exclamation was there to highlight that you must state inside the prompt that you want JSON output, and ideally include a description of the JSON. Example: Gpt-3.5-turbo-1106 is very slow - #26 by TonyAIChamp

Yes, the prompt includes a description of the JSON “schema” (not a valid JSON Schema, though) I want the model to generate. All completions contain the JSON in the format I want. However, the model just stops generating a proper response after about ~1k tokens (sometimes mid-sentence, as you can see in the screenshot), adds those whitespace characters, and stops completely after 1050 tokens.

The prompt is already tested by hundreds of users on a daily basis, so this error happens regularly, and the output always ends after 1050 tokens.

I am also getting limited or poor responses based on what appears to be a token limit.

Though it looks more like my responses all stop around the ~2,000-token mark.

What I figured may be the case: responses in JSON are generally much shorter than responses without JSON: Json responses in gpt-3.5-turbo-1106 much shorter than without json? - #27 by TonyAIChamp

Important: when using JSON mode, you must also instruct the model to produce JSON yourself via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly “stuck” request. Also note that the message content may be partially cut off if finish_reason="length", which indicates the generation exceeded max_tokens or the conversation exceeded the max context length.
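Based on the doc text above, here is a minimal sketch (the function name is mine, not from the thread or the API) of guarding against both failure modes when handling a JSON-mode completion: a `finish_reason` of `"length"` means the output was truncated, and trailing whitespace padding should be stripped before parsing.

```python
import json


def parse_json_mode_response(content: str, finish_reason: str) -> dict:
    """Parse a JSON-mode completion, guarding against the failure modes
    described in the docs: truncation and trailing whitespace padding."""
    if finish_reason == "length":
        # Generation hit max_tokens or the context limit, so the JSON is
        # likely cut off mid-structure and will not parse.
        raise ValueError("completion truncated; raise max_tokens or shorten the prompt")
    # JSON mode sometimes pads the end of the output with whitespace,
    # so strip it before parsing.
    return json.loads(content.strip())
```

In a real handler you would read `content` and `finish_reason` from the first choice of the API response.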

OpenAI should have released a classifier model to prevent people from using JSON mode before reading the instructions

Even if you are asking for JSON you need to make it want to write JSON


Thanks @RonaldGRuckus can you share the link for that for reference? Was banging my head against this one as well. Forum to the rescue.


Here you are. It may make sense to try the model without JSON mode first in development. I believe it’s an external validator. If the model doesn’t want to write JSON it results in an infinite loop.



It’s definitely not an external validator, as the answers to the exact same questions with and without JSON differ dramatically in quality (instruction-following) and length.

I don’t really understand what you mean, but generally … no…

It obviously makes a lot of sense that forcing GPT to write in JSON can result in very different qualities of responses.

I’ll play ball though.

What do you think the JSON mode does?

You will understand what I mean if you check out the conversation I posted the link to above :wink:

Overall it seems like JSON mode takes up so much of the model’s attention that very little is left for the rest of the instructions, which it starts to follow pretty poorly.

You are saying JSON mode is this?

You always reply in the json format with the fields:
- ai_message: your message to the Human
- status: assessment status
- status_comment (very detailed comment to the status, not to be shown to the Human)

It takes a lot of effort to make GPT-4 spam whitespace until it hits max_tokens

Here is a direct link to response_format in the API documentation with that text.

For the GPT-4 vision model, OpenAI has set an unreasonably low max_tokens default, almost to force you to specify the parameter. They may have done the same with JSON mode. You can include a max_tokens value of your own to avoid truncation. max_tokens is both a context-length reservation for forming the output and the cap on what you will get.

(why forced short? To make you specify it of course. To cut their processing of AI gone loopy. The rate limiter can’t measure and count unknown outputs against you …)
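A hedged sketch of that advice, assuming the chat-completions request shape discussed in this thread; the helper name and the 2048 default cap are mine, not the API's:

```python
def build_json_mode_request(system_prompt: str, user_msg: str, max_tokens: int = 2048) -> dict:
    """Assemble request parameters for a JSON-mode chat completion."""
    # JSON mode requires the prompt itself to mention JSON, or the endpoint
    # rejects the request / the model may loop on whitespace.
    if "json" not in system_prompt.lower():
        raise ValueError("prompt must instruct the model to output JSON")
    return {
        "model": "gpt-4-1106-preview",
        "response_format": {"type": "json_object"},
        # Explicit cap: avoids a low server-side default truncating the
        # output, and reserves that much context for the reply.
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
    }
```

The resulting dict can then be unpacked into the SDK call that creates the completion.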

The rest of the blurb you read is that json is not actually “guaranteed”, otherwise you wouldn’t get strings of nonsense. They put a check for the word “json” into the endpoint just to avoid the worst misapplication, but you should always specify what kind of JSON it is you wish to receive with lists of keys, examples, or schema for ideal results.
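For example, a key list with an inline example in the system prompt (a hypothetical prompt of my own, echoing the advice above):

```python
# Hypothetical system prompt: names each expected key, shows an example
# object, and includes the word "json" so the endpoint accepts the
# json_object response_format.
SCHEMA_PROMPT = (
    "Reply only with a json object containing these fields:\n"
    "- summary: a one-sentence answer\n"
    "- confidence: a float between 0 and 1\n"
    'Example: {"summary": "Paris is the capital of France.", "confidence": 0.95}'
)
```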

yes +

llm_params_gpt35turbo1106json = {
    "model": "gpt-3.5-turbo-1106",
    "temperature": 0,
    "response_format": {"type": "json_object"},
    "timeout": 5,
}

If JSON mode is instructions, why would there be a notice that we also need to set instructions to write JSON?

I’m not sure why you are asking me about this :slight_smile: I’m not from OpenAI


I got to play with the leaked version of the AI model outside of the chat endpoint.

Consider it like this: The AI can write functions. It has been trained on function-calling, and has a preference even without any language talking about functions. But it is not going to emit a properly formed “weather function” without that function specification done right.

JSON mode just gives higher confidence in good output.


100%. I have an “Idea Generator” that has been writing an array of shallow objects for months, hasn’t failed once (90% of it is done by me though LOL). No JSON mode needed.


Are you saying that JSON mode uses a different model? Not disagreeing. Genuinely interested in this thought