Does ChatGPT understand indented JSON better than minified?

I’m wondering if ChatGPT is any better at parsing indented JSON vs minified JSON (or code in general…)

I’m looking at an application that would involve asking natural language questions about large-ish JSON objects.

I could see it being better at understanding indented JSON, because with indentation, you need a lot less context to see how deep something is in the hierarchy (just look at how many spaces came after the last newline). And it feels like the indentation, while redundant, may serve as a cue that would make it easier to interpret the structure.

On the other hand, it’s kind of a token-burner, and ChatGPT itself claims to be perfectly happy with minified JSON when you ask it.

So - has anyone experimented with this - or have some good reasons for believing one side or the other?

2 Likes

I’m also curious to see if anyone else has actually quantified this. I assume the answer depends on how complex your data is because (like you imply) indentation gives it significantly more context. Since best (LLM) results are typically attained with more/better context, I personally indent JSON outputs and mark them with a #TODO to come back and test should the code ever make it to the light of day.

1 Like

To answer your question simply, since there are a lot of nuanced philosophical considerations: yes, if you indent your JSON data with spaces, the model will better understand the nature of the JSON format. It interprets the spacing as a form of hierarchy. This goes beyond interpreting the structure; it gains a deeper understanding of the relationships in the data.

It’s not a token burner. In Byte Pair Encoding, leading spaces are included with the first word of the line, so the indentation becomes part of the first token after it and doesn’t use additional tokens.

Hello World
This is a fine day indeed
I think I’ll go pick apples

Can’t really see the spaces in this as the forum is correcting for them, but imagine the first has one space, second, two, and the last has four.

" Hello World" (1 space before “Hello”):
[" Hello", " World"]
The leading space is part of the " Hello" token.

" This is a fine day indeed" (2 spaces before “This”):
[" This", " is", " a", " fine", " day", " indeed"]
The two spaces are part of the " This" token.

" I think I’ll go pick apples" (4 spaces before “I”):
[" I", " think", " I’ll", " go", " pick", " apples"]
The four spaces are part of the " I" token.

Also notice that every space inside the sentences is part of the word that follows it, which means the spaces don’t get tokens of their own. An exception is double spaces: the two spaces would then form a token.

2 Likes

The number of tokens can vary greatly when indenting JSON using json.dumps, depending on the complexity of the nested structure.
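To put a rough number on this without committing to a particular tokenizer, here is a minimal sketch that compares character counts for the same object serialized by `json.dumps` in minified and indented form. Character count is only a stand-in for token count (an assumption; real counts depend on the tokenizer), and `nest` and `layout_sizes` are hypothetical helpers made up for the illustration.

```python
import json


def nest(depth):
    """Build a hypothetical test object nested `depth` levels deep."""
    obj = {"x": 5}
    for _ in range(depth):
        obj = {"a": [obj], "b": 2}
    return obj


def layout_sizes(obj):
    """Character counts for minified and 4-space-indented JSON."""
    minified = json.dumps(obj, separators=(",", ":"))
    indented = json.dumps(obj, indent=4)
    return {
        "minified": len(minified),
        "indented": len(indented),
        "overhead": len(indented) - len(minified),
    }


# Characters are a crude proxy for tokens; swap in a real tokenizer to check.
for depth in (1, 3, 6):
    print(depth, layout_sizes(nest(depth)))
```

The overhead grows with nesting depth because every extra level adds lines that carry deeper indentation.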


3 Likes

Just so you know, using JSON the standard way as you are, you don’t need space indenting because the AI can already detect a hierarchy. At this point, the space indenting is for your benefit, not the AI’s. As I initially said, in terms of ‘nuanced philosophical considerations,’ space indenting can help the AI see the hierarchy, but only if you don’t establish a hierarchy within the JSON itself. I’ve seen some who don’t do that, which is why I phrased it as I did. To clarify my point: if you have a hierarchy within the JSON format, then you don’t need space indenting. Otherwise, lacking a clear hierarchy within the JSON, you do need space indenting.

My testing was done using Python with the ‘GPT2Tokenizer,’ which, as I understand, is an open-source version that underpins many GPT models. This tokenizer is quite different from those used in GPT-3.5 and GPT-4. Different tokenizers handle patterns in distinct ways. When I used the GPT2Tokenizer in Python, my focus was on accurately identifying all tokens. If I had taken JSON formatting into account, it might have influenced how the tokenizer counted tokens, particularly in relation to JSON patternization.

Looking at your example, the increase in tokens is not drastic but rather predictable within the context of the JSON format. You mentioned that the number of tokens ‘can vary greatly when indenting JSON,’ but that is not entirely accurate. I would encourage you to reanalyze your example: 242 characters resulting in 74 tokens is a 3:1 character-to-token ratio. This is not significantly different from your previous test, which showed a 2:1 ratio (38 tokens out of 77 characters). The shift from a 2:1 to a 3:1 ratio due to space indenting is relatively minor and certainly not a wild variance. Let’s dive deeper into this.

Notice in your visual example that at the beginning of each line, there is a cluster of spaces, yet they are represented as a single color block. This indicates that no matter how many spaces are used, they are treated as part of the same tokenization. Comparing the two token counts from your examples—74 tokens versus 38 tokens—reveals a difference of 36 tokens. Given that there appear to be about 19 lines of space indenting, this suggests roughly 2 tokens per cluster of spaces.

What this indicates is that once the JSON pattern is recognized, the amount of spacing used for indenting at the beginning of lines becomes irrelevant; the token count remains consistent, not varying greatly as suggested. In fact, it doesn’t matter if your cluster block had two spaces or eight—it was still counted the same way as all the others. If you were to go back and reduce the spacing by half, you would likely end up with the same or a very similar number of tokens.
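One way to sanity-check the premise here without a tokenizer is to count the indentation runs themselves. The sketch below uses only the standard library (`indent_runs` is a hypothetical helper) and shows that the number of leading-whitespace runs depends on the number of lines, not on the indent width. Whether a given tokenizer really collapses each run into a fixed number of tokens is an assumption that would need checking against the tokenizer for your model.

```python
import json


def indent_runs(text):
    """Return (number of indented lines, total leading spaces)."""
    runs = 0
    spaces = 0
    for line in text.splitlines():
        lead = len(line) - len(line.lstrip(" "))
        if lead:
            runs += 1
            spaces += lead
    return runs, spaces


obj = {"outer": {"sup": 1, "inner": [{"a": [{"x": 5}], "b": 2}]}}

# Same object, three indent widths: the run count stays fixed,
# only the number of space characters changes.
for width in (2, 4, 8):
    runs, spaces = indent_runs(json.dumps(obj, indent=width))
    print(f"indent={width}: {runs} runs, {spaces} leading spaces")
```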

I’m not trying to split hairs or nitpick semantics here; it’s important that we remain congruent in understanding how LLMs like ChatGPT function. The only way your sentiment would be correct is if each space itself counted as a token, which would indeed make the complexity of your structure affect the token cost. However, as your own example demonstrates, the number of spaces doesn’t significantly impact the token count, so the complexity of nested information is not directly proportional to token costs.

That said, returning to my original point, if you are properly using hierarchy within JSON, then space indenting becomes superfluous. If you’re concerned about token usage, I would recommend avoiding space indenting altogether.

1 Like

What exactly do you mean by “properly using hierarchy within JSON”?

The standard way of using JSON is using a hierarchy.

Answer this question: What is the point of ‘{’ within JSON formatting?

Thanks all - so results so far seem to indicate

  • We don’t know whether LLMs will deal with minified data any better or worse than indented.
  • Indented does indeed consume more tokens.

Open question: What is the most efficient way to convey structured data to an LLM? I could picture a few

Minified
{"outer": {"inner": [{"a": [{"x": 5}], "b": 2}, {"a": [{"x": 9}], "b": 4}]}}

YAML-style: Pure indentation, no brackets
outer: Map
 inner: List
   - a: List
      1: ...

Redundant... Both indentation and brackets
{
    "outer": {
        "inner": [
            ...
        ]
    }
}


Deep-key value pairs: 
outer.inner.0.a.0.x: 5
outer.inner.0.b: 2
outer.inner.1.a.0.x: 9
outer.inner.1.b: 4

... Some hybrid?
outer.inner: 
   - a: [{x: 5}], b: 2
   - a: [{x: 9}], b: 4

I guess the competing objectives are
A) token-length - Number of tokens needed to describe structure (minimized by minified form)
B) average-context-needed-to-locate-object-in-hierarchy - How big a window of text you need to look at to determine where any element is in the hierarchy. Minimized by deep-key-value form.

And I guess the open question is - does metric B actually matter at all in practice?
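For anyone who wants to try the deep-key-value form, here is a minimal sketch of a flattener. `flatten` is a hypothetical helper, and it assumes keys contain no dots and that empty lists or objects can be dropped (they produce no output lines).

```python
import json


def flatten(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested JSON-like value."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        yield prefix, value
        return
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        yield from flatten(child, path)


data = json.loads(
    '{"outer": {"inner": [{"a": [{"x": 5}], "b": 2}, {"a": [{"x": 9}], "b": 4}]}}'
)
for path, leaf in flatten(data):
    print(f"{path}: {leaf}")
# outer.inner.0.a.0.x: 5
# outer.inner.0.b: 2
# outer.inner.1.a.0.x: 9
# outer.inner.1.b: 4
```

Each line carries its full path, so locating a value needs no surrounding context, at the cost of repeating the prefix on every line.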

2 Likes

Yes, we do know this. Whether you use minified or indented JSON, it will understand it just fine. However, if you use proper hierarchy within JSON (which is the standard, though not everyone follows it), then you don’t need indenting. The AI will essentially ignore the indenting, because you establish the hierarchy within the block.

In fact, the fewer tokens you use, the more efficient the AI becomes. The more tokens you use, the more the AI has to analyze, and the more likely it is to skim the information rather than look at all of it.

There is a flaw in your question though. “What is the most efficient way to convey structured data to an LLM?” The most efficient way to convey structured data to an LLM is dependent on your intentions. Different structures can be the most efficient depending on what your needs are.

Each of your examples has different strengths and weaknesses, so the most efficient way to convey structured data depends on your overall intent. But be aware that when you use hierarchy, the AI is going to pick up additional insight that you may not be aware of.

Arguably the model would have been exposed to more properly formatted JSON in its training, as it’s based on real-life examples.

In fact, I would go as far to say that providing a large minified JSON example could have adverse effects. I have noticed that people who post these types of objects get less help (because they’re a pain to read).

So maybe a small object of a couple properties: sure.

For a large object, however, JSON is usually indented both in files and on the internet, so I would imagine that the model has much more training data in that form than for an unorthodox large yet minified object.

That being said, I think that evals would be a perfect fit for this situation.
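As a starting point for such an eval, here is a rough sketch of a case generator. It builds random nested objects, picks a random leaf, and emits the same object in both layouts together with the expected answer. All names (`random_object`, `random_path`, `build_cases`) and the shape of the test data are assumptions for illustration; the model call and scoring are left out.

```python
import json
import random


def random_object(rng, depth):
    """Build a small random nested object (hypothetical test data)."""
    if depth == 0:
        return rng.randint(0, 99)
    if rng.random() < 0.5:
        return [random_object(rng, depth - 1) for _ in range(2)]
    return {f"k{i}": random_object(rng, depth - 1) for i in range(2)}


def random_path(rng, obj):
    """Walk from the root to a random leaf; return (path, leaf value)."""
    path = []
    while isinstance(obj, (dict, list)):
        if isinstance(obj, dict):
            key = rng.choice(list(obj))
        else:
            key = rng.randrange(len(obj))
        path.append(key)
        obj = obj[key]
    return path, obj


def build_cases(n, depth=4, seed=0):
    """Yield eval cases holding both layouts and the expected answer."""
    rng = random.Random(seed)
    for _ in range(n):
        obj = random_object(rng, depth)
        path, expected = random_path(rng, obj)
        yield {
            "path": path,
            "expected": expected,
            "minified": json.dumps(obj, separators=(",", ":")),
            "indented": json.dumps(obj, indent=4),
        }


cases = list(build_cases(3))
print(cases[0]["path"], cases[0]["expected"])
```

Because both layouts come from the same object, any difference in accuracy between them can be attributed to formatting rather than content.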

1 Like

So here’s a quick mashup I did to try and help figure this out.

[Image: comparison with an easy lookup]

[Image: comparison with a nested lookup]


From this (very limited) test I would say
Does GPT understand indented JSON better than minified?: Yes
Is it by a negligible amount?: Questionable.
Could this code be improved: Yes, 100% :joy:

May be of interest to try out different objects and styles.

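import math

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The same object in two layouts: minified and 4-space indented.
minified = """{"outer":{"sup":1,"inner":[{"a":[{"x":5}],"b":2},{"a":[{"x":9}],"b":4}]}}"""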
full = """{
    "outer": {
        "sup": 1,
        "inner": [
            {
                "a": [
                    {
                        "x": 5
                    }
                ],
                "b": 2
            },
            {
                "a": [
                    {
                        "x": 9
                    }
                ],
                "b": 4
            }
        ]
    }
}"""

master = []  # one list of [answer, confidence %] pairs per layout

for obj in [minified, full]:
    results = []
    messages = [
        {"role": "system", "content": "You are given a JSON object by the user, along with a value, and must return the value of the key in the object. You must ONLY return the value"},
        {"role": "user", "content": f"### Object\n{obj}\n\n### Key\nouter['inner'][1]['a'][0]['x']"},
    ]

    # Sample 10 single-token completions, with logprobs, at temperature 0.
    choices = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=1,
        temperature=0,
        n=10,
        logprobs=True,
    ).choices
    for choice in choices:
        # Convert the first token's logprob into a percentage.
        results.append(
            [
                choice.message.content,
                round(math.exp(choice.logprobs.content[0].logprob) * 100),
            ]
        )
    master.append(results)
4 Likes

This is a long thread and someone may have already suggested this, but I have a function which serializes objects to both JSON and YAML, counts the number of tokens in each, and embeds whichever one is shorter in the prompt.

From my experience the model doesn’t generally care what the structure of the information is. The model is recognizing patterns in the data much like we do, and just like we can ignore errors in the formatting of an input, so can the model.

My goal is generally to use as few tokens as possible in the prompt, so switching to YAML when it’s shorter achieves that.
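A minimal sketch of that idea, not the actual function described above: `candidates` and `shortest` are hypothetical names, YAML is only offered when the third-party PyYAML package is installed, and `count_tokens` defaults to `len` (characters), which is a crude proxy. Pass in a real tokenizer-based counter for the model you are targeting.

```python
import json

try:
    import yaml  # PyYAML, an optional third-party dependency
except ImportError:
    yaml = None


def candidates(obj):
    """Serialize obj in each available format."""
    out = {"json": json.dumps(obj, separators=(",", ":"))}
    if yaml is not None:
        out["yaml"] = yaml.safe_dump(obj, sort_keys=False)
    return out


def shortest(obj, count_tokens=len):
    """Return (format_name, text) for the serialization with the lowest count."""
    options = candidates(obj)
    name = min(options, key=lambda k: count_tokens(options[k]))
    return name, options[name]


name, text = shortest({"outer": {"inner": [{"a": 1, "b": 2}]}})
print(name)
print(text)
```

Keeping the counter as a parameter means the same selection logic works with whatever tokenizer matches the model in use.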

The one caveat I’ll offer is if you’re asking for JSON back you should show the model JSON in. Don’t worry about indentation though. The model doesn’t need it.

1 Like

I personally prefer YAML as it typically tokenizes better than even compact JSON and remains human-readable.

I wish OpenAI supported it as a structured output format.

I like YAML too, but as an output format it’s super sensitive to things like indentation, which these models struggle with. The way that these structured outputs work at the model level means you could enforce an output like YAML, but keep in mind that nothing is free. They actually have to use fine-tuning to make it easier for the model to return structured JSON, so if they also had to tune it to return YAML, it would not only take twice as much fine-tuning, but they would also be putting the model in a position where some other feature in its latent space needs to get dropped to make room for these new features.

2 Likes

Mmmmm… I’m not sold on that theory. I don’t know that it’s a zero-sum game there because, to me, that would imply that the models are fully-saturated meaning all the weights are perfectly efficient and there are no redundancies.

The fact distillation works as well as it does—models can often be reduced in size by upwards of an order of magnitude while maintaining 90%–95% of the parent model performance—leads me to believe there is a lot of redundancy in the models.

You are absolutely correct about requiring more training investment, though. I get why they went with JSON: it’s a much better format if you intend on doing further processing of the data programmatically.

I think at some point there will need to be a new structured data format designed from the ground up for transformer models. Something which is hyper-efficient from a tokenization standpoint but also doesn’t strain the attention mechanism.

1 Like

To be clear, I’m a huge fan of structured outputs even though I try to avoid them as much as I can… I lean towards algorithms and prompts that are robust to the model returning poorly structured responses. We work with a lot of models (gpt, claude, gemini, llama, mixtral, etc.) of varying sizes from small to large. These models are all over the map reliability-wise, and we make thousands of calls to them to do what we have to do. I basically work under the assumption that one in every 10 to 100 responses I get back will be wrong in some way, so I design everything with that in mind. I do a lot of post-processing of responses, but I generally just try to design algorithms that say “do your best”.

I actually really like asking for markdown back… The models have seen a ton of it, so they’re good at generating it. It’s semi-structured, so you can actually take a crack at parsing it, and in the worst case a value will just slip over into another block of text, but you won’t lose information.

2 Likes

:rofl:

This x 100!

I have a long-running refrain, “don’t fight the model,” which I repeat ad nauseam whenever I see a topic where someone is asking “how can I make the model always do…?” with respect to some type of output format or structure.

You can’t! Don’t try! Let the model do what it’s going to do, and make like a bad film director and resolve to fix it in post.

The number of people completely willing to bang their heads against the model for hours, days, or weeks to try to beat it into submission but who won’t ask ChatGPT to write a little regex to edit the output to conform to their needs is maddening!
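As one small illustration of fixing it in post, the sketch below pulls the first JSON value out of a chatty reply. Note that it scans with the standard library’s `json.JSONDecoder.raw_decode` rather than a regex, which is a swapped-in technique, and `extract_json` is a hypothetical helper name.

```python
import json


def extract_json(text):
    """Return the first JSON object or array found in text, or None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _ = decoder.raw_decode(text, i)
            except json.JSONDecodeError:
                continue  # not valid JSON from here; keep scanning
            return value
    return None


reply = 'Sure! Here is the result: {"x": 9, "ok": true} Anything else?'
print(extract_json(reply))  # {'x': 9, 'ok': True}
```

Returning `None` instead of raising lets the caller decide whether to retry, fall back, or skip the response.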

I nominate you to give the TEDxOpenAI-Developer-Forum talk on the subject, because people ain’t listening to me! :rofl:

3 Likes