Does ChatGPT understand indented JSON better than minified?

Yeah… Sometimes I have to laugh at the mistakes the model makes. It’s like what??? GPT-4o is REALLY GOOD but it’s stochastic, and even at a temperature of 0.0 its output will vary and it will slip up from time to time. I’d say I’m pretty good at prompting, and there’s no prompt you can write that’s going to fix that.

Just because the model makes mistakes doesn’t mean it’s not useful. There’s definitely an art to prompting these models, but there’s just as much of an art to how you shape the problems you give them. As programmers we’re not used to programming against a non-deterministic system, and we have to approach things differently with these models.

Facts.

These models do so much amazing work. These days in AI development I spend more time building constraints, checks, feedback loops, and what really feels like a programmatic dance with the model, all to ensure some level of control, quality, and consistency in the output.
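To make the "checks and feedback loops" idea concrete, here is a minimal sketch of a validate-and-retry loop. Everything in it is an assumption for illustration: `call_model` is a hypothetical callable that takes a prompt string and returns the model's text reply, and the check (valid JSON object with certain keys) is just one example of a constraint you might enforce.

```python
import json
from typing import Callable


def ask_with_checks(
    call_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the reply, and feed the error back on failure."""
    message = prompt
    for _ in range(max_attempts):
        reply = call_model(message)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError as exc:
            problem = f"Reply was not valid JSON ({exc.msg})"
        else:
            if not isinstance(data, dict):
                problem = "Reply must be a JSON object"
            else:
                missing = required_keys - set(data)
                if not missing:
                    return data  # all checks passed
                problem = f"Missing keys: {sorted(missing)}"
        # Feedback loop: tell the model what was wrong and ask again.
        message = f"{prompt}\n\nYour previous reply was rejected. {problem}. Try again."
    raise ValueError(f"No valid reply after {max_attempts} attempts")
```

The loop does not make the model deterministic; it only bounds how much bad output can reach the rest of the system.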

I also like markdown. It gives the model much more liberty and chance for expression. It’s my perfect blend of structured & unstructured semantics.

Yeah… these days I try to make just a giant feedback loop of markdown. We have a whole document ingestion pipeline where we convert all of the documents we’re going to feed to the model to markdown. So where possible we pass markdown into the model and ask for markdown out.

We balance this out by passing all of our instructions in using XML tags. That way there’s a clearer separation between the instructions we’re passing in and the data we want the model to reason over.
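A rough sketch of what that separation could look like when assembling a prompt. The tag names (`instructions`, `documents`, `document`) and the `build_prompt` helper are made up for this example, not anything the API requires.

```python
def build_prompt(instructions: str, documents: list[str]) -> str:
    """Wrap instructions and markdown documents in separate XML-style tags."""
    doc_blocks = "\n".join(
        f'<document index="{i}">\n{doc.strip()}\n</document>'
        for i, doc in enumerate(documents, start=1)
    )
    return (
        f"<instructions>\n{instructions.strip()}\n</instructions>\n"
        f"<documents>\n{doc_blocks}\n</documents>"
    )


prompt = build_prompt(
    "Answer using only the documents. Reply in markdown.",
    ["# The Fox\n\nThe quick brown fox jumps over the lazy dog."],
)
print(prompt)
```

The markdown stays untouched inside the tags, so the model sees the data in the format it was ingested in and the instructions clearly fenced off from it.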

Stumbled over this discussion while deciding whether to minify my structured output schema. It makes no difference to my input or output token counts. It looks like schemas get minified anyway for structured outputs, which makes sense, since the hierarchical information is preserved either way.

I disagree. The tokenizer does not break strings apart by words; it breaks them apart into tokens, and what exactly counts as a token depends on the tokenizer itself! Generally speaking, tokenizers for LLMs split strings into subword pieces learned from how often character sequences occur in their training data (byte-pair encoding, for OpenAI models), and those pieces often, but not always, line up with meaningful word parts. I’ll give an example; this is probably not exactly how ChatGPT tokenizes it, but it might be.

“He jumped over the crosswalk.” may be tokenized such that there is one token for each of “he”, “jump”, “ed”, “over”, “the”, “cross”, “walk”, “.”

To know whether extra spaces increase the number of tokens, you can tokenize your string with something like JTokkit (knuddelsgmbh/jtokkit on GitHub), a Java tokenizer library designed for use with OpenAI models.

I do not know how to answer the question of whether spaces help ChatGPT or not. I’m wondering about this myself as well as the way to encode something into JSON that best helps ChatGPT understand the encoded data.

Perhaps this gpt-4.1 prompting guide can help a bit.

Guidance specifically for adding a large number of documents or files to input context:

  • XML performed well in our long context testing.
    • Example: <doc id='1' title='The Fox'>The quick brown fox jumps over the lazy dog</doc>
  • This format, proposed by Lee et al. (ref), also performed well in our long context testing.
    • Example: ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog
  • JSON performed particularly poorly.
    • Example: [{'id': 1, 'title': 'The Fox', 'content': 'The quick brown fox jumped over the lazy dog'}]
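For anyone who wants to try the three formats from the guide side by side, here is a small sketch that renders the same list of documents each way. The function names and the `id`/`title`/`content` dict layout are my own assumptions based on the examples above, not part of the guide.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def as_xml(docs: list[dict]) -> str:
    """One <doc> element per document, with id and title as attributes."""
    return "\n".join(
        f"<doc id={quoteattr(str(d['id']))} title={quoteattr(d['title'])}>"
        f"{escape(d['content'])}</doc>"
        for d in docs
    )


def as_pipe_delimited(docs: list[dict]) -> str:
    """The 'ID: ... | TITLE: ... | CONTENT: ...' format, one line per document."""
    return "\n".join(
        f"ID: {d['id']} | TITLE: {d['title']} | CONTENT: {d['content']}"
        for d in docs
    )


def as_json(docs: list[dict]) -> str:
    """Plain JSON array, the format the guide reports as performing poorly."""
    return json.dumps(docs)


docs = [{"id": 1, "title": "The Fox",
         "content": "The quick brown fox jumps over the lazy dog"}]
print(as_xml(docs))
print(as_pipe_delimited(docs))
print(as_json(docs))
```

Note that `quoteattr` emits double-quoted attributes by default, while the guide's example uses single quotes; I would not expect that to matter, but I have not tested it.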

If programmatic access is not needed for tokenizing, then this interactive, web-based tokenizer provided by OpenAI can also be used. For Python, the tiktoken package can be used, and for JavaScript this one.
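Coming back to the original question of indented versus minified JSON, here is a minimal sketch for measuring the difference yourself. The `compare_json_layouts` helper is hypothetical; it counts characters by default so it runs with only the standard library, and the commented lines show how I would swap in a tiktoken token count, assuming tiktoken is installed and that `o200k_base` is the right encoding for your model (it is the one used by GPT-4o).

```python
import json
from typing import Callable


def compare_json_layouts(obj, count: Callable[[str], int] = len) -> dict[str, int]:
    """Measure the same object serialized minified and indented."""
    minified = json.dumps(obj, separators=(",", ":"))
    indented = json.dumps(obj, indent=2)
    return {"minified": count(minified), "indented": count(indented)}


data = {"id": 1, "title": "The Fox", "tags": ["quick", "brown"]}
print(compare_json_layouts(data))  # character counts by default

# To count tokens instead (requires `pip install tiktoken`):
# import tiktoken
# enc = tiktoken.get_encoding("o200k_base")
# print(compare_json_layouts(data, lambda s: len(enc.encode(s))))
```

This only tells you what each layout costs in tokens. Whether the model understands one layout better than the other is a separate question that a token count cannot answer.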