Does ChatGPT understand indented JSON better than minified?

stevenic · August 21, 2024, 3:19pm

Yeah… Sometimes I have to laugh at the mistakes the model makes. It’s like, what??? GPT-4o is REALLY GOOD, but it’s stochastic, and even with a temperature of 0.0 it’s going to show some variance and slip up from time to time. I would say that I’m pretty good at prompting, and there’s no prompt you can write that’s going to fix that.

Just because the model makes mistakes doesn’t mean it’s not useful. There’s definitely an art to prompting these models but there’s just as much an art around how you shape the problems you give them. As programmers we’re not used to programming for a system that’s non-deterministic and you have to approach things differently with these models.

anon10827405 · August 21, 2024, 3:49pm

Facts.

These models do so much amazing work. These days in AI development I spend more time building constraints, checks, feedback loops, and what really feels like a programmatic dance with the model to ensure some level of control, quality & consistency in the output.

I also like markdown. It gives the model much more liberty and chance for expression. It’s my perfect blend of structured & unstructured semantics.
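To make the "checks and feedback loops" idea concrete, here is a minimal sketch of one such loop: parse the reply, check it, and feed the failure back into the next attempt. Everything in it is an assumption for illustration: `ask` is a hypothetical callable standing in for whatever client you use, and `fake_model` just replays canned replies so the example runs without an API key.

```python
import json
from typing import Callable, Optional, Set


def ask_with_retries(
    ask: Callable[[str], str],   # hypothetical: takes a prompt, returns the model's text
    prompt: str,
    required_keys: Set[str],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Ask for a JSON object, validate it, and retry with feedback on failure."""
    feedback = ""
    for _ in range(max_attempts):
        reply = ask(prompt + feedback)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError as err:
            # Check 1: the reply must be valid JSON at all.
            feedback = f"\n\nYour last reply was not valid JSON ({err.msg}). Reply with JSON only."
            continue
        if not isinstance(parsed, dict):
            # Check 2: the reply must be a single JSON object.
            feedback = "\n\nReply with a single JSON object."
            continue
        missing = required_keys - set(parsed)
        if not missing:
            return parsed
        # Check 3: all required keys must be present.
        feedback = f"\n\nYour last reply was missing keys: {sorted(missing)}."
    return None  # caller decides what to do when the model never complies


def fake_model() -> Callable[[str], str]:
    # Stand-in for a real model: fails twice, then returns a valid reply.
    replies = iter(["not json", '{"title": "The Fox"}', '{"id": 1, "title": "The Fox"}'])
    return lambda prompt: next(replies)


result = ask_with_retries(fake_model(), "Return the document as JSON.", {"id", "title"})
print(result)
```

The point is less the specific checks than the shape: the model's output is treated as untrusted input, and every failure becomes part of the next prompt.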

stevenic · August 21, 2024, 4:03pm

Yeah… these days I try to make just a giant feedback loop of markdown. We have a whole document ingestion pipeline where we convert all of the documents we’re going to feed to the model to markdown. So where possible we pass markdown into the model and ask for markdown out.

We balance this out by passing all of our instructions in using XML tags. That way there’s a clearer separation between the instructions we’re passing in and the data we want to reason over.
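A rough sketch of what that separation might look like when assembling a prompt. The tag names `<instructions>` and `<document>` and the helper `build_prompt` are assumptions made up for this example, not names taken from our pipeline.

```python
def build_prompt(instructions: str, markdown_doc: str) -> str:
    """Wrap instructions and markdown data in separate XML-style tags."""
    return (
        "<instructions>\n"
        f"{instructions.strip()}\n"
        "</instructions>\n\n"
        "<document>\n"
        f"{markdown_doc.strip()}\n"
        "</document>"
    )


prompt = build_prompt(
    "Summarize the document below. Reply in markdown.",
    "# The Fox\n\nThe quick brown fox jumps over the lazy dog.",
)
print(prompt)
```

The markdown stays untouched inside its own tag, so nothing in the data can be mistaken for part of the instruction block.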

morietschel · November 3, 2024, 6:49am

Stumbled over this discussion when deciding whether to minify my structured output schema. It makes no difference to my input or output token counts. It looks like schemas get minified anyway for structured outputs, which makes sense since the hierarchical information is preserved.
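The "hierarchical information is preserved" part is easy to check locally. A small sketch using only the standard library, with a made-up schema (the field names are just for illustration):

```python
import json

# Hypothetical schema, used only to illustrate the round trip.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "title": {"type": "string"},
    },
    "required": ["id", "title"],
}

indented = json.dumps(schema, indent=2)
minified = json.dumps(schema, separators=(",", ":"))

# Both strings parse back to the same structure; only whitespace differs.
same_structure = json.loads(indented) == json.loads(minified) == schema
print(same_structure, len(indented), len(minified))
```

This only shows that the two forms carry the same data. It says nothing about how the API handles the schema internally.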

jzwolak · July 2, 2025, 7:58pm

I disagree. The tokenizer does not break strings apart into words. It breaks them apart into tokens, and what exactly counts as a token depends on the tokenizer itself! Generally speaking, tokenizers for LLMs split strings into subword pieces learned from data, which often (but not always) line up with meaningful units. Here’s an example. This is probably not exactly how ChatGPT tokenizes it, but it might be.

“He jumped over the crosswalk.” may be tokenized such that there is one token for each of “he”, “jump”, “ed”, “over”, “the”, “cross”, “walk”, “.”

To find out whether extra spaces increase the number of tokens, you can tokenize your string with something like JTokkit (knuddelsgmbh/jtokkit on GitHub), a Java tokenizer library designed for use with OpenAI models.

I do not know how to answer the question of whether spaces help ChatGPT or not. I’m wondering about this myself as well as the way to encode something into JSON that best helps ChatGPT understand the encoded data.

aprendendo.next · July 2, 2025, 8:05pm

Perhaps this gpt-4.1 prompting guide can help a bit.

Guidance specifically for adding a large number of documents or files to input context:

XML performed well in our long context testing.

Example: <doc id='1' title='The Fox'>The quick brown fox jumps over the lazy dog</doc>

This format, proposed by Lee et al. (ref), also performed well in our long context testing.

Example: ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog

JSON performed particularly poorly.

Example: [{'id': 1, 'title': 'The Fox', 'content': 'The quick brown fox jumped over the lazy dog'}]
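For anyone who wants to try the three layouts side by side, here is a minimal sketch that renders the same document in each one. The helper names (`to_xml`, `to_pipe`, `to_json`) are made up for this example, and the escaping is my own addition, not something the guide specifies.

```python
import json
from xml.sax.saxutils import escape


def to_xml(doc: dict) -> str:
    # Escape &, <, > everywhere, plus single quotes inside the attribute value.
    title = escape(str(doc["title"]), {"'": "&apos;"})
    content = escape(str(doc["content"]))
    return f"<doc id='{doc['id']}' title='{title}'>{content}</doc>"


def to_pipe(doc: dict) -> str:
    # The "ID: ... | TITLE: ... | CONTENT: ..." layout from the guide's example.
    return f"ID: {doc['id']} | TITLE: {doc['title']} | CONTENT: {doc['content']}"


def to_json(docs: list) -> str:
    return json.dumps(docs)


doc = {"id": 1, "title": "The Fox", "content": "The quick brown fox jumps over the lazy dog"}
print(to_xml(doc))
print(to_pipe(doc))
print(to_json([doc]))
```

Which layout works best for a given model and task is something to measure rather than assume. The guide's results are for its own long-context tests.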

jai · July 3, 2025, 9:37pm

If programmatic access is not needed for tokenizing, then this interactive, web-based tokenizer provided by OpenAI can also be used. For Python, the tiktoken package can be used, and for JavaScript this one.
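For the original question about indented versus minified JSON, a sketch along these lines can compare the two layouts under any tokenizer. `crude_pieces` is a made-up stand-in, NOT a real tokenizer, included only so the example runs with the standard library; `compare_json_layouts` is likewise a hypothetical helper name.

```python
import json
import re
from typing import Callable, Dict, Sequence


def crude_pieces(text: str) -> Sequence[str]:
    # Stand-in ONLY: splits into whitespace runs, word runs, and single
    # punctuation characters. Real tokenizers behave differently.
    return re.findall(r"\s+|\w+|[^\w\s]", text)


def compare_json_layouts(data, encode: Callable[[str], Sequence]) -> Dict[str, int]:
    """Count the pieces produced by `encode` for minified and indented JSON."""
    minified = json.dumps(data, separators=(",", ":"))
    indented = json.dumps(data, indent=2)
    return {
        "minified": len(encode(minified)),
        "indented": len(encode(indented)),
    }


docs = [{"id": 1, "title": "The Fox", "content": "The quick brown fox jumps over the lazy dog"}]
counts = compare_json_layouts(docs, crude_pieces)
print(counts)
```

With tiktoken installed, the stand-in can be swapped for a real encoder, for example `compare_json_layouts(docs, tiktoken.get_encoding("o200k_base").encode)`. The encoding name here is an assumption about which model you are targeting, so check it against the model you actually use.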
