Just so you know: when you use JSON the standard way, as you are, you don't need space indenting, because the model can already detect the hierarchy from the JSON structure itself. At that point, indenting is for your benefit, not the AI's. As I said initially, among the 'nuanced philosophical considerations,' space indenting can help the model see a hierarchy, but only when no hierarchy is established within the JSON itself. I've seen people skip that step, which is why I phrased it the way I did. To be clear: if your JSON already encodes a hierarchy, you don't need space indenting; if it doesn't, you do.
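To illustrate, here's a minimal sketch using Python's standard json module (the payload is hypothetical). Indentation is purely cosmetic: with or without it, the parsed hierarchy is identical.

```python
import json

# Hypothetical payload; the nesting itself defines the hierarchy.
data = {"user": {"name": "Ada", "roles": ["admin", "editor"]}}

compact = json.dumps(data, separators=(",", ":"))   # no indentation
indented = json.dumps(data, indent=4)               # pretty-printed

# Both strings parse back to the exact same structure.
assert json.loads(compact) == json.loads(indented)
print(len(compact), len(indented))  # compact is shorter, structure identical
```

The indentation only changes how the string looks to a human reader; any JSON parser, and by extension a model trained on JSON, recovers the same tree either way.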
My testing was done in Python with GPT2Tokenizer, which, as I understand it, is the open-source tokenizer underlying many GPT models. It differs from the tokenizers used by GPT-3.5 and GPT-4, and different tokenizers handle patterns in distinct ways. When I ran GPT2Tokenizer in Python, my focus was on identifying all tokens accurately. Had I taken JSON formatting into account, it might have influenced how the tokenizer counted tokens, particularly in relation to how JSON's structure gets patternized.
Looking at your example, the increase in tokens is not drastic; it's predictable within the context of the JSON format. You said the token count 'can vary greatly when indenting JSON,' but that's not quite accurate. I'd encourage you to reanalyze your example: 242 characters yielding 74 tokens is roughly a 3:1 character-to-token ratio. That's not far from your previous test, which showed roughly 2:1 (77 characters yielding 38 tokens). Moving from 2:1 to 3:1 because of space indenting is a relatively minor shift, certainly not a wild variance. Let's dig deeper into this.
Notice in your visual example that each line begins with a cluster of spaces, yet each cluster is rendered as a single color block. That indicates that no matter how many spaces are used, each run is treated as one unit of tokenization. Comparing the two token counts from your examples, 74 tokens versus 38 tokens, gives a difference of 36 tokens. With what appear to be about 19 indented lines, that works out to roughly 2 tokens per cluster of spaces.
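Running the arithmetic from your two tests as a quick sketch (figures taken from the examples above; the 19-line count is my estimate from the screenshot):

```python
# Figures from the two examples above.
chars_indented, tokens_indented = 242, 74
chars_flat, tokens_flat = 77, 38

print(round(chars_indented / tokens_indented, 2))  # ~3.27 chars per token
print(round(chars_flat / tokens_flat, 2))          # ~2.03 chars per token

# Extra tokens attributable to indentation, spread over ~19 indented lines.
extra_tokens = tokens_indented - tokens_flat       # 36
print(round(extra_tokens / 19, 1))                 # ~1.9 tokens per space cluster
```

So each cluster of leading spaces costs on the order of 2 tokens, regardless of how wide it is.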
What this indicates is that once the JSON pattern is recognized, the amount of spacing used to indent each line becomes largely irrelevant: the token count stays consistent rather than 'varying greatly' as suggested. It doesn't matter whether a cluster held two spaces or eight; it was counted the same way as all the others. If you went back and halved the indentation, you would likely end up with the same, or a very similar, number of tokens.
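Here's a sketch of why halving the indentation shouldn't move the needle much, using the standard json and re modules (the payload is hypothetical, and the premise that each run of leading spaces costs roughly the same number of tokens is the observation from your color-block example, not something this code measures directly):

```python
import json
import re

data = {"a": {"b": {"c": [1, 2, 3]}}}  # hypothetical nested payload

deep = json.dumps(data, indent=8)
shallow = json.dumps(data, indent=2)

# Count runs of leading spaces. Under the observation above, each run costs
# roughly the same number of tokens no matter how wide it is, so the token
# overhead tracks the NUMBER of runs, not the indent width.
runs_deep = len(re.findall(r"^ +", deep, flags=re.M))
runs_shallow = len(re.findall(r"^ +", shallow, flags=re.M))

assert runs_deep == runs_shallow  # same line structure, same number of runs
```

The number of indented lines is fixed by the JSON's structure, so deep and shallow indentation produce the same count of space clusters and, by your own numbers, a very similar token total.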
I'm not trying to split hairs or nitpick semantics here; it's important that we stay aligned on how LLMs like ChatGPT function. Your sentiment would only be correct if each individual space counted as its own token, in which case the complexity of your structure would indeed drive up the token cost. But as your own example demonstrates, the number of spaces doesn't significantly affect the token count, so the complexity of nested information is not directly proportional to token cost.
That said, returning to my original point, if you are properly using hierarchy within JSON, then space indenting becomes superfluous. If you’re concerned about token usage, I would recommend avoiding space indenting altogether.
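A minimal sketch of that recommendation, again with the standard json module and a hypothetical payload: fewer characters doesn't map one-to-one to fewer tokens, since that depends on the tokenizer, but stripping indentation and separator padding never makes the output longer.

```python
import json

data = {"items": [{"id": i, "active": True} for i in range(3)]}

pretty = json.dumps(data, indent=4)                 # human-friendly
compact = json.dumps(data, separators=(",", ":"))   # no indent, no padding spaces

# Same structure round-trips either way; compact is never longer.
assert json.loads(pretty) == json.loads(compact)
print(len(pretty), len(compact))
```

If tokens are the budget you're optimizing, compact output is the safe default; keep the pretty-printed version for yourself, not for the model.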