Does function calling output charge for white space?

Function calling returns JSON with spacing. I’m unsure whether that formatting is applied after the model runs, and so is not counted as tokens, or whether it is the model’s direct output, in which case the spacing is billed. If it is the latter, that’s unfortunate: each level of JSON nesting adds more indentation, so the formatting whitespace could add substantially to the token count.


Billing is done by token, so yes: the whitespace must also be tokenized.

As @anon22939549 has said, yes, but it is done intelligently. The tokenizer’s vocabulary was built from data that includes code, so long runs of spaces and/or tabs are encoded as a single token.


TL;DR: whitespace is compressed

The structure and white space of indentation and lines may look excessive if we evaluate it purely by the characters. However, because of the very nature of BPE tokenization, these long sequences of spaces are compressed to single tokens.

Let’s analyze a quickie from my shell:

      "message": {
        "role": "assistant",
        "content": null,
        "function_call": {
          "name": "temperature_classification",
          "arguments": "{\n  \"temperature\": 0.3\n}"
        }

The actual newlines the AI outputs are represented by \n.
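To make the \n point concrete, here is a small Python sketch (standard library only) that decodes the "arguments" string from the response above. The variable names are my own and not part of the API; the string literal is copied from the example.

```python
import json

# The "arguments" value from the response above, as the model wrote it.
# Once the outer response is decoded, the \n escapes are real newlines.
arguments = "{\n  \"temperature\": 0.3\n}"

# The generated text really does contain newline and space characters.
print(repr(arguments))
print(arguments.count("\n"), "newlines")

# Parsing discards the formatting; only the data survives.
parsed = json.loads(arguments)
print(parsed)

# Re-serialising without whitespace shows how many characters were formatting.
compact = json.dumps(parsed, separators=(",", ":"))
print(len(arguments), "chars as generated vs", len(compact), "chars compact")
```

Note that this compares characters, not tokens, so it overstates what the formatting costs in billing terms.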

What is the AI-generated language within?
“assistant”? No.
“content”? No.
“function_call”? No.
function name? Yes.
function arguments? Yes.
unseen overhead? about 3 tokens

Let’s examine the contents of “arguments”, which is AI-generated language passed through without alteration.
Here is an example with multiple parameters and a subarray:
[image: the example arguments JSON with each token highlighted in a different color]

The individual tokens are delineated by color. Newlines are not depicted. One or many newlines in sequence are all just one token.

The indentation adds just one token more, regardless of the length.

So in the example, with 13 newlines and 12 indents, whitespace accounts for 25 of the 135 tokens pictured.
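The same count can be roughed out for any "arguments" string. This is only a sketch of the rule of thumb described here (one token per run of newlines, one more per indent); it does not run a real tokenizer, and `whitespace_overhead` is a hypothetical helper name.

```python
import re

def whitespace_overhead(text: str) -> tuple[int, int]:
    """Count newline runs and indentation runs in pretty-printed JSON.

    Assumes the rule of thumb from this post: each run of newlines costs
    one token and each run of leading spaces/tabs costs one more.
    """
    newline_runs = len(re.findall(r"\n+", text))
    # An indent is a run of spaces or tabs directly after a newline.
    indent_runs = len(re.findall(r"\n[ \t]+", text))
    return newline_runs, indent_runs

# The figures quoted above: 13 newlines, 12 indents, 135 tokens in total.
newlines, indents, total = 13, 12, 135
extra = newlines + indents
print(f"{extra} of {total} tokens are whitespace ({extra / total:.1%})")
```

For an exact figure you would still want to run the real encoder over the string.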

The advantage is that this can be used as Python code directly, and it is easier to comprehend when passed to another AI.

For reference, 128 spaces is one token…


@Foxalabs too…

You both seem to be describing the tokenization of an input. I would assume generated whitespace would count as ordinary tokens unless there were multiple whitespace tokens in the dictionary.

@_j which tokenizer did you use for your visual representation? It looks like codex, which I believe has many different whitespace tokens. I’m not sure cl100k_base handles whitespace the same way (though I could be wrong).

https://tiktokenizer.vercel.app/

You can select the model and receive the correct encoder.

I’m describing the output you receive back from a function call. Since it is the actual AI-generated language, it is more closely related to consumed tokens.

The function input you provide to the API can have as much structure or whitespace as you want, as it is rewritten into a very brief compressed format by the API handler for the AI, which doesn’t use indentation.


No worries, I just dug through cl100k_base and saw how it compresses spaces and just how many tokens it has dedicated to whitespace. So yes, generated or not, it appears (nearly) any sequence of whitespace characters will be tokenized as a single token. I knew this was the case for p50k_base, but I’d not dug into cl100k_base yet to verify it was the same in that regard.
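For anyone who wants to repeat that check locally, something like the following should work. It is a sketch that assumes the `tiktoken` package is installed and can fetch its vocabulary files; `tokens_per_run` and `compare_encodings` are hypothetical helper names. I have left out expected outputs because they depend on the encoding.

```python
from typing import Callable, Dict, Iterable, List

def tokens_per_run(encode: Callable[[str], List[int]],
                   lengths: Iterable[int]) -> Dict[int, int]:
    """Map each run length to the number of tokens that many spaces use."""
    return {n: len(encode(" " * n)) for n in lengths}

def compare_encodings(names=("p50k_base", "cl100k_base"),
                      lengths=(1, 2, 4, 8, 16, 32, 64, 128)) -> None:
    """Print token counts for runs of spaces under each named encoding."""
    import tiktoken  # imported lazily so tokens_per_run works without it

    for name in names:
        enc = tiktoken.get_encoding(name)
        print(name, tokens_per_run(enc.encode, lengths))

# compare_encodings()  # uncomment to run; assumes tiktoken is installed
```

Passing the encoder in as a plain callable keeps the counting logic independent of any one tokenizer, so the same helper can be pointed at whichever encoding your model uses.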

:+1:

I got my learn-something-new-everyday-from-the-forum pretty late in the day today, so thank you for helping me squeak in under the wire.
