Rules of Thumb for number of source code characters to tokens

I’m trying to estimate how many tokens will be required for (summaries of) source code files, in a variety of programming languages.

Are there any rules of thumb for this mapping? For natural language, 1 token ≈ 4 characters appears to be the average. Is that reasonable for source code as well?

Hey there!

You’re using the API, right? Why not calculate the precise count with tiktoken?
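A minimal sketch of that suggestion: count exactly with tiktoken when it is installed, and fall back to the ~4 characters-per-token rule of thumb otherwise (the function name and fallback ratio are illustrative, not from any library):

```python
# Exact token count via tiktoken when available; otherwise a rough
# character-count estimate (~4 chars per token, the usual rule of thumb).
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    try:
        import tiktoken  # exact count for OpenAI encodings
        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except ImportError:
        return max(1, round(len(text) / 4))  # ballpark estimate

print(count_tokens("def add(a, b):\n    return a + b\n"))
```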

1 Like

I’m writing a lightweight JavaScript library that could be used with other LLMs as well. I don’t want to force a dependency on a specific tokenizer. The library user can specify a tokenizer, but I want to provide a simple default that will be in the right ballpark.
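That design, sketched here in Python for brevity (the library itself is JavaScript, and all names below are hypothetical): accept an optional user-supplied tokenizer, and default to a characters-per-token heuristic.

```python
# Hypothetical sketch of a pluggable-tokenizer design: if the caller
# supplies a tokenize callable, use it; otherwise estimate from the
# character count with a configurable ratio.
from typing import Callable, List, Optional

def make_token_estimator(
    tokenize: Optional[Callable[[str], List[str]]] = None,
    chars_per_token: float = 4.0,  # natural-language rule of thumb
) -> Callable[[str], int]:
    if tokenize is not None:
        return lambda text: len(tokenize(text))
    return lambda text: max(1, round(len(text) / chars_per_token))

estimate = make_token_estimator()
estimate("x" * 400)  # 100 with the default 4-chars-per-token heuristic
```

A caller wanting precise counts would pass a real tokenizer as `tokenize`; everyone else gets the cheap default.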

You’ll have to use some tokenizer to make your rule of thumb. You could thumb your nose at OpenAI and also give some weight to Llama token dictionaries, which are more like 32k entries instead of 100k.

Python: about 4.2 characters per OpenAI token. Closer to 3 with less efficient BPE/SentencePiece token encoders.

400 tokens of Mixtral 46.7B = 1291 characters = 299 tokens of OpenAI cl100k
import json  # needed below to parse the raw response body

# Make API call to OpenAI
c = None
try:
    c = client.chat.completions.with_raw_response.create(**params)
except Exception as e:
    print(f"Error: {e}")

# If we got the response, load a whole bunch of demo variables
# This is different because of the 'with raw response' for obtaining headers
if c:
    headers_dict = c.headers.items().mapping.copy()
    for key, value in headers_dict.items():
        variable_name = f'headers_{key.replace("-", "_")}'
        globals()[variable_name] = value
    remains = headers_x_ratelimit_remaining_tokens  # show we set variables
    print(c.content.decode())
    api_return_dict = json.loads(c.content.decode())
    api_finish_str = api_return_dict.get('choices')[0].get('finish_reason')
    usage_dict = api_return_dict.get('usage')
    api_message_dict = api_return_dict.get('choices')[0].get('message')
    api_message_str = api_return_dict.get('choices')[0].get('message').get('content')
    api_tools_list = api_return_dict.get('choices')[0].get('message').get('tool_calls')
    print("----------")
    # print any response always
    if api_message_str:
        print(api_message_str)

    # pretty-print any tool call functions
    if api_tools_list:
        for tool_item in api_tools_list:
            print(json.dumps(tool_item, indent=2))

Webpack-minified ChatGPT client JavaScript? About 2.5 characters per token.
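The ballpark ratios quoted in this reply, gathered into a small lookup sketch (the figures are the approximate ones above, for cl100k-style tokenizers; treat them as rough, not exact):

```python
# Approximate characters-per-token ratios from this thread (cl100k-style
# tokenizers). These are rules of thumb, not measured constants.
CHARS_PER_TOKEN = {
    "python": 4.2,
    "minified_js": 2.5,
    "natural_language": 4.0,
}

def estimate_tokens(char_count: int, kind: str = "natural_language") -> int:
    return max(1, round(char_count / CHARS_PER_TOKEN[kind]))

estimate_tokens(1291, "python")  # 307; the thread measured 299 actual cl100k tokens
```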

3 Likes

Thanks. I also found this older post that was interesting.

I finally picked a simple character count for the samples and left the choice of tokenizer to the library’s client code.

1 Like

Likewise, typical Smalltalk code has between 3.3 and 3.8 characters per token.