Rules of Thumb for number of source code characters to tokens

I’m trying to estimate how many tokens will be required for (summaries of) source code files, in a variety of programming languages.

Are there any rules of thumb for this mapping? For natural language, 1 token ≈ 4 characters appears to be the average. Is that reasonable for source code as well?

Hey there!

You’re using the API, right? Why not calculate the precise count with tiktoken?
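A minimal sketch of that suggestion: count exactly with tiktoken when it is installed, and fall back to the ~4 characters-per-token rule of thumb otherwise (the function name and fallback ratio are illustrative, not from any library):

```python
# Exact token count via tiktoken when available; otherwise a rough
# character-count estimate (~4 chars per token, the usual rule of thumb).
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    try:
        import tiktoken  # exact count for OpenAI encodings
        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except ImportError:
        return max(1, round(len(text) / 4))  # ballpark estimate

print(count_tokens("def add(a, b):\n    return a + b\n"))
```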

1 Like

I’m writing a lightweight JavaScript library that could be used with other LLMs as well. I don’t want to force a dependency on a specific tokenizer. The library user can specify a tokenizer, but I want to provide a simple default that will be in the right ballpark.
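That design, sketched here in Python for brevity (the library itself is JavaScript, and all names below are hypothetical): accept an optional user-supplied tokenizer, and default to a characters-per-token heuristic.

```python
# Hypothetical sketch of a pluggable-tokenizer design: if the caller
# supplies a tokenize callable, use it; otherwise estimate from the
# character count with a configurable ratio.
from typing import Callable, List, Optional

def make_token_estimator(
    tokenize: Optional[Callable[[str], List[str]]] = None,
    chars_per_token: float = 4.0,  # natural-language rule of thumb
) -> Callable[[str], int]:
    if tokenize is not None:
        return lambda text: len(tokenize(text))
    return lambda text: max(1, round(len(text) / chars_per_token))

estimate = make_token_estimator()
estimate("x" * 400)  # 100 with the default 4-chars-per-token heuristic
```

A caller wanting precise counts would pass a real tokenizer as `tokenize`; everyone else gets the cheap default.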

You’ll have to use some tokenizer to make your rule of thumb. You could thumb your nose at OpenAI and also give some weight to Llama token dictionaries, which are more like 32k entries instead of 100k.

Python: about 4.2 characters per OpenAI token. Closer to 3 with less efficient BPE/SentencePiece token encoders.

400 tokens of Mixtral 46.7B = 1291 characters = 299 tokens of OpenAI cl100k
import json  # needed below to parse the raw response body

# Make API call to OpenAI
c = None
try:
    c = client.chat.completions.with_raw_response.create(**params)
except Exception as e:
    print(f"Error: {e}")

# If we got the response, load a whole bunch of demo variables
# This is different because of the 'with raw response' for obtaining headers
if c:
    headers_dict = c.headers.items().mapping.copy()
    for key, value in headers_dict.items():
        variable_name = f'headers_{key.replace("-", "_")}'
        globals()[variable_name] = value
    remains = headers_x_ratelimit_remaining_tokens  # show we set variables
    print(c.content.decode())
    api_return_dict = json.loads(c.content.decode())
    api_finish_str = api_return_dict.get('choices')[0].get('finish_reason')
    usage_dict = api_return_dict.get('usage')
    api_message_dict = api_return_dict.get('choices')[0].get('message')
    api_message_str = api_return_dict.get('choices')[0].get('message').get('content')
    api_tools_list = api_return_dict.get('choices')[0].get('message').get('tool_calls')
    print("----------")
    # print any response always
    if api_message_str:
        print(api_message_str)

    # pretty-print any tool call functions
    if api_tools_list:
        for tool_item in api_tools_list:
            print(json.dumps(tool_item, indent=2))

Webpack-minified ChatGPT client JavaScript? About 2.5 characters per token.
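The ballpark ratios quoted in this reply, gathered into a small lookup sketch (the figures are the approximate ones above, for cl100k-style tokenizers; treat them as rough, not exact):

```python
# Approximate characters-per-token ratios from this thread (cl100k-style
# tokenizers). These are rules of thumb, not measured constants.
CHARS_PER_TOKEN = {
    "python": 4.2,
    "minified_js": 2.5,
    "natural_language": 4.0,
}

def estimate_tokens(char_count: int, kind: str = "natural_language") -> int:
    return max(1, round(char_count / CHARS_PER_TOKEN[kind]))

estimate_tokens(1291, "python")  # 307; the thread measured 299 actual cl100k tokens
```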

3 Likes

Thanks. I also found this older post that was interesting.

I finally picked a simple character count for the samples and left the choice of tokenizer to the library’s client code.

1 Like

Likewise, typical Smalltalk code has between 3.3 and 3.8 characters per token.