Estimating OpenAI GPT-3.5-Turbo usage costs for French inputs: is this the right approach?


I have a corpus of French documents that will all undergo the same processing with OpenAI. I’ll be extracting information from the texts using French prompts.
Each prompt consists of the text itself plus a question specifying the task we’d like to accomplish.
I am using tiktoken to estimate the number of tokens; my code is as follows:

import tiktoken

# encoding_for_model resolves the right encoding for the model;
# for gpt-3.5-turbo this is cl100k_base.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string."""
    return len(encoding.encode(string))

def count_token(text):
    return num_tokens_from_string(str(text))

df['estimation'] = df['text'].apply(count_token)

Is this the right approach for the French language?
After getting an estimate of the number of tokens, we multiply it by $0.002 per 1K tokens to get a rough estimate of the total price. Is this approach valid?
Does the token count include the output / generated tokens as well?

Thanks in advance for your help

For 3.5-Turbo, prompt and completion tokens (sent in and returned out) cost the same, so $0.002 per 1K tokens is the correct rate. Apply it to the total of the prompt you send to the model plus the result it returns.
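As a minimal sketch of that calculation, at the flat $0.002 / 1K rate (the completion length has to be an assumption until you run the extraction, so treat it as a rough guess per document):

```python
PRICE_PER_1K_TOKENS = 0.002  # flat gpt-3.5-turbo rate for input and output

def estimate_cost(prompt_tokens: int, expected_completion_tokens: int) -> float:
    """Rough dollar cost for one request: prompt plus expected completion."""
    total_tokens = prompt_tokens + expected_completion_tokens
    return total_tokens * PRICE_PER_1K_TOKENS / 1000

# e.g. a 1,500-token prompt with an expected ~200-token answer:
# estimate_cost(1500, 200) -> 0.0034
```

Summing this over the per-document counts in `df['estimation']` (plus a guessed completion size per document) gives the corpus-level estimate.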
