What's the best way to benchmark tokens/sec of fine-tuned model?

gustavo.jakobi · September 20, 2023, 10:17pm

According to the docs, fine-tuning a model can result in lower latency requests. I have a fine-tuned model, and it is indeed faster. However, I would like to calculate the tokens per second (tokens/sec), and I am not sure if the way I am doing it is the best:

1) Start a timer before making the API call.
2) Make the API call using the fine-tuned model.
3) Stop the timer.
4) Use the tokenizer provided by the gpt-3-encoder package to estimate the total tokens in the response.
5) Divide the token count by the time taken.
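
The steps above can be sketched roughly as follows. This is a minimal illustration, not an official method: `tokens_per_second`, `make_request`, and `tiktoken_counter` are hypothetical names introduced here, and the sketch assumes Python with the tiktoken library in place of the gpt-3-encoder package. The request itself is passed in as a callable, so the timing logic does not depend on any particular client library version.

```python
import time


def tokens_per_second(make_request, count_tokens):
    """Time one request and return (tokens, seconds, tokens_per_sec).

    make_request: hypothetical zero-argument callable that performs the
                  API call and returns the completion text.
    count_tokens: callable that maps text to a token count.
    """
    start = time.perf_counter()      # start timer before the API call
    text = make_request()            # the API call itself
    elapsed = time.perf_counter() - start  # stop timer
    tokens = count_tokens(text)      # estimate tokens in the response
    rate = tokens / elapsed if elapsed > 0 else float("inf")
    return tokens, elapsed, rate


def tiktoken_counter(encoding_name="cl100k_base"):
    """Build a token counter backed by tiktoken (assumed to be installed)."""
    import tiktoken  # third-party; imported lazily so the timer works without it

    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats special-token strings as ordinary text
    return lambda text: len(enc.encode(text, disallowed_special=()))
```

Usage would look something like `tokens_per_second(lambda: call_my_model(prompt), tiktoken_counter())`, where `call_my_model` stands in for whatever function wraps your fine-tuned model. Note that the elapsed time includes network latency and prompt processing, so the figure is end-to-end throughput rather than pure generation speed.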

It looks correct, but I would like to know if there is any more “official” way.

Foxalabs · September 20, 2023, 10:25pm

Hi and welcome to the Developer Forum!

A couple of things to bear in mind: 1) make sure you are using tiktoken with the cl100k_base encoding for GPT-3.5 token counting; 2) there will be a certain amount of pre-processing time to take into account.

One thing you can do is turn streaming on, and then you get to see the tokens arrive in real time, albeit with a small overhead for each delta packet.
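
As a rough sketch of the streaming idea, the helper below separates the wait for the first chunk (which includes the pre-processing time) from the time spent receiving the rest. `time_stream` and `open_stream` are hypothetical names; `open_stream` is assumed to be a zero-argument callable that sends the request and returns an iterable of text pieces. Each streamed chunk usually carries about one token, but that is an assumption, so treat the chunk rate as an approximation of tokens/sec.

```python
import time


def time_stream(open_stream):
    """Consume a streamed response and report timing statistics.

    open_stream: hypothetical zero-argument callable that sends the
                 request and returns an iterable of text pieces.
    """
    start = time.perf_counter()
    first_chunk_s = None
    pieces = []
    for piece in open_stream():
        if first_chunk_s is None:
            # Latency before any output arrives (network + pre-processing)
            first_chunk_s = time.perf_counter() - start
        pieces.append(piece)
    total_s = time.perf_counter() - start

    # Rate over the generation phase only, excluding the initial wait
    gen_rate = None
    if first_chunk_s is not None and len(pieces) > 1:
        gen_time = total_s - first_chunk_s
        if gen_time > 0:
            gen_rate = (len(pieces) - 1) / gen_time

    return {
        "first_chunk_s": first_chunk_s,
        "total_s": total_s,
        "n_chunks": len(pieces),
        "gen_chunks_per_s": gen_rate,
        "text": "".join(pieces),
    }
```

For an exact token count, the joined `text` can be run through a tokenizer afterwards instead of relying on the chunk count.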

_j · September 21, 2023, 5:58am

When you don’t set stream=True, the response includes the token count (in its usage field).

I made a little chat loop for completions (if you trained davinci-002 or babbage-002) that reports the total time of the API call.

import os
import time
import openai  # uses the legacy (pre-1.0) openai library interface

stream = False
# Assumption: the API key is read from the OPENAI_API_KEY environment variable
openai.api_key = os.environ["OPENAI_API_KEY"]
system = """
An AI assistant replies to user input. It keeps no memory of chat.
assistant: I am a helpful artificial intelligence, capable of many human-like tasks.
""".strip()
user = "Write an introduction a user will see when they first start your chatbot program"
while user not in ["exit", ""]:
    stime = time.time()
    api_out = openai.Completion.create(
        prompt=system + "\n\nuser: " + user + "\nassistant:",
        # Substitute your own fine-tuned model ID here
        model="gpt-3.5-turbo-instruct", stream=stream, max_tokens=666)
    # With stream=False the call returns only once the full completion has arrived
    ctime = round(time.time() - stime, ndigits=3)
    if stream:
        # Streaming branch only prints the text; no timing is reported here
        for chunk in api_out:
            print(chunk["choices"][0]["text"], end='')
        print()
    else:
        print(api_out['choices'][0]['text'].strip())
        # Token count as reported by the API for the generated text
        ctokens = int(api_out['usage']['completion_tokens'])
        tps = round(ctokens / ctime, ndigits=1)
        print(f"-- completion: time {ctime}s, {ctokens} tokens, {tps} tokens/s --")
    user = input("==>")

Output of interactions:

Hello! My name is AI Assistant and I am here to assist you with any tasks or questions you may have. I am constantly learning and improving to provide you with the best experience possible. How may I help you today?
-- completion: time 0.756s, 45 tokens, 59.5 tokens/s --
==>How many cats can happily and healthfully occupy an average home?
The number of cats that can happily and healthfully occupy an average home can vary depending on the size of the home and the individual needs of the cats. It’s best to consult with a veterinarian or animal behaviorist for specific recommendations.
-- completion: time 0.903s, 47 tokens, 52.0 tokens/s --
==>Supply an AI estimation and be decisive: How many cats can happily and healthfully occupy an average home?
The number of cats that can happily and healthfully occupy an average home would vary depending on factors such as space, resources, and individual preferences. However, a general estimation would suggest that 2-3 cats would be a reasonable number for a happy and healthy living environment. It is important to also consider the wellbeing and when making a decision about pet ownership.
-- completion: time 0.734s, 72 tokens, 98.1 tokens/s --
