Price for fine-tuning with gpt-3.5-turbo and gpt-4

How do I estimate the price of fine-tuning with gpt-3.5-turbo-1106?
What's the price for fine-tuning with gpt-4?

Welcome to the forum.

I believe you meant fine-tuning something (gpt-3.5/4), not with something.

If that's the case, you can see the price for gpt-3.5-turbo on the OpenAI pricing page. GPT-4 cannot be fine-tuned yet.

Model            Training               Input usage            Output usage
gpt-3.5-turbo    $0.0080 / 1K tokens    $0.0030 / 1K tokens    $0.0060 / 1K tokens

Fine-tuning accepts a training file containing many example conversations that show the kind of new response the AI should produce for that style of input.
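
For example, one line of the JSONL training file might look like this (the contents are made up purely for illustration):

{"messages": [{"role": "system", "content": "You answer in pirate speak."}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "Arr, yer parcel still be out at sea!"}]}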

Each message within an example conversation, including the final assistant completion (e.g. [system, user, assistant]), contains language the AI model receives, encoded by a BPE tokenizer. An exact token count of the contents of each JSON line can be made with tiktoken, a library that encodes text the same way the billing is calculated.
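
As a quick illustration (a minimal sketch; the string is just a placeholder), counting the billable tokens of a single piece of message content looks like this:

import tiktoken

# cl100k_base is the encoding used by the gpt-3.5-turbo family
encoding = tiktoken.get_encoding("cl100k_base")

tokens = encoding.encode("You are a helpful assistant that answers in pirate speak.")
print(len(tokens))  # number of billable tokens for this text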

Additionally, the total token count of the file is multiplied by the n_epochs parameter, which is the number of training passes performed over the training file. Expect about 8 epochs on a small file if you leave the epochs parameter unspecified and let OpenAI auto-configure it.
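
As a rough worked example: a training file of 50,000 tokens that is auto-configured to 8 epochs would be billed for 50,000 × 8 = 400,000 training tokens, or about $3.20 at the $0.0080 / 1K training rate shown above.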

That is the “training” cost above.

Using the fine-tuned model also costs considerably more than the base model, but for particular applications you can hopefully achieve quality beyond even that offered by more expensive models and heavier prompting.

You can estimate the cost using the following formula:

total cost ≈ (base training cost per 1K tokens ÷ 1,000) × tokens in the training file × number of epochs

Or, using this Python source code (the epoch auto-configuration constants at the top mirror the heuristics in OpenAI's fine-tuning data-preparation cookbook):

import json
import tiktoken

# Epoch auto-configuration heuristics, as used in OpenAI's data-prep cookbook
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

# Training price per 1K tokens for gpt-3.5-turbo (see the pricing table above)
BASE_COST_PER_1K_TOKENS = 0.0080


def cost_estimation(training_file_name: str, model: str = 'gpt-3.5-turbo-0125') -> None:
    def num_tokens_from_messages(messages: list) -> int:
        # Per-message and per-name overhead for cl100k_base chat models
        tokens_per_message = 3
        tokens_per_name = 1

        num_tokens = 0
        for message in messages:
            num_tokens += tokens_per_message
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":
                    num_tokens += tokens_per_name
        num_tokens += 3  # every reply is primed with assistant tokens
        return num_tokens

    encoding = tiktoken.encoding_for_model(model)
    convo_lens = []
    dataset = []
    n_messages = []

    # Each line of the JSONL training file is one example conversation
    with open(training_file_name, mode='r', encoding='utf-8') as f:
        for line in f:
            dataset.append(json.loads(line))

    print("Num examples:", len(dataset))

    for ex in dataset:
        messages = ex["messages"]
        n_messages.append(len(messages))
        convo_lens.append(num_tokens_from_messages(messages))

    # Replicate the auto-configuration of n_epochs for small or very large datasets
    n_epochs = TARGET_EPOCHS
    n_train_examples = len(dataset)

    if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
        n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
    elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
        n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

    print("Num epochs:", n_epochs)
    n_billing_tokens_in_dataset = sum(convo_lens)
    final_cost = (BASE_COST_PER_1K_TOKENS * n_billing_tokens_in_dataset * n_epochs) / 1000

    print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
    print(f"By default, you'll train for {n_epochs} epochs on this dataset")
    print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
    print(f"Final cost will be ~$ {final_cost}")


cost_estimation("/tmp/training.jsonl")
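
A note on the epoch logic in the script: for small datasets (roughly fewer than 34 examples) it raises the epoch count so that about 100 training examples are seen in total, capped at 25 epochs, which is why a small file can default to around 8 epochs; for very large datasets (more than about 8,300 examples) it scales the epochs down, with a floor of 1.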