How does temperature impact fine-tuned models?

Hello Friends,

Whenever I fine-tune GPT-3.5-Turbo, I'm unsure what temperature I should set while testing it.

Do we have any documentation for understanding this?

No hard rules.

@eawestwrites has been doing fine-tuning and just mentioned temp can be finicky on fine-tunes… Start lower like .5 to .7 and test, test, test.
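One low-effort way to "test, test, test" is to sweep a few temperatures over the same prompt and compare outputs side by side. Here is a minimal sketch in Python; the fine-tune ID, prompt, and helper name are placeholders, and the payloads are just dicts you would pass to the official `openai` client yourself:

```python
# Sketch: build one request payload per temperature for a sweep.
# MODEL is a hypothetical fine-tune ID -- substitute your own.
MODEL = "ft:gpt-3.5-turbo:my-org:example:abc123"

def build_sweep(prompt: str, temps=(0.5, 0.6, 0.7)) -> list[dict]:
    """Return one chat-completions request payload per temperature."""
    return [
        {
            "model": MODEL,
            "temperature": t,
            "messages": [{"role": "user", "content": prompt}],
        }
        for t in temps
    ]

payloads = build_sweep("Summarize this ticket in one line.")
# Each payload can then be sent with client.chat.completions.create(**payload);
# here we only build them for inspection.
```

Running each payload a few times per temperature gives a quick feel for where the fine-tune starts to get unstable.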

2 Likes

Got it!

Also, do we always have to use the same system message for each conversation while fine-tuning?

What’s the role of system message in general while fine-tuning?

Exactly @PaulBellow

I could be wrong on this, but there’s not a lot of documentation for non-ML types on this technology. So please correct me if this analogy is wrong.

A non-fine-tuned model has a vector field arranged by the core training the model had. The tokens have addresses in what I always imagine as a galaxy of stars; those addresses sit in relationship to each other based on the totality of the massive amounts of training data the model had on those tokens.

A fine-tune allows you to skew those addresses… slightly. But slightly enough that it impacts the way the LLM selects tokens.

In the absence of a fine-tune, we use sampling parameters like Temperature, Top_P, Presence penalty, etc. to skew those addresses, or rather the retrieval of tokens from those addresses.

So where a Temperature adjustment, in layman’s terms, is a generic skew, your fine-tune is a specific skew to your needs. And a skew on top of a skew? Well, as my Dad always said, “Tight is tight, too tight is broke.”

That is my hypothesis on why a lower Temperature and Top_P yield better results on a fine-tune than the values you would use on the base model.

2 Likes

Can you shed some light on this?

Thanks!

Consider how a token’s certainties come about from an input:

  • pretraining on corpus
  • sparsity, quantization, and other model optimization
  • additional learning on fine-tune contexts -> reweighting
  • attention and masking

Then, from all tokens and the likelihood they are an appropriate choice as the next generation, these are applied in order:

  • logit_bias parameter
  • top_p nucleus sampling subset
  • temperature rescaling of likelihood
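
For intuition, the three sampling stages above can be sketched as a toy pipeline over a tiny vocabulary. This follows the order as listed (bias, then nucleus cut, then temperature); the real server-side order isn't documented, so treat it purely as illustration:

```python
import math

def sample_distribution(logits: dict, logit_bias=None,
                        top_p: float = 1.0, temperature: float = 1.0) -> dict:
    """Toy sampling pipeline: logit_bias -> top_p nucleus cut -> temperature."""
    # 1) logit_bias: add a per-token offset to the raw logits
    biased = {t: l + (logit_bias or {}).get(t, 0.0) for t, l in logits.items()}

    # softmax to probabilities for the nucleus cut
    mx = max(biased.values())
    exps = {t: math.exp(l - mx) for t, l in biased.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}

    # 2) top_p: keep the smallest set of top tokens whose cumulative
    #    probability reaches the threshold
    kept, cum = [], 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(t)
        cum += p
        if cum >= top_p:
            break

    # 3) temperature: rescale the surviving logits and renormalize
    mx = max(biased[t] for t in kept)
    exps = {t: math.exp((biased[t] - mx) / max(temperature, 1e-6)) for t in kept}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}
```

With a sharp fine-tune, the nucleus cut quickly discards the low-quality tail, which is consistent with the "few good top tokens" hypothesis.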

I propose that a fine-tune gives just a few top tokens that align with its goals, less semantically based, so the tail of token choices falls off in quality quickly.

The system message has two functions:

  • it has actual instructions to be followed
  • it sets up the pattern matching that shows a fine-tune path is to be followed

So if you’ve trained two behaviors into a fine-tune, they can be brought out more strongly by using the same starting sequence:

  • The AI is a backend processor of data, converting it into RabbitText.
  • You are SharkBait, a user’s best chum.
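Here is a sketch of what "same starting sequence" looks like in the chat fine-tuning JSONL format, keying each trained behavior to one fixed system message. The behavior names and example strings are made up; only the `messages` structure mirrors the documented training format:

```python
import json

# Two hypothetical behaviors, each anchored to one fixed system message
# (mirroring the two examples above).
SYSTEMS = {
    "rabbit": "The AI is a backend processor of data, converting it into RabbitText.",
    "shark": "You are SharkBait, a user's best chum.",
}

def training_line(behavior: str, user: str, assistant: str) -> str:
    """One JSONL line in the chat fine-tuning format, reusing the same
    system message for every training example of a given behavior."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": SYSTEMS[behavior]},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    })

line = training_line("rabbit", "raw: 42,17", "RabbitText: <42|17>")
```

At inference time you would then send the identical system string to pull the model down the matching trained path.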
1 Like

I received the response below from GPT-3.5 (gpt-3.5-turbo-0125). From this, I would take a Temp of 0.7 and a Top-p of 0.9 as the settings most appropriate to use in subsequent queries on the fine-tuned model, given the results and randomness implied by the fine-tuning training process.

Any variation in future queries on the fine-tuned model thereafter would be based on similar considerations as for an untrained model.

This is the approach I intend to adopt. I am not an expert and cannot attest to this from wide experience.


fine-tuned models, default settings for Temperature and Top-p

In OpenAI’s fine-tuned models, the default settings for Temperature and Top-p are typically set to specific values that are optimized for balanced and high-quality text generation. Here are the typical default settings:

  1. Temperature: Generally, the default temperature setting in OpenAI’s fine-tuned models is around 0.7. Temperature controls the randomness of the generated text. Lower temperatures result in more deterministic output, while higher temperatures lead to more diversity and randomness in the generated text.

  2. Top-p (nucleus sampling): The default setting for Top-p is often around 0.9. Top-p (nucleus sampling) controls the diversity of the generated text by restricting the sampling to the most likely tokens whose cumulative probability exceeds a certain threshold (p). This helps in avoiding low-quality or nonsensical generated text.

These default settings are designed to provide a good balance between coherence and diversity in text generation. However, you can always customize these parameters based on your specific use case and requirements to achieve the desired output quality.

I once forgot to set the temperature of one of my fine-tuned models and it defaulted to some other value higher than zero. In this case, it resulted in nearly 30-40% error rates and sometimes garbled output (i.e. non-existent words, grammar errors). I’ve since stuck to 0 and never looked back.

I think, though, that it’s use-case specific. As Paul suggested, testing is the way to go.

1 Like