How does temperature impact fine-tuned models?

Hello Friends,

Whenever I fine-tune GPT-3.5-Turbo, I face this challenge where I am unsure about the temperature I should set while testing it.

Do we have any documentation for understanding this?

No hard rules.

@eawestwrites has been doing fine-tuning and just mentioned temp can be finicky on fine-tunes… Start lower, like 0.5 to 0.7, and test, test, test.
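To make "test, test, test" concrete, here's a minimal sketch of sweeping the lower temperature range when evaluating a fine-tune. The model id and prompt are hypothetical placeholders; this just assembles the chat-completion payloads you'd send, one per temperature setting.

```python
def build_test_request(model: str, prompt: str, temperature: float) -> dict:
    """Assemble one chat-completion payload for a single test run."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

# Sweep the lower range suggested above; send each payload and compare outputs.
payloads = [
    build_test_request("ft:gpt-3.5-turbo-0125:acme::abc123", "Summarize: ...", t)
    for t in (0.5, 0.6, 0.7)
]
```

Running the same prompt at each setting and eyeballing the outputs side by side is usually the fastest way to find where your particular fine-tune starts to wobble.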


Got it!

Also, do we always have to use the same system message for each conversation while fine-tuning?

What’s the role of system message in general while fine-tuning?

Exactly @PaulBellow

I could be wrong on this, but there’s not a lot of documentation for non-ML types on this technology. So please correct me if this analogy is wrong.

A non-fine-tuned model has a vector field arranged by the core training the model had. The tokens have addresses in what I always imagine as a galaxy of stars; those addresses are in relationship to each other based on the totality of the massive amounts of training data the model had on those tokens.

A fine-tune allows you to skew those addresses… slightly. But slightly enough that it impacts the way the LLM selects tokens.

In the absence of a fine-tune, we use sampling parameters like Temperature, Top_P, presence penalty, etc., to skew those addresses, or the retrieval of tokens from the addresses.

So where a Temperature adjustment, in layman's terms, is a generic skew, your fine-tune is a specific skew to your needs. And a skew on top of a skew? Well, as my Dad always said: "Tight is tight, too tight is broke."

That is my hypothesis on why a lower Temperature and Top_P yields better results on a fine-tune than the values you would use on the base model.
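The "generic skew" part of that hypothesis is easy to see in the math: temperature just divides the logits before the softmax, so a lower temperature concentrates probability mass on the top tokens. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Rescale logits by 1/temperature, then softmax into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # arbitrary example scores for four tokens
p_low = softmax_with_temperature(logits, 0.5)
p_high = softmax_with_temperature(logits, 1.0)
# At the lower temperature, the top token grabs noticeably more mass.
```

If the fine-tune has already pushed a handful of tokens to the top, stacking a low temperature on top of that sharpens the distribution even further, which is the "skew on top of a skew" concern.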


Can you shed some light on this?


Consider a token's certainties as coming about from inputs like:

  • pretraining on corpus
  • sparsity, quantization, and other model optimization
  • additional learning on fine-tune contexts → reweighting
  • attention and masking

Then, from all tokens and the likelihood that each is an appropriate choice as the next generation, the following are applied in order:

  • logit_bias parameter
  • top_p nucleus sampling subset
  • temperature rescaling of likelihood
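A rough sketch of that selection pipeline, in the order listed above (the API's actual internal ordering isn't published, so treat this as an illustration, not the real implementation; logits, bias values, and parameters are all made up):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_distribution(logits, logit_bias=None, top_p=1.0, temperature=1.0):
    """Illustrative pipeline: logit_bias -> top_p nucleus subset -> temperature."""
    # 1. logit_bias: additive nudges for specific token indices.
    biased = list(logits)
    for idx, bias in (logit_bias or {}).items():
        biased[idx] += bias
    # 2. top_p: keep the smallest set of tokens whose cumulative mass >= top_p.
    probs = softmax(biased)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # 3. temperature: rescale the surviving logits and renormalize.
    survivors = softmax([biased[i] / temperature for i in kept])
    return dict(zip(kept, survivors))

dist = sample_distribution(
    [3.0, 2.0, 0.5, -1.0], logit_bias={2: 5.0}, top_p=0.9, temperature=0.7
)
```

Note how the bias promotes token 2 into the nucleus, the nucleus cut discards the tail, and only then does temperature reshape what's left, so a fine-tune that already front-loads a few tokens interacts with every stage.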

I propose that a fine-tune promotes just a few top tokens that align with its goals, less semantically based, so the tail of token choices falls off in quality quickly.

The system message has two functions:

  • it has actual instructions to be followed
  • it sets up the pattern matching that shows a fine-tune path is to be followed

so if you’ve trained on two behaviors in a fine-tune, they can be brought out more strongly by using the same starting sequence:

  • The AI is a backend processor of data, converting it into RabbitText.
  • You are SharkBait, a user’s best chum.
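In practice that means starting every conversation with the exact system string the fine-tune was trained on. A small sketch using the two example behaviors above (the helper function and user inputs are hypothetical placeholders):

```python
# The two trained system messages from the examples above.
RABBIT_SYSTEM = "The AI is a backend processor of data, converting it into RabbitText."
SHARK_SYSTEM = "You are SharkBait, a user's best chum."

def make_messages(system_message: str, user_input: str) -> list[dict]:
    """Open the conversation with the same system message used in training,
    so the pattern matching lands on the intended fine-tune path."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_input},
    ]

rabbit = make_messages(RABBIT_SYSTEM, "id,name\n1,Flopsy")
shark = make_messages(SHARK_SYSTEM, "Hey, how's it going?")
```

Swapping the system message at inference time then acts like a switch between the two trained behaviors.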