How temperature impacts fine tuned models?

Hello Friends,

Whenever I fine-tune GPT-3.5-Turbo, I face this challenge where I am unsure about the temperature I should set while testing it.

Do we have any documentation for understanding this?

No hard rules.

@eawestwrites has been doing fine-tuning and just mentioned temp can be finicky on fine-tunes… Start lower like .5 to .7 and test, test, test.

2 Likes

Got it!

Also, do we always have to use the same system message for each conversation while fine-tuning?

What’s the role of system message in general while fine-tuning?

Exactly @PaulBellow

I could be wrong on this, but there’s not a lot of documentation for non-ML types on this technology. So please correct me if this analogy is wrong.

A non-fine tuned model has a vector field arranged on the core training the model had. The tokens have addresses in what I always imagine as a galaxy of stars, those addresses are in relationship to each other based on the totality of massive amounts of training data the model had on those tokens.

A fine-tune allows you to skew those addresses… slightly. But slightly enough that it impacts the way the LLM selects tokens.

In the absence of a fine tune, we use hyperparameters, like Temperature and Top_P and Presence penalty etc. to skew those addresses/or the retrieval of tokens from the addresses.

So where a Temperature adjustment in layman’s terms is a generic skew, your fine tune is a specific skew to your needs. And a skew on top of a skew? Well, as my Dad always said “Tight is tight, too tight is broke.”

That is my hypothesis on why a lower Temperature and Top_P yields better results on a fine-tune than the values you would use on the base model.

2 Likes

Can you shed some light on this?

Thanks!

consider a token certainties to come about from an input like:

  • pretraining on corpus
  • sparsity, quantization, and other model optimization
  • additional learning on fine-tune contexts ->reweighting
  • attention and masking

Then from all tokens and the likelihood they are an appropriate choice as the next generation, in order:

  • logit_bias parameter
  • top_p nucleus sampling subset
  • temperature rescaling of liklihood

I propose that fine-tune gives just a few top tokens that align with its goals, less semantically based, so the tail of token choices fallls in quality quickly.

system message has two functions:

  • it has actual instructions to be followed
  • it sets up the pattern matching that shows a fine-tune path is to be followed

so if you’ve trained on two behaviors in a fine-tune, they can be brought out more strongly by using the same starting sequence:

  • The AI is a backend processor of data, converting it into RabbitText.
  • You are SharkBait, a user’s best chum.
2 Likes

I received the response below from GPT-3 (gpt-3.5-turbo-0125). From this, I would take Temp of 0.7 and Top-p of 0.9 as the settings most appropriate to use in subsequent queries in the Fine Tuned model for the results and randomness implied in the Fine Tuning training process.

Any variation in future queries on the fine tuned model thereafter, would be based on similar considerations as for an untrained model.

This is the approach I intend to adopt. I am not expert and cannot attest to this from wide experience.


fine-tuned models, default settings for Temperature and Top-p

In OpenAI’s fine-tuned models, the default settings for Temperature and Top-p are typically set to specific values that are optimized for balanced and high-quality text generation. Here are the typical default settings:

  1. Temperature: Generally, the default temperature setting in OpenAI’s fine-tuned models is around 0.7. Temperature controls the randomness of the generated text. Lower temperatures result in more deterministic output, while higher temperatures lead to more diversity and randomness in the generated text.

  2. Top-p (nucleus sampling): The default setting for Top-p is often around 0.9. Top-p (nucleus sampling) controls the diversity of the generated text by restricting the sampling to the most likely tokens whose cumulative probability exceeds a certain threshold (p). This helps in avoiding low-quality or nonsensical generated text.

These default settings are designed to provide a good balance between coherence and diversity in text generation. However, you can always customize these parameters based on your specific use case and requirements to achieve the desired output quality.

I once forgot to set the temperature of one of my fine-tuned models and it defaulted to some other value higher than zero. In this case, it resulted in nearly 30-40% error rates and sometimes garbled output (i.e. non-existent words, grammar errors). I’ve since stuck to 0 and never looked back.

I think though it’s use case specific. As Paul suggested, testing is the way to go.

1 Like

Let’s break this down a bit. Say you’ve got this prompt that’s loaded with instructions. We’re talking few-shot examples, chain of thought reasoning, rules of engagement, etc. Lets say you use finetuned a model using the outputs from this beefy prompt paired with some basic inputs and no system message. You’ve essentially baked all those instructions right into it. It’s like the model’s absorbed all that guidance, so you don’t need to keep repeating those instructions every time. I’m sure you know this but it’s important to say before I continue.

So what happens when you use system messages after this? Well, they take on a different role. Instead of being the main source of instructions, they become more like a gentle nudge. They can slightly steer the model in one direction or another, but they’re not doing the heavy lifting anymore.

Now, here’s a crucial point: if you try to add system messages “while” fine-tuning, like in the fine-tuning training dataset, without adding the proper guidance, you’ll actually make the model less effective at steering. What I mean is, if you use the base responses from all the instructions and try to slip a steering system message in there without modifying the output accordingly, you’ll train the model to make the same responses without modification. Essentially, you’d be teaching the model to ignore those system messages. So be careful with that.

Now, let me throw an example at you to show what I mean with a code generation task:

Before fine-tuning: System: “You’re a Python whiz. Give me well-commented code, explain the algorithm, throw in a docstring and type hints.” User: “Write me a Python function for Fibonacci numbers.” Assistant: [Delivers a basic Python function, might be missing comments, docstring, or type hints]

After fine-tuning, no system message: User: “Write me a Python function for Fibonacci numbers.” Assistant: [Gives you a well-structured function with comments, docstring, and type hints, following the style it was trained on]

After fine-tuning, with system message: System: “Make it memory-efficient.” User: “Write me a Python function for Fibonacci numbers.” Assistant: [Delivers a well-structured function with comments, docstring, and type hints, but now it’s optimized for memory]

See how it plays out? Before fine-tuning, you needed that big system message to get all the details. After fine-tuning, the model’s already on top of those coding best practices without any prompting. And when we throw in a system message post-fine-tuning, it’s just nudging the model towards a specific optimization.