Temperature, top_p and top_k for chatbot responses


I’m using GPT as a chatbot. I have successfully fine-tuned the model on conversation data. For inference I’m now using temperature = 1, top_p = 0.6, and top_k = 35. In the following link it is written that for chatbot responses it is best to use temperature = 0.5 and top_p = 0.5. On the other hand, I have also read elsewhere that temperature = 1 or top_p = 1 should hold.

What values for temperature, top_p and top_k are best to use for chatbot responses? The chatbot should stick to the learned knowledge from the conversation data (and not hallucinate facts) but should also not produce repetitive responses (be somewhat creative).

Today we try out Claude v2:

Nucleus sampling is a technique used in large language models to control the randomness and diversity of generated text. It works by sampling from only the most likely tokens in the model’s predicted distribution.

The key parameters are:

  • Temperature: Controls randomness, higher values increase diversity.

  • Top-p (nucleus): The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus.

  • Top-k: Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens.

In general:

  • Higher temperature will make outputs more random and diverse.

  • Lower top-p values reduce diversity and focus on more probable tokens.

  • Lower top-k also concentrates sampling on the highest probability tokens for each step.

So temperature increases variety, while top-p and top-k reduce variety and focus samples on the model’s top predictions. You have to balance diversity and relevance when tuning these parameters for different applications.
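To make the interaction between the three knobs concrete, here is a minimal pure-Python sketch of one common ordering: temperature first, then top-k, then top-p. The function names and the toy logits are made up for illustration. Real libraries operate on tensors and may apply the filters in a different order, so treat this as a sketch rather than any particular library's implementation.

```python
import math
import random

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    # 1. Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # 2. Top-k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(p for _, p in ranked)
        ranked = [(tok, p / kept_mass) for tok, p in ranked]
    # 3. Top-p: keep the smallest prefix whose cumulative probability
    #    reaches top_p, then renormalize what is left.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {tok: p / kept_mass for tok, p in kept}

def sample_token(logits, rng=random, **settings):
    """Draw one token from the filtered distribution."""
    dist = sampling_distribution(logits, **settings)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

# Hypothetical next-token logits, not taken from any real model.
toy_logits = {"banana": 5.0, "lemon": 3.0, "pear": 2.0, "corn": 1.0, "bus": -1.0}

for settings in ({"temperature": 1.0},
                 {"temperature": 1.0, "top_p": 0.9},
                 {"temperature": 2.0, "top_k": 3}):
    print(settings, sampling_distribution(toy_logits, **settings))
```

With these toy numbers, lowering top-p or top-k shrinks the candidate set, while raising temperature moves probability away from the top token.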

OpenAI recommends altering either temperature or top-p from its default, but not both.

Top-k is not exposed in the OpenAI API.
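As a rough illustration of that recommendation, here is a small helper that builds the sampling arguments for a Chat Completions request and warns if both knobs are moved away from their default of 1. The helper name `chat_sampling_params` is hypothetical; only the `temperature` and `top_p` parameter names come from the API, and the range checks reflect the documented limits as I understand them.

```python
import warnings

def chat_sampling_params(temperature=None, top_p=None):
    """Hypothetical helper: sampling kwargs for a Chat Completions call."""
    params = {}
    if temperature is not None:
        if not 0.0 <= temperature <= 2.0:
            raise ValueError("temperature is expected to be between 0 and 2")
        params["temperature"] = temperature
    if top_p is not None:
        if not 0.0 <= top_p <= 1.0:
            raise ValueError("top_p is expected to be between 0 and 1")
        params["top_p"] = top_p
    # There is no top_k entry: the API does not expose it.
    changed = [name for name, value in params.items() if value != 1.0]
    if len(changed) == 2:
        warnings.warn("Both temperature and top_p differ from the default; "
                      "the usual advice is to change only one of them.")
    return params

print(chat_sampling_params(temperature=0.4))

# Possible usage with the openai Python package (v1-style client), not run here:
#   client.chat.completions.create(model="gpt-3.5-turbo", messages=messages,
#                                  **chat_sampling_params(temperature=0.4))
```

Nothing stops you from setting both; the warning just makes it a deliberate choice.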

Nucleus sampling parameters alone cannot stop an AI from hallucinating, but they can keep the output on a path of low perplexity. When the temperature is set high, alternate token choices can be made that are not a good fit:

The cause of most astronaut deaths in one word?
Acc = 87.53%
M = 5.81%
Expl = 2.98%
F = 0.60%
Mis = 0.34%

You can see that one misstep can send the conversation on a whole new course.
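To see how much temperature changes those odds, the listed probabilities can be re-weighted by raising each to the power 1/T and renormalizing, which is equivalent to dividing the underlying logits by T. This sketch assumes the numbers above were measured at temperature 1 and ignores the roughly 2.7% of probability mass belonging to tokens that were not listed, so the results are approximate.

```python
# Top-5 token probabilities from the example above, as fractions.
listed = {"Acc": 0.8753, "M": 0.0581, "Expl": 0.0298, "F": 0.0060, "Mis": 0.0034}

def apply_temperature(probs, temperature):
    """Re-weight probabilities as if the logits had been divided by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    weights = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

for t in (0.4, 1.0, 1.5):
    row = ", ".join(f"{tok}={p:.2%}" for tok, p in apply_temperature(listed, t).items())
    print(f"T={t}: {row}")
```

At a low temperature nearly all of the mass ends up on "Acc", while at 1.5 the alternatives are picked noticeably more often.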

Try temperature 0.4 unless you really want unexpected writing. The person chatting about computer code will appreciate it.


Great write-up!

  • I’ll need to try some sample prompts with a few different settings.
  • Any thoughts on setting both temperature and top-p to non-default values (despite recommendations)?

Some notes from my testing, mainly writing code for a specific task, using:

  • Chat Completions
  • gpt-3.5-turbo

Even with the prompt significantly massaged, some instructions in the System Prompt are ignored.

Increasing the temperature to 1.5 almost always gets me the expected behavior - although repeated calls are much less reliable, and the overall cohesion of the answer is compromised.

It’s possible my prompts just need to be streamlined further - get more specific, a little more verbose?

Several calls with the same prompts usually get me enough good code to get around any mistakes.

This is not ideal, and I will be trying the configs you have mentioned.

Softmax temperature can be thought of as the amount of noise injected into the decision-making process.

Top-p can be considered a cutoff that pushes selection towards the top results: only the most likely tokens, up to a cumulative probability of p, stay in the running.

The current models don’t need the temperature increased to be “creative”; they already produce poorer tokens than before. Increasing it will only help to break deterministic output for you on repeated runs.
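A quick way to check the "repeated runs" effect is to sample the same toy distribution many times at different temperatures and count how many distinct tokens show up. The logits and function names below are invented for the demonstration and do not come from any real model.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distinct_tokens(logits, temperature, runs=200, seed=0):
    """Number of different token indices seen over repeated draws."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    draws = rng.choices(range(len(logits)), weights=probs, k=runs)
    return len(set(draws))

toy_logits = [5.0, 3.0, 2.0, 1.0, -1.0]  # hypothetical next-token scores
for t in (0.1, 0.7, 1.5):
    print(f"temperature={t}: {distinct_tokens(toy_logits, t)} distinct tokens in 200 draws")
```

At a very low temperature every run picks the same token; as temperature rises, repeated runs start to diverge.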

Thank you very much for your answer. So you recommend temperature = 0.4 and top_p = 1?

I’m also using a local Huggingface model (GPT-J) where I can set top_k. What value would you recommend for top_k in that case? The default value is 50 for top_k.

Top-k is how many of the highest-ranking tokens are considered; everything below that is excluded. Set it to 1 and you are almost guaranteed the top predicted token, which temperature can’t affect. The quality of tokens goes down quickly after the first few: you might get some extra carriage returns, a comma to continue a sentence instead of a period, more hyphens as a line break, or different ways to start producing a list of 10 fun facts. It also depends on whether the candidates have near-equal weighting or there is one clear answer.

How many more possibilities do you need for “a yellow fruit” than a banana that will always be chosen? A lower number might have an infinitesimal improvement on performance. A number equivalent to the whole token dictionary size doesn’t really matter if the temperature is under control. Good to let the AI company choose the optimum.
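For the local Hugging Face case, the settings discussed here map onto keyword arguments of `generate()`. The sketch below is only a starting point: the values are illustrative rather than tuned, the model id `EleutherAI/gpt-j-6b` is an assumption (a fine-tuned checkpoint path would go there instead), and `generate_reply` is a hypothetical helper name.

```python
# Illustrative sampling settings; adjust to taste.
GENERATION_KWARGS = {
    "do_sample": True,      # sampling must be on, otherwise the knobs below have no effect
    "temperature": 0.4,     # the lower-randomness value suggested earlier in the thread
    "top_p": 1.0,           # 1.0 leaves nucleus filtering off
    "top_k": 50,            # the library default mentioned above
    "max_new_tokens": 60,
}

def generate_reply(prompt, model_name="EleutherAI/gpt-j-6b"):
    """Hypothetical helper: sample one reply from a local causal LM."""
    # Imported lazily so the settings above can be inspected without
    # transformers installed or a multi-gigabyte model downloaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # GPT-J defines no pad token
        **GENERATION_KWARGS,
    )
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Setting `"top_k": 1` in this dictionary would make the output effectively greedy regardless of temperature, which is an easy way to verify the point about top-k above.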

Thank you very much for your explanation. So what is OpenAI using as top_k value by default?

Unknown. The top-k is big enough that a ridiculously-high temperature gives ridiculously-unlikely tokens.

Hi, can you share whether the temperature and top-k parameters can only be set through the ChatGPT API, or is there a way to change them while using ChatGPT on OpenAI’s website?

There are two separate things: ChatGPT, and the API for accessing AI models via software.

ChatGPT, the website, doesn’t let us know the settings it uses, and there is no hack for discovering them (although one could do statistical analysis over hundreds of repeats of the same session).

The settings of ChatGPT are likely similar to recommendations seen in various documentation: temperature=0.7, top-p=1.0 (the default, no limitation). Top-k is likely unset, meaning all 100k tokens are considered.