N responses vs multiple generations

Can anyone tell me how the n parameter generates two different responses to the same prompt? Is it the same generation process you would get from just running the query twice? I understand there are token savings, but does it also favour diverse responses? I ask because I want to compare responses to the same prompt for a validation task. I have also tried doing multiple generations at different temperatures, but I like the token savings on the input prompt with n responses; I am just worried that n responses will lack diversity in comparison.


That is an interesting point about token savings, but does it actually save tokens somehow?

To your question: as I understand it, it is the same as running the same prompt N times.

It saves tokens on input, as you are only charged the input tokens once. So if your prompt is 2k tokens, it can add up.
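To make that concrete, here is a back-of-the-envelope comparison using made-up token counts (the 2k prompt and 500-token completions are illustrative assumptions, not anything from the API):

```python
# Rough cost comparison: with n=2 the input prompt is charged once,
# while two separate calls charge it twice. Token counts are hypothetical.
prompt_tokens = 2000   # assumed prompt size
output_tokens = 500    # assumed length of each completion
n = 2

tokens_with_n = prompt_tokens + n * output_tokens      # one request with n=2
tokens_separate = n * (prompt_tokens + output_tokens)  # two separate requests

print(tokens_with_n)                     # 3000
print(tokens_separate)                   # 5000
print(tokens_separate - tokens_with_n)   # 2000 input tokens saved
```

The output tokens cost the same either way; only the repeated input prompt is saved.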

But I guess I wonder if the seed or other internal parameters are the same for each of the n responses.


Cool, never thought of that, thank you :slight_smile:

The only thing that might affect the output of n>1 vs multiple calls is the unknown source of non-determinism in GPT-3.5+ models. For example, you might land on an instance that has more unchecked GPU calculation errors.

GPT-3 models like text-davinci-003 do not suffer from this symptom, and the only source of randomness is that which is intentional in sampling.

Cool, thanks. I thought as much. But I assume that in that case both n responses would “suffer” the same issue.

If I understand correctly: the raw response from an LLM is a set of probabilities over tokens, and n responses lets you see a larger sampling of these probabilities (instead of just one, like we are used to).

They are separate response generations. Consider two trials that start this way:

  • Sure
  • Sorry

The path of inference will then follow two completely different directions, because each token is generated based on all that came before.
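The divergence above can be sketched with a toy sampler. The probabilities and continuations below are entirely made up for illustration; the point is only that each draw is independent, and once the first tokens differ, everything conditioned on them differs too:

```python
import random

def next_token_probs(prefix):
    """Toy next-token distribution conditioned on the prefix.
    The numbers are invented, not from any real model."""
    if not prefix:
        return {"Sure": 0.6, "Sorry": 0.4}
    if prefix[-1] == "Sure":
        return {", here it is.": 1.0}
    return {", I can't help with that.": 1.0}

def sample_response(rng):
    """Sample one completion token-by-token, each token conditioned
    on everything generated so far."""
    prefix = []
    while True:
        probs = next_token_probs(prefix)
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        prefix.append(token)
        if token.endswith("."):
            return "".join(prefix)

rng = random.Random()
# Two independent samples from the same "prompt": they may start with
# different first tokens and then follow completely different paths.
print(sample_response(rng))
print(sample_response(rng))
```

Whether you get the two samples from one request with n=2 or from two separate requests, each one is drawn independently in this way.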


[Image: actual logprobs for these first tokens, for the curious]


@mark_humphries curious about the same exact thing. What did you end up doing and why? If you ended up doing n>1 in the same input prompt, did the responses give you the variability you were looking for?