Is there a single sampling method used during inference, or is there logic to choose different sampling methods based on a given input?

Hi community,

I was looking into the sampling techniques that leading LLM-powered products (ChatGPT, Mistral, Claude) might be using.

To give some context: the simplest way to sample a token from the list of output probabilities is to pick the token with the highest probability, which has its limitations (non-creative, predictable).

So to introduce some creativity into the responses, people started using ‘stochastic sampling methods’ like top-k and top-p, which add randomness.

My question is: in cases where the output needs to be deterministic, for example a puzzle or a simple mathematical expression, sampling with stochastic methods feels counter-intuitive. How do AI engineers/researchers/companies handle this?

How do they make sure ‘stochastic sampling methods’ are reliable enough to produce deterministic output?

Is there a single sampling method used during inference, or is there logic to choose different sampling methods based on a given input (one that requires deterministic or non-deterministic output)?

Thanks!


The whole point is NOT to have deterministic or top token results.

The use of unexpected words makes the response more passable as human-like.

ChatGPT (the web chatbot) doesn’t let you access any parameters. Diversity is also the point there: something for you to dislike if you don’t enjoy the production, something to upvote if it was favorable, in order to gather training data.

The OpenAI API exposes two sampling parameters: top_p (nucleus sampling), which constrains the token dictionary, and temperature (logprob scaling). These let you skew the results toward the most certain tokens or, at very small values, essentially always output the top token (greedy sampling). The opposite extreme can veer towards craziness.
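To make the interaction between the two parameters concrete, here is a minimal Python sketch of temperature scaling followed by nucleus (top_p) filtering. This is an illustration of the general technique, not any vendor's actual implementation; the function name and signature are made up for the example.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index from raw logits using temperature and nucleus (top_p) filtering."""
    # Temperature scaling: small values sharpen the distribution toward the top token,
    # large values flatten it toward uniform randomness.
    scaled = [l / max(temperature, 1e-8) for l in logits]
    # Softmax (subtracting the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest set of tokens whose cumulative probability
    # reaches top_p, starting from the most probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalise over the kept tokens and draw one.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With both parameters pushed toward zero, the kept set collapses to a single token and the draw becomes greedy, which matches the behaviour described above.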

There is also a seed parameter, which repeats the same sampling randomness so that a diverse output can be reproduced.
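The effect of a seed can be sketched in plain Python: the sampler still draws "randomly", but seeding the random number generator makes the draw sequence repeatable. The helper names here are invented for the illustration.

```python
import random

def draw(probs, rng):
    """Draw one index from a probability list using the given RNG."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

def sample_sequence(probs, seed, n=5):
    """Generate n stochastic draws; identical seeds yield identical sequences."""
    rng = random.Random(seed)
    return [draw(probs, rng) for _ in range(n)]
```

Two calls with the same seed reproduce the same "diverse" output, while different seeds generally diverge.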

temperature and top_p = 0.00001 results:

AI engineers and researchers handle the challenge of balancing creativity and determinism in language models by carefully choosing and tuning the sampling methods based on the specific task at hand.

For tasks that require deterministic outputs, such as solving a puzzle or a mathematical expression, they might use a more deterministic sampling method like “greedy decoding” or “beam search”. Greedy decoding always chooses the most probable next token, while beam search expands the most promising nodes in a tree of possibilities to maintain a number of alternative sequences at each step.
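The contrast between greedy decoding and beam search can be shown with a toy prefix-dependent model (the transition table below is invented purely for the example; a real model's log-probs come from a neural network conditioned on the full prefix):

```python
def next_logprobs(prefix):
    # Toy model over a vocabulary {0, 1}: next-token log-probs depend on the prefix.
    table = {
        (): [-0.3, -0.4],    # greedy prefers token 0 here...
        (0,): [-2.0, -2.5],  # ...but token 0 leads only to low-probability continuations,
        (1,): [-0.1, -3.0],  # while token 1 leads to a high-probability one.
    }
    return table[tuple(prefix)]

def greedy_decode(steps=2):
    """At each step, take the single most probable next token."""
    seq = []
    for _ in range(steps):
        lp = next_logprobs(seq)
        seq.append(max(range(len(lp)), key=lambda i: lp[i]))
    return seq

def beam_search(steps=2, beam_width=2):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            lp = next_logprobs(seq)
            candidates += [(seq + [i], score + lp[i]) for i in range(len(lp))]
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]
```

On this toy model, greedy decoding commits to token 0 and ends with total log-prob -2.3, while beam search keeps the initially second-best prefix and finds the sequence [1, 0] with log-prob -0.5.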

For tasks that require more creativity, they might use stochastic sampling methods like Top-K or Top-p (also known as nucleus sampling). These methods introduce randomness into the sampling process, which can lead to more diverse and creative outputs.
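Top-K filtering is simple to sketch as well: truncate the distribution to the k most probable tokens and renormalise before drawing. Again, this is an illustrative helper, not a library function.

```python
def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens and renormalise the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in order)
    return [probs[i] / mass if i in order else 0.0 for i in range(len(probs))]
```

Sampling then proceeds from the filtered distribution; with k = 1 this reduces to greedy decoding.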

In practice, the choice of sampling method can be dynamic and context-dependent. For example, a language model might use a more deterministic method for one part of a task and a more stochastic method for another part. This could be achieved by using a mixture of different sampling methods, or by dynamically adjusting the parameters of a single method.

The reliability of stochastic sampling methods can be improved by using techniques like temperature scaling (which controls the randomness of the sampling process) and by training the model with a large amount of high-quality data.

In summary,

top_p = 0.995, temperature = 1.5 results:

Both stochastic and deterministic sampling methods are valid and applicable for different use cases during the process of natural language generation. It indeed is challenging to ensure stochastic methods being more reliable for generating deterministic outputs.

For important or predictable outputs (e.g. following up on Math or Puzzle-related topics), the usage of beam search or optimized temperature at the token-selection level has been experimented by researchers in the industry. In most significant cases like a QA system or a chatbot working upon a developer or assembly code prediction, greedy search (picking up the highest probability or maximum likelihood) shows the best results and thus used the most. The reasons also cite better validation metrics for controlled, safer, and correct completions.

In contrast, Stochastic Decoding produces effectively satisfactory as well creative textual outcomes. As mentioned, previously these practices usually involve techniques - Top-K (randomly selecting next term from Top K potential possibilities) and Nucleus or Top-P sampling (includes all minimum highest-probability predictions jointly maintaining below selected threshold P).

Where exactly a form would be apt over others rests at ambiguity deriving relationship of Text from Latent Space more than others while preserving contextual clarity including slang/brands/terms and variety hinting relations at different position levels. There exists no rigorous way to alternate among these samplings
