I’m using the GPT-4.1 model to generate responses based on retrieved data (40-50 rows in CSV format). The data is given as context (~5,000 input tokens) to the model. However, when I set a lower temperature, generation takes too long (sometimes up to 10 minutes).
Why would sending sampling parameters lead to longer generation times?
The raw product of AI inference is a map from the model’s token IDs to logits: scores for how certain the model is about each possible next token.
These are converted to probabilities and handed to a ranked multinomial sampler, which draws the next token at random, weighted toward the most likely candidates.
Temperature modifies these log-probabilities by dividing each one by the temperature parameter, redistributing the values: an extra operation compared with using the model’s native output. With a 200k-token vocabulary and a model generating 100 tokens per second, you can anticipate roughly 20 million additional divide operations per second.
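For a sense of what that looks like, here is a minimal sketch in Python/NumPy of temperature scaling followed by multinomial sampling; the function and values are illustrative assumptions, not the service’s actual internals:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Pick a next-token ID from raw logits, applying temperature scaling."""
    # Temperature divides every logit before the softmax: one extra
    # elementwise operation across the whole vocabulary per generated token.
    scaled = logits / temperature
    # Softmax turns the scaled logits into a probability distribution.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Multinomial draw: random, but weighted toward high-likelihood tokens.
    return int(np.random.choice(len(probs), p=probs))

# Illustrative only: a 200k-entry vocabulary, as in the estimate above.
logits = np.random.randn(200_000).astype(np.float32)
next_token = sample_with_temperature(logits, temperature=0.2)
```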
Even so, temperature should be an effect that is only detectable when benchmarking, not something noticeable at the scale of minutes. You might be misattributing causation.
If your goal is to constrain toward quality, you might use the top_p parameter instead: rather than redistributing values, it truncates the distribution at a cumulative probability-mass cutoff, essentially eliminating poor tokens.
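As a rough sketch of what that cutoff does conceptually (again hypothetical code, not the actual implementation behind the API):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float = 0.9) -> np.ndarray:
    """Keep only the smallest set of tokens whose cumulative probability reaches top_p."""
    order = np.argsort(probs)[::-1]                       # most to least likely
    cumulative = np.cumsum(probs[order])                  # running probability mass
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1  # tokens needed to reach top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]                          # poor tokens are zeroed out
    return filtered / filtered.sum()                      # renormalize the survivors
```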
If an output seems incomplete, that can simply be the model having predicted and selected its “stop” sequence. In that case, instead of seeing the stop sequence itself (for example a vertical pipe and linefeed) in the text, you receive a finish_reason of stop. This behavior is something you can affect through prompt and model choice; the sampling parameters can increase or reduce the variance between responses to identical input.
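If you want to check whether a short response ended naturally or hit the token limit, a sketch along these lines, assuming the OpenAI Python SDK and a gpt-4.1 model name (adjust both to your setup), shows where to look:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",   # assumption: substitute the exact model you are using
    messages=[
        {"role": "system", "content": "Answer using only the provided CSV data."},
        {"role": "user", "content": "CSV context here...\n\nQuestion here..."},
    ],
    top_p=0.5,         # constrain by probability mass instead of lowering temperature
    max_tokens=1024,
)

choice = response.choices[0]
print(choice.finish_reason)   # "stop" = model chose to end; "length" = hit max_tokens
print(choice.message.content)
```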