Best `Temperature` for Code Generation

I’m working on an agent tool that generates and/or refactors source code, and of course I’m noticing I get varying results for fairly simple requests like “Add a Date Field” to a class or whatever.

Do people have opinions on what’s the best temperature for code generation? I’m thinking it should be fairly low, but nonzero? I’d appreciate any thoughts people might have.


t = 0
top_p = 0


I’m thinking that these parameters might be more of a legacy thing. Back with the weaker models, you’d have to jiggle them so they didn’t fall into a repetitive rut. That doesn’t seem as important anymore for modern, massive models. I’d just consider them a source of errors. You can deliberately introduce errors, but for coding? :thinking: :laughing:

1 Like

The way I understand what LLMs are doing, there’s always one single “next word” (token, really) being generated at a time. To determine that next token, every token in the vocabulary gets some probability, so there’s actually a full rank-ordering of the entire vocabulary in terms of its likelihood to be the “next word”. The “temperature” is sort of letting the system dig down randomly into the lower-ranked, less-probable options for the next token.
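As a toy sketch of that rank-ordering (made-up logits, not from any real model), temperature just rescales the raw scores before the softmax: low temperature piles probability onto the top-ranked token, and temperature 0 degenerates into always picking it:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into next-token probabilities.
    Low temperature sharpens toward the top-ranked token;
    temperature 0 is greedy (all mass on the argmax)."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]  # hypothetical scores for "bed", "desk", "floor"
for t in (1.0, 0.5, 0.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lowering the temperature never changes the ranking, only how much of the probability mass the winner gets.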

So my thinking is that in coding you want the MOST likely thing almost always, rather than randomly sub-optimal stuff, but I guess there can sometimes be edge cases where the “most likely words” will nonetheless branch off into undesired results.

I dunno. Digging down implies intent. I see it as randomly forcing the model to make a suboptimal choice.

Hmm. An analogy that comes to mind is this:

  1. a deterministic floating-point conversion algorithm will nonetheless introduce undesired rounding errors
  2. to combat this, we inject random noise into the LSBs.

I suppose that could be a valid approach. But I’m thinking that the injected noise will always be a greater error than the rounding error introduced by the limitation of the system.

My feeling is that when you design a system with an LLM in the middle, you’re operating within certain error boundaries. If you introduce random noise, you need to broaden those boundaries so your operation doesn’t break down.

So there’s a certain engineering cost associated with this.

And I’m not super clear on the benefit - the hope that when some undesirable result manifests itself, there’s a 0.000x probability that a random number generator picks a better token than the model would have? Hmmm… I’m not sold on that, tbh :laughing:

Take a sentence like "The cat jumped up on the..." and let an LLM generate the next word. There’s going to be a statistical winner in the top position. Maybe it’s “bed”, maybe it’s “desk”, but neither is “correct”. They’re both good words.

The same sort of thing applies even to software code. Each word that’s not the “statistical winner” isn’t necessarily “wrong”, just “different”. So “sub-optimal” really isn’t the right word, when not even humans would be able to agree on a definition of “optimal”.

1 Like

I’m just thinking that if you’re ever in such a situation, then your problem is probably underspecified :thinking:

search   "The cat jumped up on the ..."
retrieve "Whiskers the cat liked to jump on the couch"
gen      "The cat jumped up on the couch"


Now if you add more pertinent context information and regenerate, that top probability should go up.
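One toy way to picture that (purely illustrative numbers and a hypothetical `renormalize` helper, not how attention actually works): treat the retrieved context as boosting the scores of the continuations it supports, then renormalize. The supported word jumps to the top and its probability rises:

```python
def renormalize(dist, supported, boost=8.0):
    """Toy model of adding context: multiply the scores of continuations
    the retrieved text supports, then renormalize to probabilities."""
    scores = {w: p * (boost if w in supported else 1.0) for w, p in dist.items()}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

# Without context: "couch" is buried in the ranking
before = {"bed": 0.45, "desk": 0.40, "couch": 0.10, "moon": 0.05}
# Retrieved: "Whiskers the cat liked to jump on the couch"
after = renormalize(before, {"couch"})
print(max(before, key=before.get))  # "bed"
print(max(after, key=after.get))    # "couch"
print(round(after["couch"], 2))
```

With the context in place, greedy decoding already lands on the grounded answer, so there’s nothing left for temperature to rescue.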

Obviously it’s up to your architecture, and possibly a matter of taste. IMO, code generation should be a deterministic thing. If it fails, it should be reevaluated on the design level. That’s why I’ve chosen zeroes for the past couple of months, and I haven’t had a reason to go back on this yet. :cowboy_hat_face:

1 Like

If the task is to execute “Add a Date Field” to some data structure, then there should be almost zero creativity involved, as the resulting code is pre-determined by the project requirements, the code quality measures in place, and of course the existing data structure/code base.
I don’t see any potential gains from adding creativity/randomness in this specific example.

1 Like

What I saw happening in my testing, which did happen to involve adding a date field, was that sometimes it would add one Date field object type and sometimes a different one. This happened to be Java, and there are numerous different Date objects to choose from (java.util.Date, java.sql.Date, java.time.LocalDate, …), and the “most popular” will tend to be what you get at lower temperatures. There’s no right answer without providing further constraints and info in the prompt, of course.

Anyway, everything you guys said in this thread confirmed what I thought was best, which is low (or zero) temperature when doing code generation. Thanks for your input.


I have a similar problem to solve with fine-tuning, and the seed seems like the most important parameter for determinism. The repeatability of the output has been discussed many times here, and settings of T = 0.000000000001 and top_p = 0.000000000001 were also suggested as the most deterministic.
What I have found is that a fixed seed and very low (or 0) temperature work deterministically when the model is pretty sure what the answer is. If you throw it something the model doesn’t understand, or where it wavers between many answers, the determinism goes in the sewer, no matter the seed and temperature settings. I have of course included top_p in the tests, and it doesn’t help in these cases.
The most important thing, apparently, is to have enough examples to show the model(s) how to answer for every possible case. I dunno exactly how you bring knowledge into your agent - if at all - but this is my case with fine-tuning.