Function calling temperature

We have automated tests that check whether we implemented our function calling properly. For general context, the tests look something like this:

await testToolPrompt(['Hours on job 12345'], {
  name: 'my_tool',
  arguments: {
    jobId: 12345,
    isHoursRequested: true,
  },
});

The test calls the completion function hundreds of times to make sure the AI returns the proper arguments (the second argument is what we expect). From the success/failure stats we can tell whether a function definition is off and requires additional changes.
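For the curious, here is a minimal sketch of what such a harness could look like, assuming the OpenAI Node SDK and the `tools` API; `testToolPrompt` itself, the model name, and the naive JSON-based argument comparison are all our own invention, and the tool schema is reduced to the two fields above:

import OpenAI from 'openai';

const client = new OpenAI();

// Hypothetical harness: fires the same prompt `runs` times and returns the
// fraction of responses whose tool call exactly matches the expected one.
async function testToolPrompt(
  prompts: string[],
  expected: { name: string; arguments: Record<string, unknown> },
  runs = 100,
  temperature = 1,
): Promise<number> {
  let pass = 0;
  for (let i = 0; i < runs; i++) {
    const res = await client.chat.completions.create({
      model: 'gpt-4o-mini', // assumption: any tool-capable model works here
      temperature,
      messages: prompts.map((content) => ({ role: 'user' as const, content })),
      tools: [{
        type: 'function',
        function: {
          name: 'my_tool',
          parameters: {
            type: 'object',
            properties: {
              jobId: { type: 'integer' },
              isHoursRequested: { type: 'boolean' },
            },
            required: ['jobId'],
          },
        },
      }],
    });
    const call = res.choices[0].message.tool_calls?.[0];
    // Naive deep-equal via a JSON round-trip; fine for flat argument
    // objects as long as key order matches.
    if (
      call?.type === 'function' &&
      call.function.name === expected.name &&
      JSON.stringify(JSON.parse(call.function.arguments)) ===
        JSON.stringify(expected.arguments)
    ) {
      pass++;
    }
  }
  return pass / runs;
}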

I’ve played with temperature and, surprisingly to me, it has a significant effect on the quality of function-calling responses. I naively assumed function calling isn’t just text generation and should be precise at any given temperature. At temperature 0 we get all the arguments and values we expect, but at temperature 2 it generates random arguments that make no sense at all. With temperatures between 0 and 2 we get middling, but still low-quality, results.
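Measuring this comes down to sweeping the temperature with the same harness (reusing the hypothetical `testToolPrompt` sketch above):

// Rough sweep over temperatures, printing the exact-match rate for each.
for (const t of [0, 0.5, 1, 1.5, 2]) {
  const rate = await testToolPrompt(
    ['Hours on job 12345'],
    { name: 'my_tool', arguments: { jobId: 12345, isHoursRequested: true } },
    100,
    t,
  );
  console.log(`temperature ${t}: ${(rate * 100).toFixed(0)}% exact matches`);
}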

We want to keep the ability to set temperature for general text generation, but we also expect our functions to receive precise arguments. Is there a workaround that lets us keep using temperature without it messing up our function calling? Thank you!

The AI generates tokens, whether it’s the text of squirrel poems or the invocation of function parameters, and those tokens are still sampled from temperature-scaled logits, which allows less-likely token options to be chosen unless temperature is cranked down to something like 0.01 (a 100x spread of the logprobs).
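To put numbers on that, here is a toy softmax-over-scaled-logits calculation (the logits are made up, not real model output); note how a distractor token that is ~100x less likely at temperature 1 grabs real probability mass at temperature 2:

// Softmax over temperature-scaled logits: p_i = exp(l_i / T) / sum_j exp(l_j / T)
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Two candidate tokens with a 4.6-logit gap (~100x likelihood ratio at T=1).
const logits = [0, -4.6];
for (const t of [0.01, 1, 2]) {
  console.log(`T=${t}:`, softmax(logits, t).map((p) => p.toFixed(4)).join(' vs '));
}
// T=0.01 -> 1.0000 vs 0.0000  (the likelier token takes all the mass)
// T=1    -> 0.9900 vs 0.0100
// T=2    -> 0.9089 vs 0.0911  (the distractor is now ~9x more likely)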

I’ve started (and, for my own use, finished) exactly such an idea. It would be improved if it understood the information domain of the kinds of requests that are meant to invoke functions.

One might resubmit at low temperature if a function was invoked - but note that the very choice of calling a function instead of producing content text is itself decided by the probability of the first output token.
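As a sketch of that resubmit idea (same hypothetical client as above; `completeWithPreciseToolArgs` is an invented name): let the creative-temperature pass decide whether to call a function, then redo only the arguments at temperature 0 by forcing the same tool with `tool_choice`:

// First pass at the caller's temperature; if it chose a tool, resubmit at
// T=0 with tool_choice pinned so only the arguments are regenerated.
async function completeWithPreciseToolArgs(
  params: OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming,
) {
  const first = await client.chat.completions.create(params);
  const call = first.choices[0].message.tool_calls?.[0];
  if (!call || call.type !== 'function') {
    return first; // plain text: keep the creative-temperature answer as-is
  }
  return client.chat.completions.create({
    ...params,
    temperature: 0,
    tool_choice: { type: 'function', function: { name: call.function.name } },
  });
}

Pinning `tool_choice` sidesteps the first-token caveat: the decision to call the function is kept from the original pass, and only the argument tokens are regenerated deterministically.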