Completion not using entire function result for answer

My use case is to build an assistant that, based on user input, queries some weather data, analyzes it and provides an answer.

In order to query the data, I defined some functions the model can call (https://platform.openai.com/docs/guides/function-calling), which return a JSON object that is then added to the chat history.
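For reference, this is roughly my setup (a simplified sketch with no error handling; `get_sunshine_data`, `query_weather_api`, and the example question are placeholders for my actual code):

```python
import json
from openai import OpenAI

client = OpenAI()

def query_weather_api(location):
    # Stand-in for the real data source.
    return {"sunshine_duration": [10800.0, 10440.0, 7920.0]}  # 365 values in reality

tools = [{
    "type": "function",
    "function": {
        "name": "get_sunshine_data",
        "description": "Return the daily sunshine duration (seconds) for the last year.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "How much total sunshine did Berlin get last year?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

# Assume the model decided to call the tool.
tool_call = response.choices[0].message.tool_calls[0]
result = query_weather_api(**json.loads(tool_call.function.arguments))

# The JSON result is appended to the chat history as a tool message.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)})

final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```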

This works pretty well, and the model is spot-on in deciding what to call. However, it seems that sometimes it only uses part of the data stored in the JSON.

For example, to answer one question the model needed to sum all values in an array that looks like this (365 values in total):

```
{
  "sunshine_duration": [
    10800.0,
    10440.0,
    7920.000000000001,
    9360.0,
    ...
    6840.0,
    7560.0,
    10440.0,
    8640.0
  ]
}
```

The sum the assistant gives me is wrong.
I know this because I can see what the model is calling and what it analyzes as the function result.

When asked for clarification, it gave me these as the values it used for the sum:

```
[10800,
 10440,
 7920,
 9360,
 ...
 14760,
 7200,
 7920,
 8280]
```

Curiously enough, the first values are OK, but the last ones are not.
I’ve asked it multiple times, and it always gives a different answer for the last values but not for the first ones. My guess is that it is somehow truncating the JSON input when processing it and then making up the rest of the data (I cannot find this sequence of values anywhere in the actual function result).

Does the max_tokens parameter have any control over this? How can I control the size of the input that the model can process?

Maybe I’m not using the function calling/tools capability the way it was designed to be used, but the official documentation does not mention any hard limit on the size of the function response, so I’m not sure.

If I wanted this kind of response, I’d just ask ChatGPT directly :joy:

LLMs suck at math. You need to sum the array before passing the results back to the model.
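Concretely, something like this in the tool handler (a sketch; `fetch_weather_data` is a placeholder for however you query the data):

```python
import json

def handle_sunshine_tool(args):
    data = fetch_weather_data(**args)  # placeholder for your data-fetching code
    values = data["sunshine_duration"]
    # Do the arithmetic in code and hand the model only the result.
    return json.dumps({
        "sunshine_duration_sum": sum(values),
        "days": len(values),
    })
```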

Wasn’t this a problem only in the early iterations of LLMs?
I was under the impression that nowadays asking for the sum of an array with 365 values shouldn’t cause issues, no?

But anyway, I get what you mean; that’s why I was asking whether function/tool calling is designed to be used like this or not.
My idea was to avoid having to aggregate and compute the results myself, since the assistant would be able to do that directly. This would also let the assistant analyze all possible results without first collapsing them into a single value.
Furthermore, coding the aggregation on my side is not trivial: the user can ask many different things, and the operation needed to get the result could be almost anything, which makes it hard to write code that can interact with the LLM.

Given the truncation of the input data, I wasn’t thinking that the aggregation itself was wrong, but rather that the data was not being read correctly.

You should never rely on an LLM to perform math at any scale. That’s why we give it tools. Think of the LLM like a person and the tools like their calculator. This person may be extraordinarily good at doing math in their head, but I doubt even the most capable savant on the planet could accurately sum a vector of hundreds of floating point numbers in their head.

That’s a nice metaphor :slightly_smiling_face:
OK, I see what you’re getting at.
I refactored my tools so that the assistant chooses an aggregation function that is then applied during tool execution (see the sketch below)… This reduces the data stored in the session and seems to work pretty well.
Of course, it would be nice to have something that does the computation on its own… maybe that’s for a new version :+1:
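Roughly, the refactor exposes the aggregation as a tool parameter and applies it in code before returning. A simplified sketch (the names and the set of supported aggregations are my own choices):

```python
import statistics

# In the tool schema, this is offered as an enum, e.g.
# "aggregation": {"type": "string", "enum": ["sum", "mean", "min", "max"]}
AGGREGATIONS = {
    "sum": sum,
    "mean": statistics.mean,
    "min": min,
    "max": max,
}

def handle_sunshine_tool(values, aggregation):
    # The model picks the aggregation when it calls the tool;
    # the actual arithmetic happens here, in code.
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"Unsupported aggregation: {aggregation}")
    return {"aggregation": aggregation, "result": AGGREGATIONS[aggregation](values)}
```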

Food for thought…

The cost of the model preparing an array of hundreds of numbers to send to a second tool is orders of magnitude higher than preemptively computing the sum and sending it along with the initial response. I would run your data and scenario by GPT, ask it all the ways it could analyze the data, and talk through strategies for preparing the data inside the function before handing it back to the model, so it doesn’t have to burn tokens calling a secondary tool.
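For example, the initial function result could include a precomputed summary alongside (or instead of) the raw array, so most follow-up questions never need a second round trip. A sketch (which statistics to include is up to you):

```python
import statistics

def summarize(values):
    # Precompute the aggregates the model is most likely to need
    # and return them with the initial tool response.
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }
```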
