Managing function calls that require clarification

If I understand correctly, the general structure for managing tool-calls is:

import json

from openai import OpenAI

client = OpenAI()

# recursive run handler
def handle_run_requires_action(run):
    if run.status == 'completed':
        # the run finished; print the assistant's latest reply
        messages = client.beta.threads.messages.list(thread_id=run.thread_id)
        latest_message = messages.data[0].content[0].text.value
        print(latest_message)
        return
    if run.status != 'requires_action':
        print(f"Unexpected run status: {run.status}")
        return
    tool_outputs = []
    for tool_call in run.required_action.submit_tool_outputs.tool_calls:
        # handle your tool calls and get the output, using an example:
        output = {'status': 'success', 'message': 'tell the user tool_a is cool'}
        # convert to a JSON string
        output = json.dumps(output)
        tool_outputs.append({"tool_call_id": tool_call.id, "output": output})
    tool_run = client.beta.threads.runs.submit_tool_outputs_and_poll(
        run_id=run.id,
        thread_id=run.thread_id,
        tool_outputs=tool_outputs,
    )
    handle_run_requires_action(tool_run)

# initial run
run = client.beta.threads.runs.create_and_poll(
    # include thread, assistant, etc...
    additional_messages=[
        {'role': 'user', 'content': user_message}
    ],
    tools=[tool_a, tool_b]
)
handle_run_requires_action(run)

This works fine, as long as your tools are all-or-nothing success/failure. But while experimenting, I came up with some questions:

My first question is: can an OpenAI rep provide any guidance on the format of output? I've tried giving it various parameters (a complete example follows the list):

  • status: either 'success' or 'failure'. Strangely, when this is 'failure' and multiple tools are available, the assistant will sometimes try calling another tool with similar parameters.
  • message: text that is interpreted by the assistant and shown to the user. This has the most influence on the assistant. Additional instructions to the assistant can be provided here and it'll listen… sometimes.
  • error: a more technical error message targeted toward the assistant rather than the user. Doesn't seem to do much.
  • additional_instructions: seems to be very effective. Sometimes this is the only way to get the assistant to give up and not retry a tool with 'failure' status.
  • call_tools: true/false. Only seems to work in edge cases, providing the little extra push needed to not retry a failed tool call.
  • data: dictionary with various information related to a message. Usually seems to be taken into consideration by the assistant.
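
For reference, a complete output using those keys looks roughly like this (dropped into the tool_outputs loop from the handler above). None of these fields are documented anywhere I can find; the API only requires that output be a string, so the structure inside is entirely my own convention:

output = json.dumps({
    "status": "failure",
    "message": "No item with id: [apple] was found. Ask the user to pick a specific apple.",
    "error": "lookup failed: 'apple' not present in the nutrition database",
    "additional_instructions": "Report the error to the user and wait for the user's next prompt.",
    "call_tools": False,
    "data": {"similar_items": ["fuji_apple", "gala_apple"]},
})
tool_outputs.append({"tool_call_id": tool_call.id, "output": output})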

My second question is: are we expected to perform fine tuning for function calling to get reasonably accurate performance, or should we be able to handle it with properly formatted tool outputs in most cases?

The final question: How can we get clarification for a tool while the Run has the status requires_action? Frequently it is desirable to have a tool that isn't strictly pass or fail. For example, suppose you have a tool for querying a nutrition database and the user asks for the nutrition values for 'apples'. The database contains specific kinds of apples like 'fuji apples' and 'gala apples'. The tool detects these similar items and wants to ask the user if they meant one of them, rather than tell the user that 'apples' was not found. But you can't add any messages to the thread or start a new run on the thread until the current run has had submit_tool_outputs called.

I made a version that uses a chat completion, supplied with the necessary parameters, to ask the user for clarification and get new arguments for the function (sketched below). While this gives me successful output data that I can use to call submit_tool_outputs, some context is lost in the thread, because the thread is unaware of the clarification question and the user's response. This can lead to wrong-sounding responses like "Here is the nutrition data for apples" even though it should say "Here is the nutrition data for fuji apples". Is it possible to add the messages of the chat completion to the thread before it formats its response from submit_tool_outputs? Or is there a completely different way that I should be handling this?
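
Here is the rough shape of that workaround. ask_user is a placeholder for however my app actually collects the user's reply, and the prompts are heavily simplified:

def clarify_tool_arguments(tool_call, problem_description):
    # Side-channel chat completion formulates a clarifying question for the user
    question = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Ask the user one short clarifying question."},
            {"role": "user", "content": problem_description},
        ],
    ).choices[0].message.content

    user_reply = ask_user(question)  # outside the thread -- this exchange is what gets lost

    # Second completion turns the user's reply into corrected JSON arguments for the tool
    fixed_args = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Return corrected JSON arguments for the tool call."},
            {"role": "user", "content": (
                f"Original arguments: {tool_call.function.arguments}\n"
                f"Problem: {problem_description}\n"
                f"User said: {user_reply}"
            )},
        ],
        response_format={"type": "json_object"},
    ).choices[0].message.content
    return json.loads(fixed_args)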

This is a community forum; you are more likely to get feedback from one of its members.

You can do anything you like in an answer, but you should play to LLM strengths.

Your ā€œanswerā€ should provide the information the LLM is expecting to get from the definition of the function. Donā€™t overcomplicate it. You can feedback errors.

Not in my experience; in fact, it should be a great alternative to fine-tuning, as you can use it for RAG instead.

Just keep your tool count to a reasonable number and lower your temperature.
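
For example, something like this on run creation (assuming the per-run temperature parameter, which I believe you can also set on the assistant itself):

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
    temperature=0.2,   # lower temperature -> more deterministic tool selection
    tools=[tool_a, tool_b],
)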

This is a really interesting problem.

Can you return this as an error condition and request the LLM to ask the user to clarify which apples they mean?

Once the user responds, the LLM might then re-submit the function call with the additional detail.

Another approach here might be to maintain some kind of state in the system prompt.

I personally don't use the Assistants API because I prefer to build everything from scratch, which gives you more control.

Well, we don't exactly have a hotline to call, so our best bet for clarification is to group up and make a collective stink here.

For very simple queries, yes. But it's error-prone: as stated in the first question, there's no documented way to force a failure response. Also, if the user had to re-enter a lot of parameters, it would be an inconvenience. More importantly, suppose the user prompts: "give me the nutrition data for oranges and grapes". Now you're in a parallel tool call situation. Tool calls must have all outputs submitted at the same time, and regardless of how many tools are called, you cannot perform any operations on a Thread while the Run is in a non-terminal state. Besides my completion solution, I could also make it work by canceling the run, rebuilding it with the new messages, and using another completion to formulate the response to the aggregate success outputs (roughly sketched below). But that seems even more hacky and error-prone. Surely there is something better.
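
For reference, the cancel-and-rebuild fallback looks roughly like this; clarification_question and user_reply stand in for whatever the side-channel completion produced, and this is a sketch, not something I'd want to ship:

# Abandon the stuck run; note that cancellation is asynchronous, so in practice
# you may need to poll until the run actually reaches a 'cancelled' status.
client.beta.threads.runs.cancel(thread_id=run.thread_id, run_id=run.id)

# Inject the out-of-band clarification exchange back into the thread so the
# Assistant at least sees it, then start a fresh run.
client.beta.threads.messages.create(
    thread_id=run.thread_id, role="assistant", content=clarification_question)
client.beta.threads.messages.create(
    thread_id=run.thread_id, role="user", content=user_reply)

new_run = client.beta.threads.runs.create_and_poll(
    thread_id=run.thread_id,
    assistant_id=assistant_id,
    tools=[tool_a, tool_b],
)
handle_run_requires_action(new_run)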


Haha, completely agree :slight_smile:

Yes, that does sound hacky.

The parallel situation does sound very tricky to handle, and it does highlight that you want to keep the solution as simple and as "in framework" as possible.

One thought:

In the case of the user not providing enough information to identify a specific apple, how about you provide all apples in the response from the tool?

Then the LLM can respond to the user with a list of all the apples it knows about and their corresponding values (see the sketch below).
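
Something along these lines for the tool output; nutrition_db is just a stand-in for whatever lookup your tool does:

# Instead of failing on 'apples', return every close match and let the model
# present the options (or summarize all of them) to the user.
output = json.dumps({
    "status": "success",
    "message": "No exact match for 'apples'; returning all apple varieties found.",
    "data": {
        "fuji_apple": nutrition_db["fuji_apple"],   # hypothetical lookup
        "gala_apple": nutrition_db["gala_apple"],
    },
})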

I tried this example on my own chatbot, the instance that sits on a Star Citizen fansite.

It searched the forum and pulled back the ship stats, the latter of which fits conveniently into a single context window, but wouldn't have to if I needed to add a filtering capability.

But it proves that a good approach is simply to return multiple results and let the LLM handle it from there.

I'm very curious to hear other members' responses.


Unfortunately, this won't scale, and it won't work at all for this type of collision in a tool that performs some kind of write operation.

I think that the guiding principle here has to be: is it following commands the way a human would? These are called Assistants after all - the entire concept we're being sold is that these massively scaled up language models can be made to perform tasks with natural language. Take the example of a restaurant. If I order a meal that is no longer on the menu…

AI: :robot: Sorry, the menu item you requested isn't available, please try again. I'm always here to be your helpful waitress.

Human: I'm sorry sweetie, but they took it off the menu last month. Wait here a second, I'll go ask the chef if we can still make it anyway since it's your birthday~ :slightly_smiling_face:

There's a huge gulf between generating a valid response and delivering on the actual promise that is being made to the end user. These things have to be helpful, not babysat. If the user has to carefully plan their words, or constantly re-state commands, or sift through mountains of data, this will go nowhere and die off very quickly.


If the user does not specify the exact apple, I'm sure you can detect the lack of precision and send back an error saying the question was not precise enough and asking it to retry with a specific apple?

Sure, that's Question #1. You can make it happen, but making it happen reliably is the question. Here is an example:

model: gpt-4o-mini
query: 'Add 30lb of apples to inventory.'
tools: [tool_add_item_to_inventory, tool_create_new_inventory_item]

Tool Call: tool_add_item_to_inventory
{'item_name': 'apple', 'quantity': 30}

Output: tool_add_item_to_inventory
{
  "status": "failure",
  "message": "No item with id: [apple] was found, but there are similar items: [gala_apple] and [fuji apple]. Please try again with the specific item, or say that you want to create a new item.",
  "additional_instructions": "Do not continue calling tools. Report the error to the user."
}
Result:
The Run returns to 'requires_action' status and the Assistant ends up calling tool_create_new_inventory_item on the next loop:
Tool Call: tool_create_new_inventory_item
{'item_name': 'apple', 'quantity': 30}
Output from tool: tool_create_new_inventory_item
{
  "status": "failure",
  "message": "The item [apple] could not be added to the database. Instruct the user to report the error to Joe."
}
Result: Assistant finally gives up and responds:
I encountered an issue while trying to add 30 lb of apples to the inventory. It seems that the item "apple" could not be added. Please report this error to Joe for further assistance. 

Despite the output clearly indicating a failure, instructing it NOT to continue calling tools, and telling it to prompt the user, the Assistant still does the exact opposite and calls a completely different tool before finally giving up (perhaps because it exhausted all options), and it only reports the error from the second tool, even though the error from the first tool is what the user should have been shown.

A very slight adjustment to the output, however, produces the desired outcome:

model: gpt-4o-mini
query: 'Add 30lb of apples to inventory.'
tools: [tool_add_item_to_inventory, tool_create_new_inventory_item]

Tool Call: tool_add_item_to_inventory
{'item_name': 'apple', 'quantity': 30}
Output: tool_add_item_to_inventory
{
  "status": "failure",
  "message": "No item with id: [apple] was found, but there are similar items: [gala_apple] and [fuji apple]. Please try again with the specific item, or say that you want to create a new item.",
  "additional_instructions": "Report the error to the user and wait for the users next prompt."
}
Result: The Assistant correctly fails out and asks for clarification:
It seems that there is no item listed as "apple" in the inventory. However, there are similar items available: "gala apple" and "fuji apple." Would you like to add 30 lbs of one of these specific types of apples, or would you prefer to create a new item for "apple"?

I've been trying dozens of formats, ways of wording responses, error instructions, etc. This second method, with a very specific phrase in additional_instructions, appears to work consistently across a few test runs, but does it really? There's no way to know. And even if it does, it's very likely that future models, or even updates to the same model, may break this "magic phrase", which is why I want a real solution. This isn't a real solution; it's a workaround. It is "good enough to get me through this JIRA ticket and collect my paycheck this week". But I'm not working for a paycheck. I'm writing this software for my own business. I care about long-term stability, and I don't get paid to solve the same problem multiple times. If OpenAI expects developers to write real-world customer-facing applications with their API, there must be certainty that it will work reliably. This isn't a complicated problem; it's unthinkable to me that there would be no way at all to ensure a tool fails. But it probably requires the input of one of their engineers. Us discussing workarounds serves no purpose.
