(This can't be done completely with the Assistants API; I show Chat Completions.)
Both gpt-3.5-turbo and gpt-4 of the latest variety have a problem with calling unwanted functions, hallucinating arguments, even when a function would be of no use in fulfilling the user input. This started a few weeks ago.
This is especially brought out when asking for multiple similar answers, but it can happen almost any time, particularly with zero-shot input.
I pondered what could cause this. I suspect it is a softmax issue, where the first four logprobs form a blocking parallelogram in embedding space, or simply that uncertainty about how the output should begin lets the "start a function" token rank high. We are blocked from researching this by OpenAI's limits on exposing the token numbers related to, or within, function output, and by the unseen methods being used.
The solution is to make output to the user extremely distinct from function output, so the AI basically has just two choices for how to begin generating.
This can be something like an enforced, prompted JSON container around user output, but that can impact the quality of the response more than we'd like.
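For illustration, a minimal sketch of that prompted-JSON-container approach; the "reply" wrapper key and the unwrapping step are my own assumptions, not anything OpenAI documents:

import json
from openai import OpenAI

client = OpenAI()

# Instruct the model to wrap all user-facing text in a JSON object, so the
# first generated token is '{' rather than the start of a tool call.
system = (
    'All assistant replies to the user must be a JSON object of the form '
    '{"reply": "<your answer>"} and nothing else.'
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

# Unwrap the container before showing the text to the user.
try:
    text = json.loads(response.choices[0].message.content)["reply"]
except (TypeError, json.JSONDecodeError, KeyError):
    text = response.choices[0].message.content  # model ignored the container
print(text)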
My solution: every assistant response starts with %%%%, which is token 5434. We then multi-shot the AI to show this:
"messages": [
{
"role": "system", "content": """
You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.
Knowledge cutoff: 2023-04
Current date: 2024-02-13
All assistant responses begin with '%%%%'.
""".strip()},
{
"role": "user", "content": "Hello?"
},
{
"role": "assistant", "content": "%%%%Hi! How can I help you today?"
},
{
"role": "user", "content": "What is the capital of France? What is the capital of Germany?"
},
],
"tools": toolspec,
Without this, the same request would have called completely useless parallel tool functions.
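For completeness, here is a self-contained, runnable version of that request with the current openai Python client; the toolspec below is only a stand-in for whatever real tool list you pass, and the last step strips the sentinel before the text reaches the user (a sketch, not the exact code from my tests):

from openai import OpenAI

client = OpenAI()

# Stand-in tool list; substitute your real function definitions.
toolspec = [{
    "type": "function",
    "function": {
        "name": "get_random_int",
        "description": "Return a random integer between min and max.",
        "parameters": {
            "type": "object",
            "properties": {"min": {"type": "integer"}, "max": {"type": "integer"}},
            "required": ["min", "max"],
        },
    },
}]

system = (
    "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.\n"
    "Knowledge cutoff: 2023-04\n"
    "Current date: 2024-02-13\n"
    "All assistant responses begin with '%%%%'."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "Hello?"},
        {"role": "assistant", "content": "%%%%Hi! How can I help you today?"},
        {"role": "user", "content": "What is the capital of France? What is the capital of Germany?"},
    ],
    tools=toolspec,
)

message = response.choices[0].message
if message.tool_calls:
    print("Model still chose a tool:", message.tool_calls[0].function.name)
else:
    print(message.content.removeprefix("%%%%"))  # drop the sentinel before display (Python 3.9+)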
More oddities affecting developers, outside their control.
You would think that how likely the AI is to respond to the user, rather than call a function, would be under our control:
"logit_bias": {
5434: 3, # the token for %%%%
},
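Since logit_bias keys are raw token ids, it is worth confirming the id for %%%% on your side; a quick check with tiktoken, assuming these models use the cl100k_base encoding:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.encode("%%%%"))  # the bias above assumes this is the single token 5434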
But set that bias to 100, against a question where a "random" function would have only mild usefulness, and the AI is still trying to call a function; filling the function arguments with %%%% then gets a 500 server error. Even using gpt-3.5-turbo-0125.
Set it to 20, against the same question (pick from a list), and you still get a function call, even though the AI can no longer write valid arguments:
{
"id": "call_jBq8lmyQ8kFAkehIs54ctYi7",
"type": "function",
"function": {
"name": "get_random_int",
"arguments": "{\"%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
}
}
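Until we get real control, one defensive option is to detect these degenerate tool calls (arguments that are not valid JSON) and retry the request with tools disabled. A sketch; the retry policy is my own workaround, not anything OpenAI recommends:

import json
from openai import OpenAI

client = OpenAI()

def degenerate(tool_call) -> bool:
    """True when the emitted tool-call arguments are not valid JSON."""
    try:
        json.loads(tool_call.function.arguments)
        return False
    except json.JSONDecodeError:
        return True

def chat_without_junk_tools(model: str, messages: list, tools: list):
    """Call Chat Completions; if every returned tool call is degenerate, retry with tools forbidden."""
    response = client.chat.completions.create(model=model, messages=messages, tools=tools)
    message = response.choices[0].message
    if message.tool_calls and all(degenerate(tc) for tc in message.tool_calls):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools, tool_choice="none"
        )
    return response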
OpenAI has something beyond logprobs that is affecting functions. Logprobs are also useless for fixing other types of problems, like those within response_format.
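For reference, the only related signal the API does expose is logprobs on sampled text tokens; as far as I can tell, nothing equivalent is surfaced for the decision to start a tool call. A sketch of what you can still inspect:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": "Pick one number from this list: 3, 7, 9."}],
    logprobs=True,
    top_logprobs=5,
)

# Top candidates for the first generated text token; when the model goes down
# the tool-call path instead, these content logprobs are simply not returned.
choice = response.choices[0]
if choice.logprobs and choice.logprobs.content:
    for candidate in choice.logprobs.content[0].top_logprobs:
        print(f"{candidate.token!r}: {candidate.logprob:.2f}")
else:
    print("No content logprobs returned for this choice.")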
OpenAI: give us the control. An API parameter like tool_bias: 3, because you broke the AI otherwise.