Assistant max_completion_tokens not working as expected

OpenAI recently added the max_prompt_tokens and max_completion_tokens parameters to thread runs.

In my scenario, the assistant supports multiple clients such as Web, Mobile App, and SMS.

When creating the thread run, I tried to set max_completion_tokens for SMS clients in order to keep the message length within SMS limits.

const threadRun = await this.openAI.beta.threads.runs.create(
  threadId,
  {
    assistant_id: assistantIdToUse,
    tools: [
      { type: 'file_search' }
    ],
    additional_instructions: additionalInstructions,
    max_completion_tokens: 300
  }
);

What happens is that if the generated message exceeds the maximum completion tokens, it simply ends the thread run with an incomplete status (as stated in the docs):

{
    "reason": "max_completion_tokens"
}
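For reference, here is a minimal sketch of how that shows up when polling the run, assuming the same Node openai SDK version as the snippet above and the threadId / threadRun from it (the polling loop and names are just illustrative):

// Poll the run until it leaves the queued/in_progress states.
let run = await this.openAI.beta.threads.runs.retrieve(threadId, threadRun.id);
while (run.status === 'queued' || run.status === 'in_progress') {
  await new Promise((resolve) => setTimeout(resolve, 1000));
  run = await this.openAI.beta.threads.runs.retrieve(threadId, threadRun.id);
}

if (run.status === 'incomplete') {
  // The run was cut off; incomplete_details.reason is 'max_completion_tokens' here.
  console.log('Run incomplete:', run.incomplete_details?.reason);
} else if (run.status === 'completed') {
  // The response fit within the cap.
}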

What I expected was that this parameter would make the model generate a response within the max_completion_tokens limit, but instead it just checks whether the limit was exceeded and kills the thread run.

Does anyone know if this is really the expected behavior?

The AI generally cannot “see” any of the parameters that you are using on the API. The AI can’t observe you set max_tokens to only 2, and then think “I’d better make those two tokens count!” (or even perceive what a token is).

So it is instructions to the AI that shape the length of the output, whether from the assistant instructions or from the user's request.

Restricting the context or completion tokens may have unexpected side effects, because the AI may be emitting language to internal tools that you never receive - and you'd never get a response back if there isn't enough budget left to respond to the user.

With assistants, it is a safety mechanism, a kill switch: it keeps a run from spending $10 on a loop of repeated internal calls that only result in errors.


Yes, this is how max_completion_tokens is supposed to work.

Just like max_tokens on the chat completions endpoint, it doesn't make the assistant fit the complete response under that token count; instead, it caps the maximum number of tokens the model can spend generating the response.
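For comparison, a minimal sketch of the same behavior on the chat completions endpoint, assuming the Node openai SDK (the model, prompt, and cap are just illustrative):

const completion = await this.openAI.chat.completions.create({
  model: 'gpt-3.5-turbo-1106',
  messages: [{ role: 'user', content: 'Summarize our return policy.' }],
  max_tokens: 50 // hard cap on output tokens, not a target length
});

// If the model wanted to write more, the output is simply cut off
// and finish_reason is 'length' instead of 'stop'.
console.log(completion.choices[0].finish_reason);
console.log(completion.choices[0].message.content);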


I see!

What happens is that I've tried setting the instructions to return only a specific number of characters, but it is not very reliable. Sometimes it works fine, but sometimes it returns a huge message.

Due to cost limits, we can only use the gpt-3.5 models for now, so I was really hoping that max_completion_tokens would solve this issue for us.

We are currently using the gpt-3.5-turbo-1106 model.

So apart from the instructions, is there any more reliable approach for limiting the generated message length?

I say “generally” the AI doesn’t know - because you can make it know.

Just as I can tell the AI "your provided top_p setting is 1.0, good for poems or brainstorming, not good if you intend on writing code", you can inject a system message mapped from your parameter values into natural language: "API user has set a hard limit and output truncation point of 100 tokens (of 4096 possible) - good for a maximum of 50 words or 5 sentences of AI response", or even "5 tokens, only one word will be reliably seen!"

Then it is up to the AI training to follow that instruction – or follow its training for a “business email” instead.
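A minimal sketch of that idea applied to the thread-run call from earlier in this thread; the token-to-words mapping and the wording of the injected hint are assumptions of mine, not anything the API does for you:

const maxCompletionTokens = 300; // hard cap enforced by the API

// Hypothetical mapping from the hard cap to a natural-language hint the model can actually "see".
const lengthHint =
  `Hard limit: your reply is truncated after ${maxCompletionTokens} tokens ` +
  `(roughly ${Math.floor(maxCompletionTokens * 0.75)} words). ` +
  `Answer in at most 2-3 short sentences so nothing is cut off.`;

const threadRun = await this.openAI.beta.threads.runs.create(threadId, {
  assistant_id: assistantIdToUse,
  tools: [{ type: 'file_search' }],
  additional_instructions: `${additionalInstructions}\n\n${lengthHint}`,
  max_completion_tokens: maxCompletionTokens // still kept as the kill switch
});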

(OpenAI has trained the AI to quite reliably avoid putting out anywhere near 4096 tokens, instead wrapping up long tasks prematurely under 1000, so those who run the AI have better control…)