4096 completion token limit with gpt-4o. Using assistant streaming API

I chat with OpenAI support and they confirmed that the gpt-4o has a limit of 4096 completion tokens and I should use a strategy to work around it. (btw, It is not documented on the model card). Now, I am trying to catch when the assistant streaming api call hits the limit. But, it seems to me that there is a discrepancy between the documentation at https://platform.openai.com/docs/assistants/deep-dive/run-lifecycle and how the API actually behaves.

The documentation says:

incomplete: Run ended due to max_prompt_tokens or max_completion_tokens reached. You can view the specific reason by looking at the incomplete_details object in the Run.
in_progress: While in_progress, the Assistant uses the model and tools to perform steps. You can view progress being made by the Run by examining the [Run Steps](https://platform.openai.com/docs/api-reference/runs/step-object).

How the API behaves:
To be clear, this is my Python code:

with OpenAIClient.api.beta.threads.runs.stream(**stream_params) as stream:
    run = stream.get_final_run()

A) When I don’t set max_completion_tokens, or set it to a number higher than 4096, the run finishes without any errors, with a “completed” status. Everything, including the “usage” information, is available. The only way to see the problem is to look at the end of the message and check whether the sentence, JSON, or whatever result I am expecting is cut off.
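For case A, the only signal is the message text itself. When I am expecting JSON output, one heuristic I could imagine is simply trying to parse the reply: a parse failure is a strong hint the output was cut off. (This is my own sketch; `looks_truncated_json` is a hypothetical helper, not part of the OpenAI SDK.)

```python
import json


def looks_truncated_json(text: str) -> bool:
    """Heuristic: a reply that was supposed to be JSON but does not
    parse is likely cut off mid-output. Hypothetical helper, not an
    SDK function; only useful when the expected output is JSON."""
    try:
        json.loads(text)
        return False
    except json.JSONDecodeError:
        return True
```

Of course this only works for structured output; for free-form prose there is no reliable client-side check, which is why I moved on to case B.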

B) To work around this, I tried setting max_completion_tokens to 4096. But, contrary to what the documentation says, the run finishes with an “in_progress” status rather than “incomplete”. The run also does not return “incomplete_details”. However, I found that the message object does return incomplete_details with the reason “max_tokens”. So the following code seems to work for detecting that the message has been cut off and that I need to apply my strategy:

if run.status in ("in_progress", "incomplete") \
        and final_messages \
        and final_messages[0].status == "incomplete" \
        and final_messages[0].incomplete_details.reason == "max_tokens":
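For readability, the same check could be packaged as a small helper. This is just my own sketch (`is_truncated_run` is a name I made up); the attribute access mirrors the Run and Message objects I see coming back from the SDK, including the observation above that the run itself may stay “in_progress”:

```python
def is_truncated_run(run, final_messages) -> bool:
    """Return True when the run appears to have hit max_completion_tokens.

    Checks both the run status and the first message's incomplete_details,
    since (as observed above) the run may report "in_progress" rather than
    "incomplete" when the limit is reached.
    """
    if run.status not in ("in_progress", "incomplete"):
        return False
    if not final_messages:
        return False
    msg = final_messages[0]
    return (
        msg.status == "incomplete"
        and msg.incomplete_details is not None
        and msg.incomplete_details.reason == "max_tokens"
    )
```

The guard on `incomplete_details is not None` is there because the field can be absent on messages that finished normally.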

I just want to check: do you see any issues with this approach? Am I on the wrong path? Have you encountered this issue?
