Overview
Continuing from this post.
I'm running data extraction tasks on documents and I'm trying to take advantage of the 128k context window that gpt-4-turbo offers, as well as the JSON mode setting. I'm experiencing a bug where generation breaks at a token length of 1024. As far as I know, the output length limit should be 4096. The behavior looks similar to an issue the docs mention, but the existing solutions / recommendations are insufficient and don't explain the specific observation I'm making.
Arbitrary Code Note
While writing this post, I wrote some arbitrary code to reproduce the error and got an "error message" I had never seen before during my data extraction tasks.
{
  "status": "failed",
  "reason": "The current configuration of the AI model does not support generating a response that exceeds 1024 tokens. Counting to 1500 in JSON would go beyond this token limit and is therefore not possible within a single response."
}
This is NOT a real error message from the endpoint; it's the GPT-generated response from the chat completion. This confuses me even more, but it leads me to believe there's some config setting that's disabled or not working as intended, or, on the off chance, that OpenAI as a whole forgot to enable the 4096 output token length.
Describe the bug
Infinite Stream of Blank Characters
- When using JSON mode, always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don't include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don't forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.
My prompts do include the JSON mode instruction, and even then, I observe that when the response generation reaches a token length of 1024, it begins generating an infinite stream of blank characters until content_filter is triggered. Below, I've stripped the blank space characters to indicate the token length.
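The measurement itself is roughly along these lines (a sketch; it assumes tiktoken's cl100k_base encoding, which is what gpt-4-turbo uses, and a response_from_stream string collected from the streamed chunks, as in the reproduction code further down):

import tiktoken

# Sketch: measure where the whitespace padding begins. Assumes cl100k_base
# (the gpt-4-turbo encoding) and a `response_from_stream` string collected
# from the streamed chunks.
encoding = tiktoken.get_encoding("cl100k_base")

content_only = response_from_stream.rstrip()  # drop the trailing run of blank characters
print("Tokens before the padding:", len(encoding.encode(content_only)))
print("Tokens including the padding:", len(encoding.encode(response_from_stream)))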
Premature Stop and Malformed JSON Response
- The JSON in the message the model returns may be partial (i.e. cut off) if finish_reason is length, which indicates the generation exceeded max_tokens or the conversation exceeded the token limit. To guard against this, check finish_reason before parsing the response.
The response returned was a result of the length stop condition. I noted that the token length of the returned response was 1058 tokens; however, on further inspection I saw that the JSON response was malformed (beyond being incomplete), and the malformation began exactly where the token length reached 1024!
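For completeness, the guard the docs recommend looks something like this (a sketch; parse_json_choice is a hypothetical helper name, and completion is assumed to be a non-streaming chat completion from the same client as in the reproduction code below):

import json

def parse_json_choice(completion):
    # Guard from the docs: check finish_reason before parsing the JSON.
    choice = completion.choices[0]
    if choice.finish_reason == "length":
        # Cut off by max_tokens / the context limit. In my case this fires even
        # though max_tokens=4096 and the response is only ~1058 tokens long.
        raise ValueError("Response truncated (finish_reason=length); not parsing.")
    return json.loads(choice.message.content)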
Code to Reproduce
Just create some messages and make a chat completion that will likely generate a response with token_length > 1024. Here's some code below to get started.
import os

import tiktoken
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)


# Stand-in for the original self.num_tokens_from_response helper; assumed
# here to count tokens with tiktoken's cl100k_base encoding (gpt-4-turbo).
def num_tokens_from_response(text: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


initial_messages = [
    {
        "role": "system",
        "content": "You are an AI Assistant. You will follow the user instruction. You will write a JSON response.",
    },
    {
        "role": "user",
        "content": "I want you to use GPT to generate the counting to 1500. Do not create code. Respond in JSON only. I do not care about reasonableness. You will count to 1500. You will write each number individually, and your output length max is 4096 tokens.",
    },
]

response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)

response_from_stream = ""
blank_space_threshold = 125
consecutive_blank_space_count = 0

for chat_completion_chunk in response_stream:
    # Some chunks may carry no choices (e.g. Azure's prompt filter results chunk).
    if not chat_completion_chunk.choices:
        continue
    delta = chat_completion_chunk.choices[0].delta
    content = delta.content if delta else None
    if not content:
        continue
    response_from_stream += content
    if not content.isspace():
        consecutive_blank_space_count = 0
    else:
        consecutive_blank_space_count += len(content)
        if consecutive_blank_space_count > blank_space_threshold:
            raise Exception(
                "Encountered blank space bug."
                + "\n"
                + "Response from Stream Token Length: "
                + str(num_tokens_from_response(response_from_stream))
                + "\n"
                + "Response from Stream: "
                + response_from_stream
            )
Troubleshooting
Confirm max_tokens and model deployment
Can confirm this runs, and increasing max_tokens beyond 4096 will cause a model error as expected.
response_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)
Test random settings and prompts
I tried different prompts, temperature values, etc., and included more and fewer "JSON" instructions. The model never generated more than 1024 tokens.
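A rough sketch of the kind of sweep I ran (it reuses client, initial_messages, and num_tokens_from_response from the reproduction code above; only the temperature varies here, and the prompt variations were along the same lines):

# Rough sketch of the settings sweep: re-run the same JSON-mode request at
# different temperatures and record how much content comes back before the
# padding starts.
for temperature in (0.0, 0.5, 1.0):
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=initial_messages,
        max_tokens=4096,
        temperature=temperature,
    )
    text = completion.choices[0].message.content or ""
    print(f"temperature={temperature}: {num_tokens_from_response(text.rstrip())} tokens")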
JSON mode on vs. off
Generation is not capped at the observed 1024-token limit when JSON mode is OFF. It appears to be a bug that occurs specifically when JSON mode is on.
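The comparison is just the same request without response_format (a sketch reusing client, initial_messages, and num_tokens_from_response from the reproduction code above):

# Same request as the reproduction above, but with JSON mode off
# (no response_format). Here the generation is not capped at 1024 tokens.
comparison_stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=initial_messages,
    max_tokens=4096,
    stream=True,
)

comparison_text = "".join(
    chunk.choices[0].delta.content or ""
    for chunk in comparison_stream
    if chunk.choices and chunk.choices[0].delta
)
print("Tokens with JSON mode off:", num_tokens_from_response(comparison_text))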