GPT-4o hangs after first chunk

I’ve noticed in the last few days that API calls using gpt-4o are hanging after the first chunk is received. I don’t think this was happening previously. What’s weird is that as soon as the second chunk comes back, the entire payload is right behind it, similar to a non-streaming request.

Anyone see me doing something wrong?

Using Azure, latest openai Node.js library version.
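
For reference, with the openai Node.js package the Azure client would be initialized along these lines (a sketch only; the resource endpoint, API version, and deployment name are placeholders, not taken from my actual setup):

  import { AzureOpenAI } from 'openai';

  const openai = new AzureOpenAI({
    endpoint: 'https://my-resource.openai.azure.com', // hypothetical resource
    apiKey: process.env.AZURE_OPENAI_API_KEY,
    apiVersion: '2024-02-01',
    deployment: 'gpt-4o', // Azure deployment name, not the model name
  });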

  const stream = await openai.chat.completions.create({
    model:       'gpt-4o',
    messages:    messages,
    stream:      true,
  });

  for await (const part of stream) {
    console.log(`created ==> `, part.created, Utils.getFormattedDate());
    ...
  }
  created ==>  0 2024-06-04T14:57:53.299
  # see the ~18 second difference here between the first chunk and second
  created ==>  1717538273 2024-06-04T14:58:11.624
  created ==>  1717538273 2024-06-04T14:58:11.626
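
To quantify the stall, logging the wall-clock gap between chunks makes the ~18 s pause obvious. A minimal sketch, assuming a fresh `stream` created the same way as above:

  let last = Date.now();
  for await (const part of stream) {
    const gap = ((Date.now() - last) / 1000).toFixed(2); // seconds since previous chunk
    last = Date.now();
    console.log(`+${gap}s`, part.created);
  }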

Hmm

Tokens have been coming in batches with some deployments for quite a while now. This seems to be unrelated to the openai library, but rather something on MSFT’s end that I just take as a given at this point. How long are your responses?

The responses are usually on the larger side, as we use this model in RAG and upload contexts.

I was really hoping I was doing something wrong and it wasn’t just Azure… (I had a feeling.)

Well, the first step to diagnosing issues is to try to replicate them with a simple curl call (or using the Azure console). If you get the same effect there, it’s probably all you’re gonna get.
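
For example, a bare-bones streaming call against Azure with curl’s -N (--no-buffer), so chunks print the moment they arrive (a sketch; the resource, deployment, and api-version are placeholders):

  curl -N "https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-02-01" \
    -H "Content-Type: application/json" \
    -H "api-key: $AZURE_OPENAI_API_KEY" \
    -d '{"messages":[{"role":"user","content":"Count to 50."}],"stream":true}'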

It’s possible that it’s a library issue (it wouldn’t be the first time), but it really depends. I don’t have a 4o deployment at the moment, but the issue sounds familiar.

Perhaps it’s the Azure content_filter running on the output, processing it in long moderation batches?

On OpenAI I’ve seen vision models and others return bursts of chunks, but never chunks longer than single tokens or multi-character tokens.

Current chunk timing (me downloading the URL and sending base64):

Prompt: have a looky
Image URL? (enter=none):https://i.imgur.com/SQNDPKS.jpeg

What a cozy scene! The image shows a black cat nestled comfortably among some beige blankets and pillows. The cat has striking yellow eyes and appears to be quite relaxed. The background features a floral-patterned pillowcase, adding a touch of color to the setting. It looks like a perfect spot for a cat to rest and enjoy some quiet time.[5.9, 5.92, 5.92, 5.92, 5.92, 5.92, 5.92, 5.93, 5.93, 5.93, 5.93, 6.02, 6.07, 6.12, 6.18, 6.3, 6.3, 6.5, 6.54, 6.54, 6.54, 6.55, 6.55, 6.55, 6.57, 6.57, 6.57, 6.57, 6.58, 6.59, 6.59, 6.59, 6.6, 6.6, 6.6, 6.62, 6.62, 6.62, 6.64, 6.64, 6.65, 6.65, 6.65, 6.67, 6.67, 6.67, 6.68, 6.69, 6.69, 6.7, 6.7, 6.7, 6.72, 6.72, 6.74, 6.74, 6.75, 6.75, 6.77, 6.77, 6.78, 6.79, 6.79, 6.81, 6.81, 6.82, 6.82, 6.84, 6.84]

(note to the developer: I had to add a Referer header for imgur links, along with other headers, to make the image request look like it came from a browser viewing the site)
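
The timing harness itself isn’t shown; in Node.js, something along these lines (a sketch, not the actual code used above) would produce a bracketed list like that, recording seconds elapsed since the request started as each chunk arrives:

  const t0 = Date.now();
  const elapsed = [];
  for await (const part of stream) {
    elapsed.push(((Date.now() - t0) / 1000).toFixed(2)); // arrival time of each chunk
  }
  console.log(`[${elapsed.join(', ')}]`);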

Yeah, looks like Azure; same behavior using curl with --no-buffer.

@_j thank you both. I changed my custom content filter to do async inspection (I’d never seen this option before) and it started responding at lightning speed!

It looks like it could indeed be a content-filtering issue, as @_j mentioned:

here are the docs:

The content filtering system is integrated and enabled by default for all customers. In the default streaming scenario, completion content is buffered, the content filtering system runs on the buffered content, and – depending on the content filtering configuration – content is either returned to the user if it doesn’t violate the content filtering policy (Microsoft’s default or a custom user configuration), or it’s immediately blocked and returns a content filtering error

but it also talks about an async content filter:

Asynchronous Filter

Customers can choose the Asynchronous Filter as an additional option, providing a new streaming experience. In this case, content filters are run asynchronously, and completion content is returned immediately with a smooth token-by-token streaming experience. No content is buffered, which allows for a fast streaming experience with zero latency associated with content safety.

a qualitative look at the difference (screenshots of the default vs. async streaming behavior omitted)

now as to how to figure out the CLI command to fix all your bajillion deployments…
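
If the async-filter content policy has already been created (e.g., in the Azure portal), one plausible route is re-running the deployment create command with the policy attached. A sketch only; every name below is a placeholder, and I haven’t verified the flag set against every CLI version:

  az cognitiveservices account deployment create \
    --resource-group my-rg \
    --name my-aoai-resource \
    --deployment-name gpt-4o \
    --model-format OpenAI \
    --model-name gpt-4o \
    --model-version "2024-05-13" \
    --sku-name Standard \
    --sku-capacity 10 \
    --rai-policy-name my-async-filter-policy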

you beat me to it haha
