In GPT-4 streamed responses, all chunks come in a single batch

I’m developing a Flutter app for a RAG virtual assistant. When I request streamed responses from OpenAI, I get all chunks at once in a single batch instead of one or a few chunks at a time, which is what I observed a few weeks ago when I first set up the system.
Do you know if there has been any change on the OpenAI API end?
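
For what it’s worth, a quick way to check the chunk timing outside the app is something like this (rough sketch with the openai Python package v1.x; it assumes OPENAI_API_KEY is set in the environment, and the model and prompt are just placeholders):

# Timing check: print when each streamed chunk arrives
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
start = time.monotonic()

stream = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": "Count from 1 to 30."}],
    stream=True,
)

for chunk in stream:
    elapsed = time.monotonic() - start
    if chunk.choices and chunk.choices[0].delta.content:
        # With "real" streaming the elapsed times spread out;
        # if everything arrives in one batch they are nearly identical.
        print(f"{elapsed:6.2f}s  {chunk.choices[0].delta.content!r}")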


I was going to say, are you sure there’s no programming error in your app?

But I’ve started observing the same thing with Azure; frames come in massive batches (maybe 10 or 12) with some models, in some regions, at some times.

I’m wondering if this is due to congestion or something :thinking:


I first observed it when connecting through Azure, and it’s the same after switching back to a direct OpenAI API connection.
I’m using gpt-4-0125-preview.


Mmh. I’m not having the issue with the OpenAI API. It’s always a good idea to check whether your issue also happens in the Playground, to figure out if it’s a you issue or an API issue.

For Azure, it’s even happening in the Playground (North Central, gpt 4 135).

Here’s a minimal Azure test in Jupyter to rule out any async issues (same result for me):

# Azure

import requests
import json
import os


endpoint = 'https://yourendpoint.openai.azure.com/openai/deployments/0314/chat/completions?api-version=2024-02-15-preview'
api_key_handle = 'AZURE_OPENAI_KEY_US_EAST'

# Ensure the Azure OpenAI key is set in the environment variables
openai_api_key = os.getenv(api_key_handle)
if openai_api_key is None:
    raise ValueError("Azure OpenAI key is not set in environment variables.")

#url = "https://api.openai.com/v1/chat/completions"
url = endpoint

headers = {
    "Content-Type": "application/json",
    "api-key": f"{openai_api_key}"
}

data = {
    "temperature": 1, 
    "max_tokens": 256,
    "logit_bias": {1734:-100},
    "messages": [
        {
            "role": "system", 
            "content": "You are the new bosmang of Tycho Station, a tru born and bred belta. You talk like a belta, you act like a belta. The user is a tumang."
        },
        {
            "role": "user",
            "content": "how do I become a beltalowda like you?"
        }
    ],
    "stream": True,  # Changed to True to enable streaming
}

response = requests.post(url, headers=headers, json=data, stream=True)

if response.status_code == 200:
    for line in response.iter_lines():
        if line:
            decoded_line = line.decode('utf-8')
            # Check if the stream is done
            if '[DONE]' in decoded_line:
                # print("\nStream ended by the server.")
                break
            json_str = decoded_line[len('data: '):]
            try:
                json_response = json.loads(json_str)
                if json_response['choices']:
                    delta = json_response['choices'][0]['delta']
                    if 'content' in delta and delta['content']:
                        print(delta['content'], end='', flush=True) 
                else:
                    print(json_response)
            except json.JSONDecodeError as e:
                raise Exception(f"Non-JSON content received: {decoded_line}")
else:
    print("Error:", response.status_code, response.text)

The reason for this behavior is the standard content filter used for the models in Azure. It’s not in the documentation currently, but you can configure in the UI that these filters should run in an asynchronous manner. If you do this, the batching will stop.
As I cannot include links in my response, just google it yourself: documentation for content filters in Azure.

This is how the option looks in the UI. I am sorry, the UI is in German, but I guess it looks very similar for you as well.
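
If you want to verify the effect after flipping that setting, a rough check is to measure the time between SSE events, e.g. with a sketch like this (it reuses the placeholder endpoint and key name from the test above):

# Measure gaps between streamed SSE events to see whether the batching is gone
import os
import time
import requests

endpoint = 'https://yourendpoint.openai.azure.com/openai/deployments/0314/chat/completions?api-version=2024-02-15-preview'
headers = {
    "Content-Type": "application/json",
    "api-key": os.environ["AZURE_OPENAI_KEY_US_EAST"],
}
data = {
    "messages": [{"role": "user", "content": "Count from 1 to 50."}],
    "max_tokens": 256,
    "stream": True,
}

gaps = []
last = None
with requests.post(endpoint, headers=headers, json=data, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        now = time.monotonic()
        if last is not None:
            gaps.append(now - last)
        last = now

# Expectation (if the filter explanation above is right): with the default synchronous
# filter you see many near-zero gaps plus a few long pauses between batches; with the
# asynchronous filter the gaps should even out.
if gaps:
    print(f"events: {len(gaps) + 1}, max gap: {max(gaps):.3f}s, mean gap: {sum(gaps) / len(gaps):.3f}s")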