Gpt-3.5-turbo-instruct stream=True not working?

Hey folks,

gpt-3.5-turbo-instruct is super impressive, but I noticed one really weird thing: streaming doesn’t seem to work. It works fine in the playground, so I am just wondering if I am doing something wrong.

Vanilla code for testing:

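import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Request a streamed completion from the legacy completions endpoint
response = openai.Completion.create(
  model="gpt-3.5-turbo-instruct",
  prompt="Write me a poem",
  stream=True,
)

# Print each chunk as it arrives
for chunk in response:
  print(chunk)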

Always results in:

{
  "id": "cmpl-81gUZ3QrUBZMFsGzhS3sxRAappqlZ",
  "object": "text_completion",
  "created": 1695412359,
  "choices": [
    {
      "text": "",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "model": "gpt-3.5-turbo-instruct"
}

It’s not a biggie; I just want to ascertain what exactly is happening here.

The new turbo-instruct model uses the old completion endpoint not the new chat completion endpoint, so it doesn’t have a stream option…

You can find more on all the models here…

@PaulBellow according to OpenAI documentation, the legacy completion endpoint does have the stream option:

You linked here…

Create chat completion

You can see legacy completions endpoint here…

ETA: Looks like completion should have stream too…

I wonder if the new turbo-instruct completion endpoint is different / new?

ETA2: Looks like the link for cookbook example is talking about gpt-turbo (chat endpoint) not gpt-turbo-instruct… So… might be a bug or new completion endpoint with docs not updated yet.

ETA3: Might try the example they used?

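import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
# Call was cut off in the post; model and stream=True are assumed from the thread.
for chunk in openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Say this is a test",
    stream=True,
):
    print(chunk["choices"][0]["text"], end="")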

Correct, I already wrote a stateless utility.

Set the first variable to stream=False and get the alternate output method with tokens per second.

import os
import time
import openai

stream = True  # set to False for the non-streaming output with tokens per second
openai.api_key = os.getenv("OPENAI_API_KEY")
system = """
An AI assistant replies to user input. It keeps no memory of chat.
assistant: I am a helpful artificial intelligence, capable of many human-like tasks.
"""
user = "Write an introduction a user will see when they first start your chatbot program"
while user not in ["exit", ""]:
    stime = time.time()
    api_out = openai.Completion.create(
        prompt = system + "\n\nuser: " + user + "\nassistant:",
        model="gpt-3.5-turbo-instruct", stream=stream, max_tokens=666)
    ctime = round(time.time() - stime, ndigits=3)
    if stream:
        # Streaming: print each chunk's text as it arrives
        for chunk in api_out:
            print(chunk["choices"][0]["text"], end='')
        print()
    else:
        # Non-streaming: print the whole reply, then timing and token stats
        print(api_out["choices"][0]["text"])
        ctokens = int(api_out['usage']['completion_tokens'])
        tps = round(ctokens / ctime, ndigits=1)
        print(f"-- completion: time {ctime}s, {ctokens} tokens, {tps} tokens/s --")
    user = input("==>")
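
If you also want rough throughput numbers in streaming mode, here is a minimal sketch. `consume_stream` is a made-up helper name, and it assumes each chunk has the `chunk["choices"][0]["text"]` shape used above. The chunk count is only a stand-in for the token count, since streamed responses don't give you a `usage` block to read here.

```python
import time
from typing import Iterable, Tuple


def consume_stream(chunks: Iterable[dict]) -> Tuple[str, int, float]:
    """Print streamed text as it arrives; return (full_text, chunk_count, seconds)."""
    start = time.perf_counter()
    parts = []
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        print(piece, end="", flush=True)  # show text immediately
        parts.append(piece)
    elapsed = time.perf_counter() - start
    return "".join(parts), len(parts), elapsed
```

You would pass it the object returned by `openai.Completion.create(..., stream=True)`. Note that the timer starts when you begin reading the stream, so it does not include the time spent sending the request.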

I’ve tried the legacy completion endpoint with the turbo-instruct model and stream enabled, and it works correctly. My test used the simple-openai Java library.


ETA: The attached image was too small. Perhaps you can see a better image at this link:

Nice, I got it to work in the end, but it seems somewhat intermittent. I’m trying to find the difference in the params now.


I just had it barf on me once after a few tokens, but that’s at temperature=1 so it might have just output the possibility of an “end” instead of finishing my multipart banana-peeling instructions.

PS: if you don’t like their streaming, you can do your own expensive “streaming”. Ask for max_tokens=1. Add that token to the end of your prompt. Call the API again. End when you get a null. :smile:
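
A minimal sketch of that one-token-at-a-time loop, in case it helps. `one_token_stream` and `openai_next_token` are names made up here. It assumes the legacy (pre-1.0) `openai` Python SDK used elsewhere in this thread, and that an empty or missing `text` means the model has stopped.

```python
from typing import Callable, Iterator, Optional


def one_token_stream(
    prompt: str,
    next_token: Callable[[str], Optional[str]],
    max_steps: int = 256,
) -> Iterator[str]:
    """Yield one token per API call, re-sending the growing prompt each time."""
    text = prompt
    for _ in range(max_steps):
        token = next_token(text)
        if not token:  # None or "" is treated as the end of the completion
            return
        text += token  # append the token so the next call continues from it
        yield token


def openai_next_token(text: str) -> Optional[str]:
    """Hypothetical helper: ask the completions endpoint for a single token."""
    import openai  # imported here so the loop above can be tried without the SDK

    out = openai.Completion.create(
        model="gpt-3.5-turbo-instruct", prompt=text, max_tokens=1
    )
    return out["choices"][0]["text"]
```

Usage would look like `for tok in one_token_stream("Write me a poem\n", openai_next_token): print(tok, end="")`. Every step re-sends the whole prompt, which is why it is the expensive option.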

(That actually has research use, like you can make your own top-k sampling (up to 5). What if you make a response out of only the second-best token choices?)
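
For the second-best-token idea, a rough sketch. It assumes that with `logprobs=5` the legacy completions response carries `choices[0]["logprobs"]["top_logprobs"]` as a list holding one token-to-logprob dict per generated token. `pick_rank` and `second_best_next_token` are hypothetical helpers, not part of any library.

```python
from typing import Dict


def pick_rank(top_logprobs: Dict[str, float], rank: int = 1) -> str:
    """Return the token at the given rank (0 = most likely, 1 = second best)."""
    if not top_logprobs:
        raise ValueError("no candidate tokens returned")
    ranked = sorted(top_logprobs.items(), key=lambda kv: kv[1], reverse=True)
    # Fall back to the last candidate if fewer than rank + 1 were returned.
    return ranked[min(rank, len(ranked) - 1)][0]


def second_best_next_token(text: str) -> str:
    """Hypothetical helper: request one token with logprobs, keep the runner-up."""
    import openai  # legacy pre-1.0 SDK, as used in this thread

    out = openai.Completion.create(
        model="gpt-3.5-turbo-instruct", prompt=text, max_tokens=1, logprobs=5
    )
    return pick_rank(out["choices"][0]["logprobs"]["top_logprobs"][0], rank=1)
```

This sketch does not detect the end of the completion, so a real loop would still need its own stopping rule, such as a step limit.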