Gpt-3.5-turbo-instruct stream=True not working?

Hey folks,

gpt-3.5-turbo-instruct is super impressive, however, I noticed one really weird thing: streaming doesn’t seem to work. It works fine in the playground, so I am just wondering if I am doing something wrong.

Vanilla code for testing:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
  model="gpt-3.5-turbo-instruct",
  prompt="Write me a poem"
  temperature=1,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stream=True
)

for chunk in response:
  print(chunk)

Always results in:

{
  "id": "cmpl-81gUZ3QrUBZMFsGzhS3sxRAappqlZ",
  "object": "text_completion",
  "created": 1695412359,
  "choices": [
    {
      "text": "",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "model": "gpt-3.5-turbo-instruct"
}

It’s not a biggie, I’d just like to ascertain what exactly is happening here.
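
For reference, here’s roughly how I’d expect to consume the stream once it works — just a sketch along the lines of the docs example, accumulating the per-chunk text deltas (the prompt and parameters are only illustrative):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Write me a poem",
    max_tokens=256,
    stream=True,
)

# Each streamed chunk carries a small text delta; join them to rebuild the full completion
collected = []
for chunk in response:
    collected.append(chunk["choices"][0]["text"])

print("".join(collected))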

The new turbo-instruct model uses the old completions endpoint, not the new chat completions endpoint, so it doesn’t have a stream option…

You can find more on all the models here…

@PaulBellow, according to the OpenAI documentation, the legacy completions endpoint does have the stream option:


You linked here…

Create chat completion

You can see the legacy completions endpoint here…

ETA: Looks like completion should have stream too…

I wonder if the new turbo-instruct completion endpoint is different / new?

ETA2: Looks like the link for the cookbook example is talking about gpt-3.5-turbo (chat endpoint), not gpt-3.5-turbo-instruct… So… might be a bug, or a new completion endpoint with docs not updated yet.

ETA3: Might try the example they used?

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

for chunk in openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Say this is a test",
    max_tokens=7,
    temperature=0,
    stream=True
):
    print(chunk['choices'][0]['text'])

Correct, I already wrote a stateless utility.

Set the stream variable at the top to stream=False to get the alternate output method, with tokens per second.

import os
import time
import openai

stream = True
openai.api_key = os.getenv("OPENAI_API_KEY")

# Stateless "system" preamble prepended to every request; no chat memory is kept
system = """
An AI assistant replies to user input. It keeps no memory of chat.
assistant: I am a helpful artificial intelligence, capable of many human-like tasks.
""".strip()

user = "Write an introduction a user will see when they first start your chatbot program"
while not user in ["exit", ""]:
    stime = time.time()
    api_out = openai.Completion.create(
        prompt = system + "\n\nuser: " + user + "\nassistant:",
        model="gpt-3.5-turbo-instruct", stream=stream, max_tokens=666)
    ctime = round(time.time() - stime, ndigits=3)
    if stream == True:
        # Streaming: print each token chunk as it arrives
        for chunk in api_out:
            print(chunk["choices"][0]["text"], end='')
        print()
    else:
        # Non-streaming: print the whole completion and report tokens per second
        print(api_out['choices'][0]['text'].strip())
        ctokens = int(api_out['usage']['completion_tokens'])
        tps = round(ctokens / ctime, ndigits=1)
        print(f"-- completion: time {ctime}s, {ctokens} tokens, {tps} tokens/s --")
    user = input("==>")

I’ve tried the legacy completions endpoint with the turbo-instruct model, with stream, and it is working correctly. My test was using the simple-openai Java library.

[attached image: turbo_instruct_stream]

ETA: The attached image was too small. Perhaps you can have a better image in this link:


Nice, I got it to work in the end, but it seems somehow intermittent. Trying to find the difference in the params now.


I just had it barf on me once after a few tokens, but that’s at temperature=1 so it might have just output the possibility of an “end” instead of finishing my multipart banana-peeling instructions.


PS if you don’t like their streaming, you can do your own expensive “streaming”. Ask for max_tokens=1. Add that token to the end of your prompt. Call the API again. End when you get a null. :smile:

(That actually has research use, like you can make your own top-k sampling (up to 5). What if you make a response out of only the second-best token choices?)
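
Something like this, roughly — just a sketch of that one-token-at-a-time loop, with an illustrative prompt and stop condition:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

prompt = "Write me a haiku about bananas:\n"

# "Streaming" one token per API call: request a single token,
# append it to the prompt, and call again until the model stops.
while True:
    out = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=1,
        temperature=1,
        logprobs=5,  # also returns the top-5 alternative tokens at each step
    )
    choice = out["choices"][0]
    token = choice["text"]
    if token == "" or choice["finish_reason"] == "stop":
        break
    print(token, end="", flush=True)
    prompt += token
print()

The logprobs=5 part is what gives you the per-step alternatives if you want to try the “second-best token” experiment.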