gpt-3.5-turbo-instruct is super impressive; however, I noticed one really weird thing: streaming doesn’t seem to work. It works fine in the playground, so I’m just wondering if I’m doing something wrong.
Vanilla code for testing:
import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Write me a poem",
    temperature=1,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stream=True
)
for chunk in response:
    print(chunk)
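For comparison, when streaming does work, each chunk comes back as a small text_completion object and the incremental text lives in choices[0].text. A minimal sketch against the same pre-1.0 openai SDK (the chunk layout in the comment is from memory, so treat it as approximate):

for chunk in response:
    # each chunk looks roughly like:
    # {"object": "text_completion",
    #  "choices": [{"text": " word", "index": 0, "finish_reason": None}], ...}
    print(chunk["choices"][0]["text"], end="", flush=True)
print()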
ETA: Looks like the Completion endpoint is supposed to support stream too…
I wonder if the new turbo-instruct completion endpoint is different, or new?
ETA2: Looks like the cookbook example link is talking about gpt-3.5-turbo (the chat endpoint), not gpt-3.5-turbo-instruct… So it might be a bug, or a new completion endpoint whose docs haven’t been updated yet.
ETA3: Might as well try the example they used:
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

for chunk in openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Say this is a test",
    max_tokens=7,
    temperature=0,
    stream=True
):
    print(chunk['choices'][0]['text'])
Set the first variable to stream=False to get the alternate output method, which reports tokens per second.
import os
import time
import openai

stream = True  # set to False for the non-streaming output with tokens/s stats
openai.api_key = os.getenv("OPENAI_API_KEY")

system = """
An AI assistant replies to user input. It keeps no memory of chat.
assistant: I am a helpful artificial intelligence, capable of many human-like tasks.
""".strip()

user = "Write an introduction a user will see when they first start your chatbot program"
while user not in ["exit", ""]:
    stime = time.time()
    api_out = openai.Completion.create(
        prompt=system + "\n\nuser: " + user + "\nassistant:",
        model="gpt-3.5-turbo-instruct", stream=stream, max_tokens=666)
    ctime = round(time.time() - stime, ndigits=3)
    if stream:
        # print each chunk's text as it arrives
        for chunk in api_out:
            print(chunk["choices"][0]["text"], end='')
        print()
    else:
        print(api_out['choices'][0]['text'].strip())
        ctokens = int(api_out['usage']['completion_tokens'])
        tps = round(ctokens / ctime, ndigits=1)
        print(f"-- completion: time {ctime}s, {ctokens} tokens, {tps} tokens/s --")
    user = input("==>")
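(Note that the tokens-per-second stat only shows up in the stream=False branch: the streamed chunks from this legacy Completions endpoint don’t include a usage block, so there’s nothing to count completion tokens from.)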
I’ve tried the legacy completion endpoint with the turbo-instruct model, with stream enabled, and it is working correctly. My test used the simple-openai Java library.
I just had it barf on me once after a few tokens, but that was at temperature=1, so it might simply have sampled an early "end" instead of finishing my multipart banana-peeling instructions.
PS if you don’t like their streaming, you can do your own expensive "streaming": ask for max_tokens=1, add that token to the end of your prompt, call the API again, and end when you get nothing back.
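A rough sketch of that loop, assuming the same pre-1.0 openai SDK as the snippets above (untested, you pay to reprocess the whole prompt on every call, and treating empty text or a finish_reason of "stop" as the "null" is my guess):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

prompt = "Write me a poem:"
for _ in range(256):  # hard cap so the loop can't run forever
    out = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=1,  # one token per round trip
        temperature=0,
    )
    choice = out["choices"][0]
    if choice["text"] == "" or choice["finish_reason"] == "stop":
        break  # model emitted its end-of-text; nothing more to add
    print(choice["text"], end="", flush=True)
    prompt += choice["text"]  # feed the token back and ask for the next one
print()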
(That actually has research use: with logprobs you can make your own top-k sampling, up to k=5 on this endpoint. What if you make a response out of only the second-best token choices? A sketch of that follows.)
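The legacy logprobs parameter returns one token-to-logprob dict per position in top_logprobs, so with max_tokens=1 you can sort position 0 and append the runner-up each round. Same caveats as above (one call per token, untested):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

prompt = "Once upon a time"
for _ in range(20):  # build a short continuation, one token at a time
    out = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=1,
        temperature=0,
        logprobs=5,  # ask for the top-5 candidate tokens at each position
    )
    top = out["choices"][0]["logprobs"]["top_logprobs"][0]  # token -> logprob
    ranked = sorted(top.items(), key=lambda kv: kv[1], reverse=True)
    prompt += ranked[1][0]  # append the runner-up instead of the argmax
print(prompt)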