I’m looking for something like the option to interrupt in the playground. You can just cancel the stream if you see that it’s going in the wrong direction, though I’m not sure if this is really telling the server to stop computing.
I’m not looking for Stop Sequences. I just want the user of my app to have the ability to quickly stop and try something else (and possibly not waste all tokens).
@lemi did you figure this out? I have the same question: the ability to stop streaming (not via stop sequences) to save token costs when the output is clearly going in an unproductive direction (e.g. repetition).
According to my research, this should do the trick, since openai.Completion.create uses requests under the hood:
```python
response = openai.Completion.create(
    # Other stuff...
    stream=True,
)
try:
    for stream_resp in response:
        # Do stuff...
        if thing_happens:
            break
except Exception as e:
    print(e)
finally:
    response.close()
```
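For what it's worth, the reason `break` plus `response.close()` works on the client side is that the stream is a Python generator over the HTTP response: closing it raises `GeneratorExit` inside the generator, and the generator's cleanup is where the underlying connection gets released. A minimal stand-in that makes no API calls (`fake_stream` is an illustrative mock, not part of the SDK) shows the mechanism:

```python
closed = []

def fake_stream():
    # Stands in for the SSE generator the SDK returns with stream=True.
    try:
        for i in range(1000):
            yield {"token": i}
    finally:
        # Runs when .close() raises GeneratorExit -- this is where the
        # real SDK would release the underlying HTTP connection.
        closed.append(True)

response = fake_stream()
received = []
try:
    for chunk in response:
        received.append(chunk)
        if len(received) == 9:  # "thing_happens": the user hit stop
            break
finally:
    response.close()

print(len(received), bool(closed))  # 9 True
```

Whether the *server* notices the dropped connection promptly is exactly the open question in this thread; the mock only demonstrates the client-side half.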
I came up with the same solution, and it works on my end too. The server will inevitably generate at least a few more tokens than the client receives before the connection closes. What I was hoping for, though, was some assertion from OpenAI (or from someone who has tested this method carefully against their billing) that once the client closes the connection, the server actually stops generating tokens, rather than billing for the full completion that would have been streamed.
As a test, I asked the model to "Please introduce the GPT model structure in as much detail as possible" and had the API print every token received. The statistics from the OpenAI usage page are as follows (I am a new user and not allowed to post media, so I can only copy the results):
17 prompt + 441 completion = 458 tokens
After that, I stopped the generation when the number of tokens received was 9; the result was:
17 prompt + 27 completion = 44 tokens
So roughly 18 extra tokens were generated after I stopped the generation.
Then I stopped the generation at 100 tokens received; the result was:
17 prompt + 111 completion = 128 tokens
So I think the solution works well, at the cost of roughly 10 to 20 extra tokens each time.
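The overshoot is just the billed completion tokens minus the tokens received client-side before cancelling. A quick check against the figures reported above (nothing else is assumed):

```python
# (tokens received before cancelling, completion tokens billed by OpenAI)
experiments = [(9, 27), (100, 111)]
overshoots = [billed - received for received, billed in experiments]
print(overshoots)  # [18, 11]
```

So the server kept generating for roughly 10 to 20 tokens after the client closed the stream in both runs.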
This is the main function I call via the API to ask a question:
```python
def ask_question(request):
    thread = ChatThread(request)
    thread_references[thread.getName()] = thread  # Keep an in-memory reference to the thread
    thread.start()
    # Save the thread name to the database (ThreadReference is a Django model)
    thread_reference = ThreadReference(thread_name=thread.getName())
    thread_reference.save()
```
The thread class:
```python
import json
import threading
import uuid

import openai
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer


class ChatThread(threading.Thread):
    def __init__(self, request):
        self.request = request
        self.response = None  # Store the response object so stop() can call response.close()
        self.stop_event = threading.Event()  # Event object to signal the thread to stop
        self.thread_name = str(uuid.uuid4())  # Generate a unique thread name using UUID
        threading.Thread.__init__(self, name=self.thread_name)

    def run(self):
        try:
            channel_layer = get_channel_layer()
            i = 0
            generated_content = []
            # Simple streaming ChatCompletion request
            self.response = openai.ChatCompletion.create(
                model='gpt-3.5-turbo',
                messages=[
                    {'role': 'user', 'content': self.request.data.get('question', '')}
                ],
                temperature=0,
                stream=True
            )
            for chunk in self.response:
                if self.stop_event.is_set():  # stop() was called; exit the loop
                    break
                # time.sleep(3)  # you can use this to slow things down for testing
                content = chunk["choices"][0]["delta"].get("content", "")
                finish_reason = chunk["choices"][0].get("finish_reason", "")
                if finish_reason != "stop":
                    data = {"current_total": i, "content": content}
                else:
                    data = {"current_total": i, "content": "@@" + finish_reason + "@@"}
                generated_content.append(content)
                async_to_sync(channel_layer.group_send)(
                    f"chat_{self.request.data.get('chat_room', '')}", {
                        # 'type' is the handler function called in the consumer
                        'type': 'send_notification',
                        # this is the value passed to send_notification in the consumer
                        'value': json.dumps(data)
                    }
                )
                i += 1
            combined_content = ''.join(generated_content)
        except Exception as e:
            print(e)

    def stop(self):
        self.stop_event.set()
        if self.response:
            self.response.close()  # Close the response if it exists
```
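The stop pattern in the class above can be exercised without calling the API by substituting a slow dummy stream (everything here, `DummyStream` and the timings, is made up for illustration): the worker checks the `Event` on each chunk and exits as soon as `stop()` is called from another thread.

```python
import threading
import time

class DummyStream:
    """Stands in for the ChatCompletion stream; yields one chunk per 10 ms."""
    def __iter__(self):
        for i in range(1000):
            time.sleep(0.01)
            yield {"content": f"tok{i} "}

class StreamWorker(threading.Thread):
    def __init__(self):
        super().__init__()
        self.stop_event = threading.Event()
        self.chunks = []

    def run(self):
        for chunk in DummyStream():
            if self.stop_event.is_set():  # stop() was called from another thread
                break
            self.chunks.append(chunk["content"])

    def stop(self):
        self.stop_event.set()

worker = StreamWorker()
worker.start()
time.sleep(0.1)   # let a few chunks through
worker.stop()
worker.join(timeout=2)
print(0 < len(worker.chunks) < 1000)  # True: stopped mid-stream
```

In the real class, `stop()` additionally calls `self.response.close()`, so the loop ends even if it is blocked waiting on the network rather than between chunks.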
To close the response and stop the thread, I use an API view from Django REST Framework:
```python
@api_view(['POST'])
def stop_thread(request):
    thread_name = request.data.get('thread_name')  # the UUID saved in your database
    # Ensure the thread name exists in the database
    thread_reference = get_object_or_404(ThreadReference, thread_name=thread_name)
    if thread_name in thread_references:
        thread_references[thread_name].stop()
        del thread_references[thread_name]
    # thread_reference.delete()  # delete the thread name from the database if you use Django
    return Response({'message': f'Thread {thread_name} has been stopped.'})
```
FWIW, I repeated @Ashton1998’s experiment with curl -N and got the same results. So, there is no special event sent to the API, and just closing the server-sent event stream from the client side is sufficient to stop the generation.