Interrupting completion stream in Python

Is it possible to interrupt completion stream and not waste tokens? E.g. when I see that it’s looping or going in the wrong direction.

I know I can use stream option and then use the response object like a generator.

response = openai.Completion.create(
    ...,  # model, prompt, etc.
    stream=True,
)
for line in response:
    ...

But is it enough to just jump out of the loop when I decide enough is enough? Will the server then stop generating the rest of the tokens?


I think what you are looking for is Stop Sequences. Here is a short guide on how to use them: How do I use Stop Sequences? | OpenAI Help Center


I don’t think there’s a way to stop it once the stream starts, but I may be wrong.

I’m looking for something like the option to interrupt in the playground. You can just cancel the stream if you see that it’s going in the wrong direction, though I’m not sure if this is really telling the server to stop computing.

I’m not looking for Stop Sequences. I just want the user of my app to have the ability to quickly stop and try something else (and possibly not waste all tokens).


@lemi did you figure this out? I have the same question: the ability to stop streaming, not via stop sequences, to save token costs when the output is clearly going in an unproductive direction (e.g. repetitions).


Hey guys,
did you figure out this issue? Or find any alternative solutions?

According to my research, this should do the trick, since openai.Completion.create uses requests under the hood:

try:
    response = openai.Completion.create(
        # Other stuff...
        stream=True,
    )
    for stream_resp in response:
        # Do stuff...
        if thing_happens:
            response.close()  # closes the underlying requests connection
            break
except Exception as e:
    pass

I came up with the same solution, and it works on my end too. I'm sure the server side will necessarily generate at least a couple more tokens than the client receives. What I was hoping for was some assertion from OpenAI (or from someone who has done meticulous testing with this method, to determine whether they are charged for the total sum of tokens that WOULD have been generated) that once the connection is closed from the client side, the server actually stops generating tokens.
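For what it's worth, closing the response explicitly before leaving the loop makes the client's intent unambiguous: the underlying HTTP connection is torn down immediately rather than whenever the generator happens to be garbage-collected. Here is a minimal sketch of the pattern; `FakeStream` and `consume_until` are my own stand-ins (not API names) so the close-on-break behavior can be checked without spending tokens:

```python
class FakeStream:
    """Stand-in for the streamed API response: an iterator with a close()."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.closed:
            raise StopIteration
        return next(self._chunks)

    def close(self):
        # The real response object closes the underlying HTTP connection here.
        self.closed = True


def consume_until(stream, should_stop):
    """Read chunks until should_stop(chunk) is true, then close the stream."""
    received = []
    try:
        for chunk in stream:
            received.append(chunk)
            if should_stop(chunk):
                break
    finally:
        stream.close()  # Tear down the connection promptly, even on errors.
    return received


stream = FakeStream(["a", "b", "LOOP", "c", "d"])
out = consume_until(stream, lambda c: c == "LOOP")
print(out)            # ['a', 'b', 'LOOP']
print(stream.closed)  # True
```

With the real API, `stream` would be the object returned by `openai.Completion.create(..., stream=True)`, which exposes the same iterator-plus-`close()` shape.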


did anyone figure this out? (if it actually stops generating tokens in the backend)


I made a simple test of @thehunmonkgroup's solution.

I made a call to the gpt-3.5-turbo model with the input:

Please introduce GPT model structure as detail as possible

And let the API print all the tokens. The statistics from the OpenAI usage page are (I am a new user and am not allowed to post media, so I can only copy the result):
17 prompt + 441 completion = 458 tokens

After that, I stopped the generation when the number of tokens received was 9; the result was:
17 prompt + 27 completion = 44 tokens

So roughly 18 extra tokens were generated after I stopped.

Then I stopped the generation when the count reached 100; the result was:
17 prompt + 111 completion = 128 tokens

So I think the solution works well, at the cost of an extra 10~20 tokens each time.
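To reproduce this kind of measurement, the stop condition only needs to count received chunks. Here is a sketch of that counting logic, factored out so it works on any chunk iterator; `take_then_close` is my own name, not an API function, and a plain generator stands in for the real stream so the logic can be checked offline:

```python
def take_then_close(stream, limit):
    """Collect at most `limit` chunks, then close the stream to stop generation."""
    chunks = []
    try:
        for chunk in stream:
            chunks.append(chunk)
            if len(chunks) >= limit:
                break  # Stop here; tokens already in flight are still billed.
    finally:
        close = getattr(stream, "close", None)
        if close:
            close()  # Closing the connection is what stops the server.
    return chunks


# With the real API, `stream` would be the response from
# openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=..., stream=True).
received = take_then_close((f"tok{i}" for i in range(1000)), limit=9)
print(len(received))  # 9
```

Comparing `len(received)` against the completion-token count on the usage page then gives the per-interruption overhead, as in the experiment above.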


Excellent deductive, data-driven results, thank you for posting them :smiley:

Stopping the streaming data of the API

This is the main function I call via the API to ask a question:

def ask_question(request):
    thread = ChatThread(request)
    thread_references[thread.getName()] = thread  # Store a reference to the thread
    thread.start()
    # Save the thread name to the database (see the Django model below)

I use a Django model to store the thread name:

thread_reference = ThreadReference(thread_name=thread.getName())

The thread class

import json
import threading
import uuid

import openai
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer


class ChatThread(threading.Thread):

    def __init__(self, request):
        self.request = request
        self.response = None  # Store the response object so stop() can call response.close()
        self.stop_event = threading.Event()  # Event object to signal the thread to stop
        self.thread_name = str(uuid.uuid4())  # Generate a unique thread name using a UUID
        threading.Thread.__init__(self, name=self.thread_name)

    def run(self):
        channel_layer = get_channel_layer()
        i = 0
        generated_content = []
        try:
            # Simple streaming ChatCompletion request
            self.response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{'role': 'user', 'content':'question', '')}],
                stream=True,
            )
            for chunk in self.response:
                # time.sleep(3)  # you can use this to slow things down for testing
                content = chunk["choices"][0]["delta"].get("content", "")
                finish_reason = chunk["choices"][0].get("finish_reason", "")
                if finish_reason:
                    self.stop_event.set()  # Set the event to stop the thread
                    data = {"current_total": i, "content": "@@" + finish_reason + "@@"}
                else:
                    data = {"current_total": i, "content": content}
                generated_content.append(content)
                async_to_sync(channel_layer.group_send)(
                    self.thread_name,
                    {
                        # 'type' is the function called from consumers;
                        # 'value' is what the send_notification function in the consumer receives
                        'type': 'send_notification',
                        'value': json.dumps(data),
                    },
                )
                i += 1
            combined_content = ''.join(generated_content)  # full text, if you need it
        except Exception as e:
            pass  # closing the response mid-stream raises here; safe to ignore

    def stop(self):
        if self.response:
            self.response.close()  # Close the response if it exists

To close the response and the thread, I use an API view in Django REST framework:

def stop_thread(request):
    thread_name ='thread_name')  # thread_name is the UUID saved in your database

    # Get the thread reference from the database
    thread_reference = get_object_or_404(ThreadReference, thread_name=thread_name)

    if thread_name in thread_references:
        thread_references[thread_name].stop()  # close the response to stop the stream
        del thread_references[thread_name]

        # thread_reference.delete()  # delete the thread name from the database if you use Django

    return Response({'message': f'Thread {thread_name} has been stopped.'})

FWIW, I repeated @Ashton1998’s experiment with curl -N and got the same results. So, there is no special event sent to the API, and just closing the server-sent event stream from the client side is sufficient to stop the generation.
