Restarting partially completed chat completion API calls

If I use the streaming API and a given query stops either because I run out of tokens or a timeout (or I stop it because is running slowly), is it possible to restart the query using the partial response I already have? In the case of running out of tokens in a 8K GPT4 model, I could potentially just resume on the 32K GPT4 model. This seems like it should be conceptually possible, but I can’t see an obvious way to do this with the existing API.

Yes, if you place the assistant response as the last message with no user input, the AI will usually continue upon that if it looks unfinished in the context of the question posed.

sometimes, but sometimes it will just ignore it and start from scratch, depending on the model.

sometimes you need to add an extra instruction to tell the model to continue where it left off.

Here for example, away it goes again, even though I truncated a ChatGPT response at the clean end of a section 2/3 of the way in:

Or force the AI to keep writing until it has no more text to predict.

The only side-effect is the continuation is trained on briefness by the previous outputs.

The side effect of prompting is the possibility of “Sure, let me continue where we left off before the interruption you report.”

(btw, this is having gpt-3.5-turbo (latest) complete on GPT-4 writing. Being informed by previous verbosity. Prompt gpt-3.5-turbo and you have vastly reduced quality.

gpt-3.5-turbo complete


  • Provide a diet of whole fish, such as herring or capelin, as well as squid and krill.
  • Feed them twice a day to ensure they get enough nutrients.


  • Maintain a cold environment with a temperature between 32-45°F (0-7°C).
  • Keep their habitat clean by regularly changing the water and providing a clean area for them to rest.


  • Provide opportunities for swimming and diving to mimic their natural behaviors.
  • Offer toys and objects to play with to keep them mentally stimulated.


  • Regularly check for signs of illness such as changes in behavior or appetite.
  • Consult with a veterinarian who specializes in exotic animals for routine check-ups and medical care.


  • Consider getting a companion for your penguin to prevent loneliness.
  • Interact with your penguin regularly to build trust and bond with them.

Legal Considerations:

  • Check local regulations to ensure you are allowed to keep a penguin as a pet.
  • Obtain any necessary permits for owning an exotic animal.

Overall, caring for a pet penguin requires dedication, proper nutrition, a suitable habitat, enrichment, healthcare, socialization, and adherence to legal guidelines.

that’s cheating* :laughing:

if it’s cut off mid-sentence it sometimes tends to start over

*but it could be an interesting strategy, deleting everything until the last period or something :thinking:

Conceptually it should be possible to stop the generation of the AI response at any point and restart such that the final results is no different than having generated the entire response in one go (modulo differences with random number generators). It appears that issuing a query that ends with an assistant response isn’t really the same thing though.

it should be mentioned (but you probably know this) that it IS possible (and trivial - the default behavior) with the legacy completion endpoints

1 Like

I assume legacy completions will eventually go away though?

Interesting perspective. I found a point where gpt-3.5-turbo-0125 or gpt-4-turbo (0125) would start over in the penguin prose.

GPT-4-0613, no hiccup. gpt-3.5-turbo-0613, no problem.

Latest models broke completion (among other deoptimizations) where the deprecations guide specifically recommends chat as an edit replacement, and previously had gpt-4 pointed at also to replace completions.

I found the most performative against the new behavior is a user message “[continue AI completion]”

However, the messages being wrapped in a container for “ChatML”, and an unseen “assistant” prompt, means the flow is broken up and the AI is re-prompted.

1 Like

Yeah. But what I mean is that under the hood the ChatML query gets translated into a sequence of tokens with special tokens at the beginning/end of each section. If you had the model running locally, you could have design the infrastructure to not have the end token for the last assistant and thus restart with identical behavior.

I hope they make a come back! :laughing:

I want gpt-4-instruct!

all most of the open source models are instruct based, and do support that behavior.

My main motivation for this is I sometimes see queries that are outliers and take much longer than they should to complete. Ideally I would have the responses streaming and stop the slow queries and restart it from where it left off. I’d like to do this programmatically in a library. Right now, the best I can do is restart the entire query, but that takes longer and cost more.

1 Like

Why stop the slow query? The new one adds latency before you get its next token; you can send it in the background and see if it is on a generation path to catch up and surpass. Then replace output when its going to win.

Just need a good determiner for the edge case of “too slow, faster will be obtained”, so you aren’t always paying for two inputs.

1 Like