Modifying Mid-Response Generated Output in GPT-3.5 / GPT-4

I am currently facing a challenge with GPT-3.5 and GPT-4 models. As part of my use-case, I need to modify a part of the response that the model is generating while it is in the process of generating it. The goal is to have the model continue generating the remainder of its response, taking into account the modifications I’ve applied.

However, I am encountering an issue. When I intervene in the middle of the response generation, make a change to the last generated token for example, and then request the model to continue, it often restarts the sentence from the beginning, ignoring the change I made. Here’s an example to illustrate the issue:

Initial Interaction:

  • User: “Hi, What’s 3^1.4?”

  • GPT-4 (temperature=0): “3 raised to the power of 1.4 is”

At this point, I stop the model, and modify the last token "is " with "should ". I then call the model to continue its response.

Continued Interaction:

  • GPT-4: “3 raised to the power of 1.4 is approximately …”

As you can see, instead of continuing from where it was left off, the model begins the sentence again. While there are instances when the model does continue from where it was interrupted, it seems to be inconsistent. My hypothesis is that the model is more likely to restart the sentence when the token I’ve modified or inserted is not a likely token to occur in the given context (which would have caused bad performance anyway and OpenAI tuned it that way?).

My question is: Is there a way to compel the model to continue its response after the modifications have been made, regardless of the likelihood of injected tokens? I tried to play with frequency and presence penalties and while it improves the situation in some cases, even at their maximum values, they don’t seem to work reliably so I’m probably missing a point on how the output gets generated. Any insights or suggestions would be greatly appreciated.

One of the use cases is when you want to inject an output of an API call to the model’s output while making the user experience as realtime as possible (and you want the model to also be aware of the injected data). For example:

  • GPT-4: "The weather today is {{ get_weather() }} " → “24 degrees…” - the {{ get_weather() }} needs to get modified into “24” in realtime as the model will continue using that data in its “same” response).

Thank you for your assistance in advance.

Hi @max11rsl

Welcome to the community.

what code are you doing this with?

I tried doing it both in the playground and in python. Ultimately I’m doing the parsing through python.

You’ll need to combine it with an instruction.

“Continue this sentence, without repeating it”, and so on.

I have, in my own code, this, to try to “continue” after a length has been reached, for another reason:

{role: 'assistant', content: 'content from before'},
{role: 'assistant', content: `Oops! Looks like my last message was cut off.

I will continue from where I left off in my response after my last line as if nothing happened, ensuring I will not repeat anything, here is the rest of my response:`

Works quite well so far ! Many thanks @firtina.

1 Like

Have you seen the recently announced microsoft ‘Guidance’? Claims to do this and much more. Works with both GPT and HG Transformers. Uses the Azure API for gpt, but since they have released the source, I imagine you could modify to use OpenAI api.
(Available on github)


That sounds quite interesting @bruce.dambrosio , can you please share a link to the repo?


Wow ! I’ve been reading this for the past 2 hours. There is quite a depth to it and I’m glad you shared this as I was about to get started with pretty much a simpler version of the same project. This is super interesting.


One thing I found frustrating so far (I haven’t tried converting the gpt interface yet) is that when trying to test it with a the Transformer option, it seems to take forever to load the model (even after it has been downloaded and cached). It gets there eventually, but wow. I’m planning to next try converting the gpt api to use openAI instead of Azure, and/or converting the Transformer interface to use a fastchat server I have running.
I’ll post progress here.

Sounds good, I’ll be exploring the functionality as well in the meantime. Curious if it’s in any way possible to achieve guidance acceleration type of caching behaviour from OpenAI api because with GPT4, the speed optimizations to me are extremely important.

my error, it looks like openai is directly supported, with this caveat

“When calling OpenAI chat models you must generate only directly inside the assistant role! The OpenAI API does not currently support partial assistant prompting.”

1 Like