N chat completion choices with follow-up replies

I want to get the following chats:

user: [large first prompt]
assistant: [first response]
user: [second short follow-up prompt, static, doesn’t depend on the contents of the first response]
assistant: [second response]

I also want to have N choices for the first response, but, let’s say, only a single choice (N=1) for the second response.

Currently, there is no way to achieve this without re-processing the large first prompt, which is a waste.
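
To make the waste concrete, here is a minimal sketch of the only way to get this today, written against the pre-1.0 `openai` Python client (it assumes `OPENAI_API_KEY` is set; the model name, N, and prompts are placeholders). The large prompt is sent, and re-processed, N+1 times in total:

```python
import openai

LARGE_PROMPT = "[large first prompt]"          # placeholder
FOLLOW_UP = "[second short follow-up prompt]"  # placeholder; static
N = 4                                          # arbitrary choice count

# First call: N choices for the first response; the large prompt is
# processed once here.
first = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": LARGE_PROMPT}],
    n=N,
)

# Follow-up: one call per choice, and every call re-sends (and the model
# re-processes) the large prompt, even though the new user turn is tiny.
second_responses = []
for choice in first.choices:
    second = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": LARGE_PROMPT},
            {"role": "assistant", "content": choice.message.content},
            {"role": "user", "content": FOLLOW_UP},
        ],
    )
    second_responses.append(second.choices[0].message.content)
```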

The way to think of this is that each API call is stateless: OpenAI does not keep track of your particular API calls in any computationally meaningful way. Large language models need to be given the entire conversation history each time they are called.

I understand, of course, that this is not implemented in the current API; I am just pointing out that this would be a nice feature to have. Also, implementing this feature doesn't require the API to become stateful: it just needs a way to send more complex requests, in which I specify the follow-up prompt beforehand.
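
For illustration, such a request could look something like the sketch below. Everything here is hypothetical, in particular the `follow_up` field, which exists in no real API; it is only meant to show the shape of the feature.

```python
# Hypothetical request body; the "follow_up" field does not exist in any
# real API and is shown only to illustrate the proposed feature.
request = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "[large first prompt]"},
    ],
    "n": 4,  # N choices for the first response
    "follow_up": {
        "message": {"role": "user", "content": "[second short follow-up prompt]"},
        "n": 1,  # a single second response per first-response choice
    },
}
# A server receiving this could process the large prompt once, branch into
# N first responses, append the static follow-up to each branch, and return
# everything from one stateless request.
```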

You do not appear to know how the models work.

What you are describing is not possible with the current architecture.

The models process the entire context all at once; that is the only way they work.

@elmstedt of course this is possible, see e.g. github.com/guidance-ai/guidance#guidance-acceleration-notebook
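
For context, a guidance program interleaves fixed prompt text with generation points, so the follow-up turn is written down before the first response is generated. A rough sketch, assuming the guidance 0.0.x templated chat syntax from around the time of this thread (the model name is a placeholder):

```python
import guidance

# Placeholder model; note that guidance's acceleration trick itself only
# applies to locally hosted models, not to the OpenAI backend.
guidance.llm = guidance.llms.OpenAI("gpt-3.5-turbo")

# One program, two generation points: the follow-up user turn is declared
# up front, before the first assistant response has been generated.
program = guidance("""
{{#user}}[large first prompt]{{/user}}
{{#assistant}}{{gen 'first'}}{{/assistant}}
{{#user}}[second short follow-up prompt]{{/user}}
{{#assistant}}{{gen 'second'}}{{/assistant}}
""")

result = program()
print(result["first"], result["second"])
```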

First, thank you for linking that project. It looks like an interesting project, so I'm reposting it here as a clickable link for others: github.com/guidance-ai/guidance#guidance-acceleration-notebook

Having said that, while there are some interesting things in there, I’m not sure it does exactly what you are proposing.

But the project supports the OpenAI models, so if you are convinced it does work the way you think, I would encourage you to experiment with it and report back with examples showing equivalent results with fewer tokens used.

@elmstedt sorry for my repeated low-context responses, which have led to misunderstanding. I could have done better.

This is not implemented in the OpenAI API at the moment. The whole point of my post on this forum board is to suggest that it would be a nice thing for OpenAI to implement.

I meant that the Transformer architecture doesn't preclude such interleaving of prompts and LLM generation. Guidance acceleration currently supports only open-source models because, evidently, no cloud API provider (neither OpenAI nor Anthropic) implements the necessary API. This doesn't require making the API itself stateful; it would stay stateless and just become more complex.
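
To spell out the architectural point: a server that hosts the model can run the large prompt through it once, keep the attention key/value (KV) cache, and decode every branch from that cached prefix, which is the kind of reuse guidance acceleration does locally. A minimal sketch, assuming a Hugging Face transformers causal LM (gpt2 as a stand-in, naive sampling, toy lengths), not how a production server would be written:

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def extend(token_ids, past):
    """Feed only the new tokens, reusing the cached prefix for attention."""
    with torch.no_grad():
        out = model(token_ids, past_key_values=past, use_cache=True)
    return out.logits, out.past_key_values

def decode(logits, past, max_new_tokens=20):
    """Sample a continuation from a cached prefix, never re-reading it."""
    ids = []
    for _ in range(max_new_tokens):
        probs = torch.softmax(logits[:, -1], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids.append(next_id)
        logits, past = extend(next_id, past)
    return torch.cat(ids, dim=-1), past

# 1. Process the large prompt exactly once and keep its KV cache.
prompt = tokenizer("[large first prompt]", return_tensors="pt").input_ids
logits, prefix = extend(prompt, None)

# 2. N first responses, each branching off a copy of the same cached prefix
#    (copied because newer transformers caches are mutated in place).
branches = [decode(logits, copy.deepcopy(prefix)) for _ in range(4)]

# 3. Append the static follow-up to one branch's cache and generate the
#    single second response; the large prompt is still never re-read.
first_ids, branch_cache = branches[0]
follow = tokenizer("[second short follow-up prompt]", return_tensors="pt").input_ids
logits2, cache2 = extend(follow, branch_cache)
second_ids, _ = decode(logits2, cache2)
```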

The guidance developers explicitly say that API providers should implement this in this comment: github.com/guidance-ai/guidance/issues/115#issuecomment-1563378295