Multiple messages per turn in Chat Completions API

Hello!

I’m building a chatbot using the Chat Completions API.

The format of a conversation is usually like this:

System: [system prompt]
User: Can you write an email for me?
Assistant: Sure! What do you want to talk about?
...

However, I’m looking to model more natural conversations - such as the ones you’d have when you text a friend. Something like this:

System: [system prompt]
User: Hey!
User: How are you?
Assistant: It's a great day today
Assistant: I've been doing this and that
Assistant: What about you?
...

The challenge I’m facing is that the Chat Completions API generates only one completion at a time. Even if I call the endpoint multiple times, when should I stop generating new assistant messages?

Some ideas I had:

  • Getting the API to generate completions as a user too. If the next message says “user”, then I know there’s a turn change and I simply discard that message (a rough sketch follows this list).
  • Training an encoder model to predict turn changes. It would get the whole conversation as an input and output True/False to indicate whether there’s a turn change (=don’t generate more assistant messages - wait for the user’s response).
  • Sentiment analysis. If the assistant’s response is an enquiry/question, then assume there’s a turn change.
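
For the first idea, something like the sketch below could work, assuming the system prompt instructs the model to prefix every message with its speaker (“User:” / “Assistant:”); the model name and loop bound here are just placeholders:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical sketch: keep generating assistant messages until the model
    # predicts a user turn, then discard that last completion. Assumes the
    # system prompt tells the model to prefix each message with its speaker.
    def generate_assistant_turn(history, max_messages=5):
        replies = []
        for _ in range(max_messages):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=history,
            )
            text = resp.choices[0].message.content.strip()
            if text.lower().startswith("user:"):
                break  # turn change predicted: discard this message and stop
            replies.append(text)
            history = history + [{"role": "assistant", "content": text}]
        return replies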

Those are some workarounds, but maybe there’s another (simpler) way to do it? I’d love to hear your ideas!

Welcome to the community!

Sounds like a cool project.

My question is, how would this scenario

System: [system prompt]
User: Hey!
User: How are you?
Assistant: It's a great day today
...

ever be able to happen? (the user getting two turns) :thinking:

does your agent have an artificial cold start time?


All that said, multiple agent turns seem pretty straightforward: just fake them.

you can insert some control sequences into the output. that would even allow you to fine-tune the behavior if you wanted to.

It's a great day today
¬<<sleep 2s>>
I've been doing this and that
¬<<sleep 1s>>
What about you?

Including a rare symbol like ¬ will let your parser know when to stop streaming and when to start parsing.
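
To make that concrete, a minimal parser sketch for that marker format (the ¬<<sleep Ns>> syntax and the function names are just illustrations):

    import re
    import time

    # Matches markers like ¬<<sleep 2s>> and captures the number of seconds.
    MARKER = re.compile(r"¬<<sleep\s+(\d+(?:\.\d+)?)s>>")

    def split_messages(completion):
        """Split one completion into (message, pause_seconds) pairs."""
        parts = MARKER.split(completion)
        # re.split with a capture group alternates: message, pause, message, ...
        messages = [m.strip() for m in parts[::2]]
        pauses = [float(p) for p in parts[1::2]] + [0.0]
        return list(zip(messages, pauses))

    def send_with_pauses(completion, send):
        for message, pause in split_messages(completion):
            if message:
                send(message)
            time.sleep(pause)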

something along those lines.

what do you think?


Thanks for the smart ideas!

To answer your first question, the user will write any number of messages and then press a button/write a special command to “trigger completion”. This could also be enhanced by using a cold start time, as you say.


On the second proposal, I do think that’s a nice way to solve it - the only issue is that I would need to fine-tune the model (again) to have it output <new-message> tokens. Essentially, we’d be turning the system into a one-message-per-turn architecture, which would work.

If there are no better possibilities, I’ll probably end up doing what you suggested:

System: [system prompt]
User: Hey! <new-message> How are you?
Assistant: It's a great day today <new-message> I've been doing this and that <new-message> What about you?
...
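
Client-side, the splitting itself would then be trivial (a sketch, assuming the fine-tuned model reliably emits the marker):

    # Split one completion into individual chat messages on the marker.
    def split_turn(completion):
        return [m.strip() for m in completion.split("<new-message>") if m.strip()]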

My only additional concern would be the <new-message> marker confusing the model. In other models, it’s possible to define custom tokens that can be added to the foundation model’s vocabulary (maybe that’d have been a cleaner alternative). However, I can’t do this here.

So I’ll need to introduce some kind of special token like the one you suggested, or the one I’ve used in the example. It’s probably best to stick to a single token (instead of mine, which would likely be broken down into several tokens), both for cost reasons and because I’d imagine a single token would cause a smaller drop in performance when starting the fine-tuning process. What do you think about this?
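
For what it’s worth, you can check how a candidate separator tokenizes with the tiktoken library before committing to it:

    import tiktoken

    # cl100k_base is the encoding used by the gpt-3.5-turbo/gpt-4 family.
    enc = tiktoken.get_encoding("cl100k_base")
    for sep in ["<new-message>", "¬"]:
        ids = enc.encode(sep)
        print(repr(sep), len(ids), "tokens:", ids)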

I’d skip the fine-tuning and try to work with prompts. It was just an idea.

And if you don’t plan on using instant streaming, then we can skip the special token too and opt for using a schema.

you are chatterbot. you are simulating realistic, natural chat conversations. A participant can send multiple messages in a row, in the following schema:
{string|number}[]

the string represents a message, the number a pause (in seconds).

here’s an example:

user message:
["hello."]
your response:
["Hi! How are you?", 3, "It's a great day today, isn't it?"]

Your answer always needs to be JSON compliant. Always start your answer with [

something like that.
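
On the client side, handling that schema could look something like this (a sketch; assumes the model sticks to the array format):

    import json
    import time

    def play_response(raw, send):
        # Expects a JSON array of strings (messages) and numbers (pauses).
        try:
            items = json.loads(raw)
        except json.JSONDecodeError:
            send(raw)  # fall back: treat the whole completion as one message
            return
        for item in items:
            if isinstance(item, (int, float)):
                time.sleep(item)   # a number is a pause, in seconds
            else:
                send(str(item))    # anything else is a chat message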

There are a couple of issues here from a generic use case perspective.

The first: in general, the architecture would be smarter if it weren’t limited to a user trigger (“press button”). In your use case that might be legitimate, but imagine porting it to WhatsApp, where you have no control over when the user decides to enter new text.

The second: in my opinion, chats should be plain old text, not metadata embedded within text that then has to be parsed out. Needless complexity.

The solution is debouncing (simply put: delay sending to chat completion for a short duration, so that you can check whether additional messages have arrived). Debouncing WAS difficult until the advent of the Assistants API. You can use the Assistants API to convert to chat completion with debouncing. I will whip up a quick PoC in a day or so.
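
Roughly, the idea looks like this (a minimal asyncio sketch; class and callback names are illustrative, not the promised PoC):

    import asyncio

    # Minimal debouncing sketch: buffer incoming user messages and only fire
    # the chat-completion call after `quiet_seconds` of silence.
    class Debouncer:
        def __init__(self, on_flush, quiet_seconds=2.0):
            self.on_flush = on_flush        # async callback(list_of_messages)
            self.quiet_seconds = quiet_seconds
            self.buffer = []
            self._task = None

        def add_message(self, text):
            self.buffer.append(text)
            if self._task is not None:      # a new message resets the timer
                self._task.cancel()
            self._task = asyncio.create_task(self._wait_and_flush())

        async def _wait_and_flush(self):
            try:
                await asyncio.sleep(self.quiet_seconds)
            except asyncio.CancelledError:
                return                      # superseded by a newer message
            messages, self.buffer = self.buffer, []
            await self.on_flush(messages)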

In the meantime, here’s my high-level post: (Switching from Assistants API to Chat Completion? - #2 by icdev2dev)