There are a couple of issues here from a generic use case perspective.
The first: in general, the architecture should not depend on an explicit user trigger (“press button”). In your use case, that might be legitimate. However, imagine porting it to WhatsApp, where you have no control over when the user decides to send more text.
The second: in my opinion, chats should be plain old text, not metadata embedded within text that then has to be parsed out. That is needless complexity.
The solution is debouncing: simply put, delay sending to chat completion for a short duration so that you can check whether additional messages arrive first. Debouncing WAS difficult until the advent of the Assistants API. You can use the Assistants API to convert to Chat Completion with debouncing. I will whip up a quick PoC in a day or so.
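To make the debouncing idea concrete, here is a minimal sketch in Python using only `asyncio`. It is not the PoC mentioned above and does not touch the Assistants API at all; it only shows the "wait for a quiet period, then send everything at once" part. The names `MessageDebouncer`, `delay_seconds`, and `on_flush` are made up for illustration, and `on_flush` is a stand-in for wherever you would actually call chat completion.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical callback type: receives (chat_id, combined_text).
FlushCallback = Callable[[str, str], Awaitable[None]]


class MessageDebouncer:
    """Buffers messages per chat and flushes them after a quiet period.

    Illustrative only: on_flush is a placeholder for the code that
    would send the combined text to chat completion.
    """

    def __init__(self, delay_seconds: float, on_flush: FlushCallback) -> None:
        self.delay_seconds = delay_seconds
        self.on_flush = on_flush
        self._buffers: Dict[str, List[str]] = {}
        self._timers: Dict[str, asyncio.Task] = {}

    def add_message(self, chat_id: str, text: str) -> None:
        """Record a message and restart the quiet-period timer.

        Must be called from inside a running event loop.
        """
        self._buffers.setdefault(chat_id, []).append(text)
        pending = self._timers.get(chat_id)
        if pending is not None:
            pending.cancel()  # a newer message arrived, so wait again
        self._timers[chat_id] = asyncio.create_task(self._flush_later(chat_id))

    async def _flush_later(self, chat_id: str) -> None:
        # If this task is cancelled during the sleep, nothing below runs.
        await asyncio.sleep(self.delay_seconds)
        # Detach the buffer and timer before awaiting the callback, so a
        # message arriving mid-flush starts a fresh batch instead.
        messages = self._buffers.pop(chat_id, [])
        self._timers.pop(chat_id, None)
        if messages:
            await self.on_flush(chat_id, "\n".join(messages))


async def _demo() -> None:
    async def fake_send(chat_id: str, text: str) -> None:
        # Stand-in for the real chat completion call.
        print(f"[{chat_id}] sending one combined request:\n{text}")

    debouncer = MessageDebouncer(delay_seconds=0.5, on_flush=fake_send)
    debouncer.add_message("user-1", "hi")
    await asyncio.sleep(0.1)
    debouncer.add_message("user-1", "one more thing...")
    await asyncio.sleep(1.0)  # quiet period elapses, one flush happens


if __name__ == "__main__":
    asyncio.run(_demo())
```

The messages stay plain text the whole way through, and nothing depends on the user pressing a button: the flush is driven purely by the pause in incoming messages. The delay is a judgment call; too short and you split one thought into two requests, too long and the reply feels sluggish.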
In the meantime, here’s my high-level post (Switching from Assistants API to Chat Completion? - #2 by icdev2dev)