I am trying to maintain a conversation with an assistant in my project.
But I am not sure if I understand the “runs” context correctly.
If I create a message and then create a blocking run, the assistant’s response is added to the message list, and I can retrieve it by listing the messages.
If I create a streaming run instead, my expectation was that the stream would stay open until I close it, and that I would receive subsequent messages from that run.
However, after the first received message, the stream is closed by the API.
Does that mean I have to create a new run each time I create a new message?
If that’s correct, what’s the advantage of using a streaming run instead of a blocking run?
You can think of a run like executing the actual command that makes the language model whir.
A “blocking” run basically means “Give me the output when you’re all finished.”
A “streaming” run means “Give it to me slowly, in little bits, as you’re producing the output.”
When a stream is closed, it means it finished producing the output, and the command is finished. You can’t stream information when there’s nothing left to stream.
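To make the difference concrete, here is a minimal sketch in plain Python. It does not call the real API: `fake_model`, `blocking_run`, and `streaming_run` are hypothetical names made up for illustration, and the "model" just echoes the prompt word by word. The point is only that both kinds of run produce the same output, and that a stream ends by itself once there is nothing left to produce.

```python
from typing import Iterator


def fake_model(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for the model: yields the reply one word at a time."""
    reply = f"You said: {prompt}"
    for word in reply.split(" "):
        yield word + " "


def blocking_run(prompt: str) -> str:
    """Wait until all output exists, then return it in one piece."""
    return "".join(fake_model(prompt)).strip()


def streaming_run(prompt: str) -> Iterator[str]:
    """Hand over each piece as soon as it is produced."""
    yield from fake_model(prompt)


# Blocking: nothing to show until the whole reply is ready.
full_reply = blocking_run("hello")

# Streaming: pieces arrive one by one, then the stream ends on its own.
stream = streaming_run("hello")
chunks = list(stream)

# The finished stream has nothing more to give.
leftover = next(stream, None)
```

Here `leftover` is `None` because the generator is exhausted, which is the toy equivalent of the API closing the stream when the run completes.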
Technically, if you want the door open all the time, so information could stream through at any moment, that would be a persistent connection such as a WebSocket. A run is not that kind of connection, and it doesn’t need to be: each run produces one response and then ends. So yes, each new message needs a new run.
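The advantage or disadvantage of each comes down to how soon you want to start receiving the output: a streaming run lets you show the response as it is being generated, while a blocking run makes you wait for the whole thing.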
So, to put all this together: threads (and the messages within them) capture and package the data so it can be sent to the language model for processing. Executing a run is what actually produces an output. You can retrieve this output at the end, when it’s finished, which is a blocking run, or you can retrieve it as it’s being made, which is a streaming run.
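The relationship between a thread and its runs can be sketched the same way. This is again a toy model, not the real SDK: `ToyThread`, `add_user_message`, and `run` are hypothetical names, and the reply is a simple echo. It shows the pattern of one new run per new message, with the thread accumulating the conversation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ToyThread:
    """Hypothetical stand-in for a thread: an ordered list of (role, text)."""
    messages: List[Tuple[str, str]] = field(default_factory=list)

    def add_user_message(self, text: str) -> None:
        self.messages.append(("user", text))

    def run(self) -> Iterator[str]:
        """One run: read the thread, stream a reply, store it once finished."""
        last_user = [text for role, text in self.messages if role == "user"][-1]
        reply = f"echo {last_user}"
        for word in reply.split(" "):
            yield word
        # Reached only after the stream has been fully consumed.
        self.messages.append(("assistant", reply))


thread = ToyThread()
for text in ["first", "second"]:
    thread.add_user_message(text)
    # A new run for every new message; each run ends when its reply is done.
    streamed = " ".join(thread.run())
```

After the loop, the thread holds four messages (two from the user, two from the assistant), which mirrors retrieving the conversation by listing the messages.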