Assistants vs streaming API trade-offs

Wondering what the actual trade-offs are.
Base assumptions:
My instructions/RAG content is a text doc of around 500 lines, and it is updated daily.
An average conversation is about 10 user messages.
Here is what I'm currently thinking:
Assistants API:
Stateful, so I just send the new message each time.

From my experience, updating the instructions text every time and managing the same file on git is a pain.

Streaming API:
Stateless, so each new message needs to pass the entire RAG + message history all over again, which matters in terms of cost.

Updating the RAG happens in only one place, and it is sent with every new conversation.

You have a misconception there about cost: Assistants also sends the past chat to the model, and it is less under your control for budgeting.

"Streaming" in the API refers to receiving text as it is generated by the model, instead of waiting for the entire response to finish.
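To make that concrete, here is a minimal sketch of what "streaming" means on the client side. The `fake_stream` generator is a stand-in for the chunk objects a real API client would yield (the real SDK's chunk shape differs); the point is that you consume deltas as they arrive instead of one final string.

```python
def fake_stream():
    # Stand-in for server-sent chunks; each item is one text delta.
    for delta in ["Hel", "lo, ", "wor", "ld!"]:
        yield delta

def consume(stream):
    """Accumulate deltas into the full reply while they arrive."""
    parts = []
    for delta in stream:
        parts.append(delta)      # here you could print(delta, end="") live
    return "".join(parts)

print(consume(fake_stream()))    # the complete text, same as a non-streamed reply
```

Either way the model generates the same tokens; streaming only changes when you see them.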

You are likely referring to the Chat Completions API endpoint, where you send messages directly to the model instead of letting an "agent" do that for you with its server-side conversation management.
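A Chat Completions request is just a flat list of role/content messages sent on every call. A minimal sketch of assembling one (the model name and texts here are made-up placeholders):

```python
def build_request(instructions, history, new_user_message):
    """Assemble the full payload the model sees this turn."""
    messages = [{"role": "system", "content": instructions}]
    messages += history                      # prior user/assistant turns
    messages.append({"role": "user", "content": new_user_message})
    return {"model": "gpt-4o", "messages": messages}

req = build_request(
    "Answer briefly.",
    [{"role": "user", "content": "Hi"},
     {"role": "assistant", "content": "Hello!"}],
    "What does stateless mean?",
)
print(len(req["messages"]))   # 4: system + two history turns + the new message
```

With the openai SDK this dict would be passed to `client.chat.completions.create(**req)`; nothing is remembered server-side between calls.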

RAG, retrieval-augmented generation, refers in practice to placing sections of larger documentation into the AI's context, selected by their similarity to the user input, before the AI starts generating. "Retrieval" means obtaining those sections from a vector database via semantic search.
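A toy sketch of that retrieval step: in a real system the texts and the query are embedded by a model and stored in a vector database, but hand-made vectors are enough to show the similarity lookup. The document names and numbers here are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings for two documentation sections
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
}
query_vec = [0.8, 0.2, 0.0]   # pretend embedding of the user's question

best = max(docs, key=lambda name: cosine(query_vec, docs[name]))
print(best)   # the section that gets placed into context before generation
```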

You can just call this “daily static knowledge” if you would like a variable name for your code.

You can update a single assistant's instructions with one API call. You can also send additional_instructions per run, which could carry your daily knowledge separately from the standard behavior.
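A sketch of those two update paths, assuming the documented Assistants parameters; the IDs below are placeholders. With the openai SDK these dicts would be passed as keyword arguments, e.g. `client.beta.assistants.update(**update_params)` and `client.beta.threads.runs.create(**run_params)`.

```python
daily_knowledge = "Today's catalog: ..."   # your 500-line doc, refreshed daily

# Option 1: persist the new text on the assistant itself (one call per day)
update_params = {
    "assistant_id": "asst_123",
    "instructions": daily_knowledge,
}

# Option 2: keep the assistant's base instructions static and inject the
# daily doc on each run instead
run_params = {
    "thread_id": "thread_456",
    "assistant_id": "asst_123",
    "additional_instructions": daily_knowledge,
}
print(sorted(run_params))
```

Option 1 means one update per day in one place; option 2 keeps the base behavior untouched and versions the daily text per run.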

Then, on Chat Completions, "stateless" means you have full control over everything seen in a conversation turn.
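That control looks something like this sketch: on every call you rebuild the system message from the latest daily doc and trim the history to whatever budget you choose (the limits and texts here are illustrative assumptions).

```python
def make_turn(daily_doc, history, user_msg, max_history=6):
    """Build exactly what the model sees this turn: fresh doc, trimmed history."""
    trimmed = history[-max_history:]          # drop the oldest turns yourself
    return ([{"role": "system", "content": daily_doc}]
            + trimmed
            + [{"role": "user", "content": user_msg}])

history = [{"role": "user", "content": f"q{i}"} for i in range(10)]
msgs = make_turn("v2 of the daily doc", history, "new question")
print(len(msgs))   # 8: system + 6 kept turns + the new message
```

With Assistants, that trimming and doc-refresh logic lives on the server, outside your control; here it is three lines of your own code.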

Hope that clarifies those trade-offs for you.

I sometimes hit the rate limit on the Chat Completions API, since sending the entire conversation with each new message counts as new tokens (correct me if I'm wrong).
In terms of rate limits, isn't it better to use the Assistants API, since it saves the state
and I just have to send a new message on every call?

The AI model is still called with all the tokens needed for the conversation history when you use Assistants.

It is the AI model's context input and output that impact the rate limit (and your bill).

Assistants is oblivious to your rate limit, and it can make multiple calls on its own without delay (such as file search, whose results are sent to the model again). That makes it MORE problematic.
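Here is a rough sketch of why the history hurts rate limits either way: every call re-sends the instructions plus the growing history, and all of it counts as input tokens. The "4 characters per token" figure is a common rough heuristic, not a real tokenizer, and the sizes are made-up assumptions.

```python
def approx_tokens(chars):
    return chars // 4            # crude estimate, good enough to see the trend

instructions_chars = 8000        # roughly a 500-line doc
message_chars = 200              # one user message

per_call = []
history_chars = 0
for _ in range(10):              # 10 user messages per conversation
    request_chars = instructions_chars + history_chars + message_chars
    per_call.append(approx_tokens(request_chars))
    history_chars += message_chars   # the new message joins the history
                                     # (assistant replies omitted for brevity)

print(per_call[0], per_call[-1], sum(per_call))   # 2050 2500 22750
```

Whether you send that payload yourself (Chat Completions) or the server assembles it for you (Assistants), the model is billed and rate-limited on the whole thing every call.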

I thought that too… and I also thought that using the Assistants API would cut costs because I'd be sending only the current conversation and not the original RAG.
Unfortunately that isn't the case.
Think of the Assistants API as a wrapper around the Chat Completions API that adds a state-management layer… that's it…
Under the hood, it still does the same old thing… I think…