The Assistants API can ask the developer to make function calls, and there's a simple way to submit the results back. When that happens, two model calls are made. How are we billed for those calls? And what chat history and prompt does the second call use?
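For concreteness, here's roughly the flow I mean — a minimal sketch with the openai Python SDK, where the thread/assistant IDs are placeholders and `get_weather` is a hypothetical local function:

```python
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> dict:
    # Hypothetical local function the assistant is allowed to call.
    return {"city": city, "temp_c": 21}

# First model call: the run pauses when the model requests a function call.
run = client.beta.threads.runs.create_and_poll(
    thread_id="thread_...",   # placeholder thread ID
    assistant_id="asst_...",  # placeholder assistant ID
)

if run.status == "requires_action":
    tool_outputs = []
    for call in run.required_action.submit_tool_outputs.tool_calls:
        args = json.loads(call.function.arguments)
        tool_outputs.append(
            {"tool_call_id": call.id, "output": json.dumps(get_weather(**args))}
        )
    # Submitting the outputs triggers the second model call.
    run = client.beta.threads.runs.submit_tool_outputs_and_poll(
        thread_id=run.thread_id,
        run_id=run.id,
        tool_outputs=tool_outputs,
    )
```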
Does (stateful) Assistants threading include something smart to prevent reprocessing of the chat history? In other words, is it faster than the bare-bones (stateless) approach of pruning the chat history ourselves and resubmitting it with each Chat Completions request?
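By "pruning ourselves" I mean something like this sketch (the history, cap, and model name are just examples):

```python
from openai import OpenAI

client = OpenAI()

# Full conversation so far; in practice this grows with every turn.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Summarize our chat so far."},
]

# Prune: keep the system message plus the most recent N messages.
MAX_MESSAGES = 2
pruned = [history[0]] + history[-MAX_MESSAGES:]

response = client.chat.completions.create(model="gpt-4o", messages=pruned)
print(response.choices[0].message.content)
```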
Welcome to the community!
I suspect no one really knows. I’m fairly certain you always get billed for all tokens in or out, regardless of what the model does.
Does (stateful) Assistants threading include something smart to prevent reprocessing of the chat history?
I don’t think so. You have the run parameters max_prompt_tokens and truncation_strategy, but there’s nothing “smart” going on in the background to save you tokens.
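For what it's worth, both are set per run — a minimal sketch, assuming the openai Python SDK and placeholder IDs (the specific values are just examples):

```python
from openai import OpenAI

client = OpenAI()

run = client.beta.threads.runs.create_and_poll(
    thread_id="thread_...",   # placeholder thread ID
    assistant_id="asst_...",  # placeholder assistant ID
    # Cap the total prompt tokens this run may consume across its model calls.
    max_prompt_tokens=2000,
    # Or truncate the thread to the most recent messages before each model call.
    truncation_strategy={"type": "last_messages", "last_messages": 10},
)
```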
In my experience, Assistants are definitely slower.