To my understanding, when you add messages to a thread and run it, the context sent with each inference grows until it hits the model's maximum context size, after which the window starts shifting and older messages in the thread are dropped.
Worst case, then, every new message you add means resending close to 128k tokens each time?
I was working on an awesome coding assistant and was so engaged that it did not take me long to burn through $30.

I think round-tripping code snippets and so on made the context quite large. I know the price is 6x lower now, but for hobby projects it's still quite expensive. (If input really costs roughly $0.01 per 1K tokens, a full 128k-token run is about $1.28, so $30 only buys around 23 exchanges.)
Is there a way to specify the token window when you run, so it does not send every message in the thread but only what you specify, like the last 10 messages?
That would be a useful feature. I could create a new thread each time, but since we can't list threads, I have no idea where all the threads are going unless I keep track of the IDs myself.
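In case it helps anyone else, here is a minimal sketch of how I could track them myself, assuming the openai Python SDK v1; the `threads.json` file name and the `label` argument are just my own invention:

```python
# Sketch: record each thread ID in a local file when creating it,
# since the API has no endpoint for listing threads.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LEDGER = Path("threads.json")  # arbitrary local file for my own bookkeeping

def new_tracked_thread(label: str) -> str:
    """Create a thread and remember its ID under a human-readable label."""
    thread = client.beta.threads.create()
    ledger = json.loads(LEDGER.read_text()) if LEDGER.exists() else {}
    ledger[thread.id] = label
    LEDGER.write_text(json.dumps(ledger, indent=2))
    return thread.id
```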
Anyway, even though it's an expensive hobby, I am loving the Assistants API! Very cool. I wonder, though, if I should just be using the Chat Completions endpoint instead: the most impressive thing I've found is functions, and maybe I'd have better control if I handled function support myself with Chat Completions?
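For the record, the manual-window version I have in mind would look something like this (a minimal sketch, again assuming the openai Python SDK v1; the model name, window size, and system prompt are placeholders):

```python
# Sketch: keep the full conversation locally, but only send the system
# prompt plus the last WINDOW messages with each Chat Completions request.
from openai import OpenAI

client = OpenAI()

MODEL = "gpt-4-1106-preview"  # placeholder: any chat model should work
WINDOW = 10                   # only the last 10 messages get resent

system_prompt = {"role": "system", "content": "You are a coding assistant."}
history = []  # full local transcript; I decide how much of it gets sent

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    # Trim to the last WINDOW messages; the system prompt always rides along.
    window = [system_prompt] + history[-WINDOW:]
    response = client.chat.completions.create(model=MODEL, messages=window)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

That way each call pays only for what I choose to send, and the simple slice could later be swapped for a token-count-based trim (e.g. with tiktoken).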