I’ve set up an Assistant with one book (a few MB) and one other document (a few KB) as reference files. My instruction set isn’t that large. Now that OpenAI has exposed token counts on Threads, I’m realizing how expensive this has become: some threads with only a few messages back and forth are coming to tens of thousands of tokens. I’ve guesstimated that the cost per message (each way) comes to somewhere between $0.07 and $0.10. That’s crazy.
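For reference, here’s my back-of-the-envelope math (the per-token rates below are my assumption for gpt-4-turbo-class pricing at the time; check the current price sheet before trusting the exact figures):

```python
# Rough per-run cost estimate. The rates are assumed
# gpt-4-turbo-era pricing ($0.01 / 1K input, $0.03 / 1K output);
# verify against OpenAI's current price sheet.
INPUT_PER_1K = 0.01
OUTPUT_PER_1K = 0.03

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one run, given the token counts shown on the thread."""
    return (input_tokens / 1000) * INPUT_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PER_1K

# A run that re-sends ~7K tokens of thread context plus a ~1K-token reply:
print(f"${run_cost(7000, 1000):.2f}")  # roughly $0.10 per message
```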
Anyone else having issues? Am I doing something wrong here?
That’s one of the downsides of the current Assistants API (and why I don’t personally use it): there’s no way to control cost, and you could easily get to a point where you’re paying $0.50 per request.
Normally, this data shouldn’t be handed directly to the Assistants API; it should be embedded into a vector database and retrieved selectively at query time.
One of the benefits of the Assistants API is that it handles the vector/search functionality for you. Otherwise, using Chat Completions with your own vector/search solution would be the way to go…
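For what it’s worth, the query side of a roll-your-own setup is short. A minimal sketch using the `openai` Python SDK; `search_index` and its `query` method are hypothetical stand-ins for whatever vector store you pick:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, search_index) -> str:
    # `search_index` is a stand-in for your vector store; a `query`
    # method returning the top-k matching text chunks is an assumed interface.
    chunks = search_index.query(question, top_k=4)
    context = "\n\n".join(chunks)

    response = client.chat.completions.create(
        # With RAG carrying the knowledge, a cheaper model often suffices.
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

The cost difference comes down to control: you decide exactly how many context tokens go into each call, instead of letting the Assistant decide for you.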
It’s easy to split models when you have distinct, complex tasks. For a conversational bot, though, you’re either going to get good performance with GPT-4 or relatively poor performance with GPT-3.5…
The Assistants API determines that on its own. Again, that’s supposedly one of the benefits of the service. There’s no way that I know of to control how much context the Assistant maintains or how much of the vector database it pulls into the thread…
It’s a good suggestion! But I’m only uploading a few MB, and the Assistants API can (supposedly) handle GBs’ worth of data. I can’t really limit it any further.
As others have suggested, a less expensive model such as GPT-3.5 is an option: because you’re relying on RAG, you don’t need as much raw capability from the LLM.
IMHO, the most effective option would be to chunk the documents and import them into a Pinecone vector DB (a dataset this small fits in the free tier). That would substantially reduce costs while potentially increasing accuracy, depending on how well the document’s structure lends itself to chunking and embedding. A rough ingestion sketch follows.
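A minimal sketch, assuming the current `openai` and `pinecone` Python SDKs and an existing index (named `book` here, hypothetically) whose dimension matches the embedding model; the chunk size and overlap are arbitrary starting points to tune:

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
index = pc.Index("book")  # assumes a 1536-dim index for text-embedding-3-small

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Naive fixed-size character chunking; aligning chunks with the
    # book's structure (chapters, sections) usually retrieves better.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def ingest(text: str) -> None:
    pieces = chunk(text)
    # Embed the chunks (batch this for a whole book; the API caps input
    # size) and upsert each vector with its source text as metadata.
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=pieces
    )
    index.upsert(vectors=[
        (f"chunk-{i}", item.embedding, {"text": pieces[i]})
        for i, item in enumerate(embeddings.data)
    ])
```

At query time you embed the question the same way, query the index for the top matches, and pass only those chunks to the model. That’s what keeps the per-message token count (and cost) bounded.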
The whole point of the Assistants API was to cut down on all the infrastructure and coding. I had a great system working with Chat Completions, but moved over to Assistants thinking I’d save so much time on dev. I guess nothing comes for free: save time on dev, spend more money, and end up with a slower system.
I’ve been using it since it first launched, and it performed better initially; performance has degraded since. I assumed a new Assistant would need time to “warm up” before reaching peak performance, but in my experience it seems to be the opposite.
Let me know if you figure anything out. Thanks again for the responses.