I’ve set up an Assistant that contains one book (a few MB) and one other document (a few KB) as references. My instruction set isn’t that large. Now that OpenAI has exposed token counts on Threads, I’m realizing how expensive this seems to have become. Some threads with only a few messages back and forth come to tens of thousands of tokens. I’ve guesstimated that the cost per message (each way) is somewhere between $0.07 and $0.10. That’s crazy.
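For reference, here’s the back-of-the-envelope math I used (a minimal sketch; the per-token prices are my assumptions based on roughly current GPT-4 Turbo rates, so check the pricing page for your model):

```python
# Rough per-message cost estimate for an Assistants thread.
# These prices are illustrative assumptions (USD per 1K tokens),
# not official figures -- verify against OpenAI's pricing page.
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03

def message_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of one message round trip."""
    return (prompt_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (completion_tokens / 1000) * OUTPUT_PRICE_PER_1K

# A thread that re-sends ~8K tokens of context with every message
# adds up fast:
print(round(message_cost(8000, 500), 3))  # → 0.095
```

That lands right in the $0.07–$0.10 range: the killer isn’t the reply, it’s the context the Assistant re-sends on every turn.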
Anyone else having issues? Am I doing something wrong here?
The Assistants API determines that on its own; supposedly that’s one of the benefits of the service. There’s no way that I know of to control how much context the Assistant maintains or how much of the vector database it pulls into the thread…
As others have suggested, a less expensive model such as GPT-3.5 is an option: since you’re relying on RAG, you don’t need as much power in the LLM.
IMHO, the most effective option would be to chunk the book yourself, import it into a Pinecone vector DB (at that size it’s small enough to run on the free tier), and retrieve only the relevant chunks per query. That substantially reduces costs and can even increase accuracy, depending on how well the document lends itself to structured chunking.
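A minimal chunking sketch to illustrate the first step (character-based splitting with overlap; the sizes here are arbitrary assumptions, and the embedding/upsert steps are only described in the docstring since they depend on your Pinecone setup):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks.

    Each chunk would then be embedded (with whatever embeddings model
    you prefer) and upserted into a Pinecone index; at query time you
    embed the question, fetch only the top-k matching chunks, and pass
    those to Chat Completions instead of letting the Assistant decide
    how much of the file to pull in.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a 2,000-character document becomes three overlapping chunks.
doc = "x" * 2000
print(len(chunk_text(doc)))  # → 3
```

The overlap is there so a sentence that straddles a chunk boundary still appears whole in at least one chunk; you’d tune both numbers to your book’s structure.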
The whole point of the Assistants API was to cut down on all the infrastructure and coding. I had a great system working with Chat Completions, but moved over to Assistants thinking I’d save so much time on dev. I guess nothing comes for free: save time, spend more money, and end up with a slower system.
I’ve been using it since it first launched, and it performed better initially; performance has degraded since. I assumed a new Assistant would need time to “warm up” before reaching peak performance, but in my experience it seems to be the opposite.
Let me know if you figure anything out. Thanks again for the responses.