Assistants API is Killing Me

I’ve set up an Assistant with one book (a few MB) and one other document (a few KB) as reference files. My instruction set isn’t that large. Now that OpenAI have exposed token counts on Threads, I’m realizing how expensive this has become: some threads with only a few messages back and forth come to tens of thousands of tokens. I’ve estimated that the cost per message (each way) lands somewhere between $0.07 and $0.10. That’s crazy.
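For anyone wanting to sanity-check the math: here’s a rough sketch of the per-message estimate, assuming GPT-4 Turbo’s launch rates ($0.01 / 1K input tokens, $0.03 / 1K output tokens — check the current pricing page, these change). The function and numbers are illustrative, not anything official.

```python
def message_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 0.01, output_rate: float = 0.03) -> float:
    """Cost in USD for one request; rates are per 1K tokens."""
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

# e.g. ~8,000 tokens of retrieved context + history in, ~500 tokens back:
print(round(message_cost(8_000, 500), 4))  # 0.095
```

That single example already lands in the $0.07–$0.10 range, and the input side dominates because the whole retrieved context gets re-sent on every turn.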

Anyone else having issues? Am I doing something wrong here?

That’s one of the downsides of the current Assistants API (and why I don’t personally use it): there’s no way to control cost, and you could easily get to a point where you’re paying $0.50 per request.

Normally, this data shouldn’t be retrieved directly by the Assistants API; it should instead be embedded into a vector database and queried as needed.

Change your assistant’s model. Use GPT-3.5; a request is usually less than one cent.


One of the benefits of the Assistants API is that it handles the vector/search functionality for you. Otherwise, using Chat Completions with your own vector/search solution would be the way to go…
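For anyone weighing that route, here’s a minimal sketch of the “own vector/search solution”: embed the document chunks once, then at query time rank them by cosine similarity and send only the top-k chunks to Chat Completions. The function and variable names are illustrative, not an official API.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """chunks: (text, embedding) pairs; returns the k most similar texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The key cost difference versus the Assistants API: you decide exactly how many chunks (and therefore how many tokens) go into each request.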

I’ve thought of doing that, but in some cases GPT-4 performs way better than 3.5 :frowning:

Of course! That’s why I handle requests with GPT-3.5 and delegate more complex ones to GPT-4.
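One way to do that split is a cheap heuristic router in front of the API call: short conversational turns go to gpt-3.5-turbo, longer or analytical ones to gpt-4. The keyword list and length threshold below are placeholders you’d tune for your own traffic.

```python
# Hypothetical routing heuristic -- tune the hints and threshold to your use case.
COMPLEX_HINTS = ("analyze", "compare", "summarize", "explain why")

def pick_model(user_message: str) -> str:
    """Route a message to a model name based on simple complexity cues."""
    msg = user_message.lower()
    if len(msg) > 400 or any(hint in msg for hint in COMPLEX_HINTS):
        return "gpt-4"
    return "gpt-3.5-turbo"
```

A heuristic like this is crude, but even a rough split can cut costs substantially if most traffic is small talk.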


Reducing the data to the minimum necessary, and limiting how much is used during retrieval, may also help.

Easy to separate if you have complex tasks. For a conversational bot, you’re either going to have good performance with GPT-4 or relatively poor performance with GPT-3.5…

The Assistants API determines that on its own — again, supposedly one of the benefits of the service. There’s no way that I know of to control how much context the Assistant maintains or how much of the vector database it pulls into the thread…

For conversation (I mean small talk), there isn’t much difference between GPT-3.5 and GPT-4.

To be honest in most cases I prefer GPT-3.5 for the vast majority of tasks. GPT-4 tends to overthink things.

I apologize for any confusion.
What I was referring to was the size of the data when uploading…

I understand that the benefit of an assistant lies in automated retrieval.

I thought that by limiting the data to be uploaded, it might help save some costs, but I apologize if my suggestion was off the mark.

It’s a good suggestion! But I’m only uploading a few MB. The Assistants API can handle GBs worth of data (supposedly). Can’t really limit it any further :frowning:

For the time being, put together your own local RAG agent.

Here are some code examples:

(A fully featured, stable plugin for Discourse, written in Ruby on Rails)

(A Python implementation)

You’re not doing anything wrong.

As others have suggested, a less expensive LLM such as GPT-3.5 is an option, because you’re relying on RAG and don’t need as much power from the LLM.

IMHO, the most effective option would be to chunk the documents, import them into a Pinecone vector DB (at that size it can run on the free tier), and query that instead. This substantially reduces cost while potentially increasing accuracy, depending on how well the document lends itself to structured chunking.
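The chunking step itself can be very simple. Here’s a hedged sketch of fixed-size chunking with overlap, so retrieval pulls back a few hundred characters of context instead of the whole book; the sizes are illustrative, and real pipelines usually split on sentence or section boundaries instead.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with a small overlap between neighbours."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

Each chunk would then be embedded and upserted into the vector DB; the overlap prevents a relevant sentence from being cut in half at a chunk boundary.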

You aren’t doing anything wrong.
It would be beneficial if the Assistant API could retrieve data from files more efficiently.

I also need to learn more about the Assistant API.

The whole point of the Assistants API was to cut down on infrastructure and coding. I had a great system working with Chat Completions, but moved over to Assistants thinking I’d save so much dev time. I guess nothing comes for free: save time, spend more money, and end up with a slower system.

I’ve been using it since it first launched, and it performed better initially; performance has since degraded. I assumed a new Assistant would need time to “warm up” before reaching peak performance, but in my experience it seems to be the opposite.

Let me know if you figure anything out. Thanks again for the responses.

I’d say:

  • less flexible
  • more expensive

(at this point)