Assistants API is Killing Me

Definitely more expensive.
In terms of flexibility, there’s a trade-off for sure. It simplifies your solution somewhat, since the core functionality is handled by OpenAI, but it’s rigid with anything to do with retrieval, etc.

wow I didn’t even think of the cost, I thought it would only be 2x more expensive than the completions

what are you using instead???

so you recommend using 3.5 over 4?

my use case is:
A medical sales “doctor” (it diagnoses the user’s “symptoms”, then sells a product from the knowledge base)

Just to clarify – there is barely any effort in infrastructure or coding by offloading the embedding to Pinecone.

You’re simply trading one store for another. Because Pinecone is SaaS you have zero infrastructure effort and it doesn’t require code.

If that route isn’t ideal, GPT-3.5 shouldn’t be an issue, since the answers are driven primarily by the RAG context rather than by zero-shot knowledge from training.

In both cases, I think you’re just finding a way to reduce compute which reduces cost. And the only real investment is the learning if you’ve never architected in this manner. However, that’s only about an hour of effort.
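For anyone new to this pattern, here is a minimal sketch of the retrieve-then-prompt idea, in plain Python. It is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the list plays the role of the vector store (Pinecone or otherwise), and the product entries are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    # Rank every chunk against the query and keep the k most similar.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    # Only the retrieved chunks are sent to the model, which keeps the prompt small.
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base entries, invented for this example.
KNOWLEDGE_BASE = [
    "Product A is a magnesium supplement for muscle cramps.",
    "Product B is a saline nasal spray for congestion.",
    "Product C is a vitamin D supplement for low sun exposure.",
]

if __name__ == "__main__":
    print(build_prompt("I have muscle cramps at night", KNOWLEDGE_BASE, k=1))
```

The string returned by `build_prompt` is what you would send to the chat model. Because only the top-k chunks go in, the prompt stays short, which is where the cost reduction comes from.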

If you’d like help implementing either, I’m happy to share whatever I can.

I’m stuck using GPT-4. It does way better than 3.5 with conversational agents that need to be precise. In your case, sounds like the precision is very much needed. If Assistants is too expensive for you, then going with one of the above suggestions (e.g., building your own RAG system) will be the cheaper option (operationally).

That’s awesome of you, thank you! I’ve not played around with Pinecone yet, but I’m reading the docs right now and I’ll see how that goes. I may ping you privately if I get stuck :)

Even if you don’t use it for this project, it’s extremely valuable if you’re doing any development with LLMs/RAG and will be worth your time, I promise you.

One of our users, @mehulgupta2016154, recently wrote a book on LangChain.

They also shared their YouTube channel, which has some very quick/simple vector DB tutorials in Python.

This particular video utilizes ChromaDB (you can run locally), but it’s a nice conceptual view:

I promise I’m not going off-topic ;) I know we’re not talking about LangChain, but it provides a library that streamlines portions of the process.


Thanks @DevGirl, much appreciated :)


Thanks @DevGirl for the mention.

I built my own RAG system in 10 minutes lol, I am no longer using the Assistants API

Building a RAG system isn’t the challenge here. It’s all the prep that goes into the data and all the subsequent updates. Managing context, including metadata, testing, etc. With Assistants API, you just upload a file and it does the rest… and it seems to handle the documents quite well.
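To make the prep work concrete, here is a rough sketch of one piece of it: splitting a document into overlapping chunks and attaching metadata to each one. The function name, the field names, and the word-based sizes are all assumptions for illustration; real pipelines usually count tokens and split on sentence or section boundaries.

```python
def chunk_document(text, source, max_words=50, overlap=10):
    """Split text into overlapping word windows, each tagged with metadata."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append({
            "text": " ".join(window),
            "source": source,            # which file the chunk came from
            "start_word": start,         # position, useful for re-ordering later
            "chunk_index": len(chunks),  # running index within this document
        })
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks

if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(120))
    for chunk in chunk_document(sample, "faq.txt"):
        print(chunk["chunk_index"], chunk["start_word"], chunk["text"][:30])
```

When a file changes, you would re-chunk just that file and replace every stored chunk whose `source` matches it. That is the update bookkeeping Assistants hides from you.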

What’s your approach to the data?

But for more complex conversations GPT-4 performs better IMO

For those of you wanting to try out building your own RAG system I’ll throw out another option which is Vectra, the local Vector DB I created. It has full parity with Pinecone but it’s free. It also has a fairly robust document import mechanism. A number of projects are using it so it’s starting to get fairly polished:

I’d still recommend something like Pinecone for large scale production scenarios but Vectra is a great way to give RAG a try without having to spend money or create any sort of account

probably off topic on this thread, but thought I would ask anyways… What are the differences between Vectra and LanceDB?

I haven’t looked at LanceDB. Honestly I built Vectra because I’m a JS/TS developer and when I started tinkering with RAG there weren’t any JS based solutions. A friend (and now one of my employees) ported it to Python for me. In fact when I started exploring LLMs there weren’t any JS tools period so I ended up having to build my own entire stack. I’m an SDK Architect though so realistically I would have built my own stack anyway.

These days I’d humbly say that I’ve become an expert in all things RAG and I have a whole slew of improvements coming to Vectra 2. I looked at LlamaIndex for instance and while it’s awesome I don’t like the file format. It’s too large. My goal with Vectra 2, and another library I’m building called DSTRUCTure, is to build the state of the art suite of tools for all things RAG.

Ok so I did a quick scan of LanceDB and it looks cool. Rust core with numerous language bindings. But here’s my issue with all of these vector DBs. They only ever tackle one part of the RAG pipeline… the middle part, the DB. They leave it up to the developer to build the other parts of the pipeline, Ingest and Render, which are arguably the more difficult parts. You can use LangChain to help with the Ingest side but there’s lots of innovation left to be done on the Ingest side of things and nobody is really trying to innovate on the Render side of things. The work I did on Document Sections in Vectra is still the most significant effort I’ve seen to innovate on the Render side and there’s lots of room for improvement there.
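To illustrate what a Render step might involve, here is a small sketch: pick the best-scoring chunks that fit a budget, put them back in reading order, and merge neighbours into sections. This is not Vectra’s actual implementation. The `hits` shape (`doc`, `index`, `score`, `text`) and the one-token-per-word estimate are assumptions made up for the example.

```python
def render_context(hits, max_tokens=60):
    """Turn scored retrieval hits into a context string for the prompt."""
    # 1. Greedily keep the best-scoring chunks that still fit in the budget.
    chosen, used = [], 0
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        cost = len(hit["text"].split())  # crude estimate: one token per word
        if used + cost <= max_tokens:
            chosen.append(hit)
            used += cost

    # 2. Restore reading order: group by document, then by chunk position.
    chosen.sort(key=lambda h: (h["doc"], h["index"]))

    # 3. Merge chunks that are adjacent in the same document into one section.
    sections = []
    for hit in chosen:
        last = sections[-1] if sections else None
        if last and last["doc"] == hit["doc"] and last["end"] + 1 == hit["index"]:
            last["text"] += " " + hit["text"]
            last["end"] = hit["index"]
        else:
            sections.append({"doc": hit["doc"], "end": hit["index"], "text": hit["text"]})

    return "\n\n".join(f"[{s['doc']}]\n{s['text']}" for s in sections)

if __name__ == "__main__":
    demo_hits = [
        {"doc": "a.txt", "index": 2, "score": 0.9, "text": "third part"},
        {"doc": "a.txt", "index": 1, "score": 0.8, "text": "second part"},
        {"doc": "b.txt", "index": 0, "score": 0.7, "text": "other doc"},
    ]
    print(render_context(demo_hits))
```

The model then sees contiguous passages in their original order instead of a pile of fragments sorted by similarity score.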

I build tools for developers and my goal is always to create as turnkey a solution as I can for developers. The Teams AI Library I designed, for example, virtually eliminates the need for prompt engineering. I applaud innovation wherever I see it; I just always prefer holistic, turnkey solutions personally.

LlamaIndex might be useful: it is open source, free (MIT License), and can build a RAG system in a few lines of code.

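import os

# LlamaIndex defaults to OpenAI models, which read the key from this variable.
# setdefault keeps any key that is already set in your environment.
os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")

# Note: in llama_index 0.10 and later, import these from llama_index.core instead.
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load every file in the directory and build an in-memory vector index over it.
documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)

To query:

# Retrieve the most relevant chunks and let the LLM answer from them.
query_engine = index.as_query_engine()
response = query_engine.query("YOUR_QUESTION")
print(response)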

I haven’t used it myself yet but it seems to be quite popular.

I just use Flowise / Langflow. It makes the whole process super easy: you just connect nodes and it does the rest. Not gonna lie, though, the speed ended up being exactly the same as the Assistants API and the costs are similar, so I might go back to Assistants.

The Assistants API itself isn’t slow; it’s GPT-4 being slow that makes the AI take so long.

Can these RAG solutions dynamically update the data they hold too? So if a user gave it relevant info in a prompt, it could store it for later use. Edit: just asked ChatGPT, yes they can. Might try it out, but I will have to use a Flask webserver or websocket, as the platform my app is on won’t run Python code (I use a C# wrapper class for using the OpenAI API).
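As a rough sketch of what dynamic updating looks like: the store just needs an `add` method you can call at any time, for example whenever a user message contains something worth remembering. `MemoryStore` and its helpers are hypothetical names, and the word-count “embedding” is a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

def _embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class MemoryStore:
    """In-memory vector store that accepts new entries while the app is running."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        # Embed once at insert time so later searches only embed the query.
        self.items.append((text, _embed(text)))

    def search(self, query, k=1):
        q = _embed(query)
        ranked = sorted(self.items, key=lambda item: _cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

if __name__ == "__main__":
    store = MemoryStore()
    store.add("The clinic opens at 9am on weekdays.")
    # Later in the conversation the user volunteers something new:
    store.add("User said they are allergic to penicillin.")
    print(store.search("What is the user allergic to?"))
```

Hosted vector DBs expose the same idea as an upsert call, so the class above could sit behind whatever web endpoint your C# side talks to.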