Force the Assistant to read the knowledge base files before generating the reply

At least it works!

If all you need is to say “Please write a Lead Generator email”, you could just split your Assistants: a Lead Generator Assistant and a Webinar Assistant.

Otherwise:

It makes sense that you want to use RAG for this. Typically the retrieval runs on every chat - it’s just not doing what you expected.

The retrieval system is very… black-boxed. For an already black-boxed AI. It’s just not much fun to work with.

Since you have a working product and can now focus on efficiency, you can build your own simple RAG system. It’s a lot easier than most guides make it seem.

You can use a powerful model like text-embedding-3-large, even at its full dimensionality (3072) (although lower dimensions are more than suitable for most tasks), and then just create a simple JSON file to hold the embeddings. Your file would only be roughly 3 kB - 42 kB per item (you can reduce this to 2 kB - 25 kB using pickle/numpy/whatever).
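A minimal sketch of that indexing step, assuming the official OpenAI Python client; the knowledge-base items and the `embeddings.json` file name are just placeholders for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical knowledge-base items: one text blob per "mode" you want to route to.
items = {
    "lead_gen_email": "Instructions and examples for lead generation emails...",
    "webinar_email": "Instructions and examples for webinar invitation emails...",
}

embeddings = {}
for key, text in items.items():
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    embeddings[key] = resp.data[0].embedding  # list of 3072 floats

# Plain JSON is fine at this scale; pickle/numpy shrink it further.
with open("embeddings.json", "w") as f:
    json.dump(embeddings, f)
```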

Now you can run a simple dot product between the embedded query and your in-memory JSON file (the numpy library handles this) and map the winning item to the correct instructions/context.
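Here’s a sketch of that lookup step, continuing the example above. OpenAI embeddings come back unit-normalized, so a plain dot product is effectively cosine similarity:

```python
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

with open("embeddings.json") as f:
    embeddings = json.load(f)

keys = list(embeddings)
matrix = np.array([embeddings[k] for k in keys])  # shape: (n_items, 3072)

def best_match(query: str) -> str:
    resp = client.embeddings.create(model="text-embedding-3-large", input=query)
    q = np.array(resp.data[0].embedding)
    scores = matrix @ q  # dot product against every stored item at once
    return keys[int(np.argmax(scores))]  # top-1 key -> your instructions/context mapping

print(best_match("Write a lead generation email for our SaaS product"))
```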

It will be very fast, not cost anything (besides embedding the query), not require any external databases, and give you much more control over your RAG.

All you would do is compare these and return the top 1. If you find bias in the results, you can adjust the embeddings by applying a weighted centroid calculation (give 90% weight to the main embedding and 10% to the bias you are trying to eliminate).
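A hedged sketch of that adjustment: blend each stored vector 90/10 with a “bias” vector and re-normalize. How you obtain the bias vector (e.g., embedding the phrasing that keeps winning incorrectly) is an assumption here, not something the thread specifies:

```python
import numpy as np

def adjust(main_vec: np.ndarray, bias_vec: np.ndarray, w: float = 0.9) -> np.ndarray:
    # Weighted centroid: 90% original embedding, 10% bias direction (assumed split).
    blended = w * main_vec + (1.0 - w) * bias_vec
    # Re-normalize so dot-product scores stay comparable across items.
    return blended / np.linalg.norm(blended)
```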

Oh. Wow. I’ve heard about this but am unsure how to go about it, or even where to start.

Do I have to use Pinecone, or is it something I can do locally?

Can you point me to a guide on how to do this?

Thanks for sharing your knowledge, brother. I have sent you a DM regarding an opportunity. Please take a look at it and let me know where you stand.

Well… I think I have found a solution, using Pinecone to create a RAG architecture.

A query that used 17,000 tokens on the Assistants API used just 1,400 tokens after I created my RAG architecture.

I used this Jupyter notebook from Pinecone as a template (Google Colab).

Hello, has your problem been resolved? We have also encountered a very similar issue.