New "Assistants" API: a potential replacement for low-level "RAG"-style content generation?

[my original post was a brain dump of various notes about the new Assistants API and vector database RAG style chat augmentation. I asked GPT4 to translate my notes and thoughts into a more structured format, which you’ll find below.]

Curious if I’m thinking of this correctly, or if Assistants/Files (messages/threads) is not comparable to vector database style RAG generation?

For reference, the traditional RAG workflow processes a query through an embedding model to retrieve context from a vector database, which is then fed into a Large Language Model (LLM) like GPT to generate a response, potentially evaluated against a reference response.

Here is a proposed workflow that leverages the new Assistants API capabilities, analogous to the RAG method:

1. Query Reception:

  • The user submits a query to the Assistant.

2. Preprocessing and Embedding:

  • The Assistants API processes the query and uses its internal mechanisms (potentially an embedding model, though specifics may not be publicly disclosed) to understand the context of the query.
  • The API then identifies key concepts within the query that need additional context for a more accurate response.

3. Retrieval from Files/Threads:

  • The API searches through attached files or threads linked to the Assistant or the specific conversation. These files are pre-indexed and stored, analogous to a vector database, but the content is in raw text form or in some structured format like JSON, CSV, etc.
  • It retrieves the relevant content based on the query’s context, much like the retrieval of context vectors in the traditional RAG workflow.

4. Response Generation:

  • With the relevant context in hand, the Assistants API leverages its LLM (like GPT) to formulate a response. The LLM uses the provided context to inform its generation, ensuring the response is accurate and relevant to the user’s query.
  • The response may be a direct answer, additional questions for clarification, or a more complex response that incorporates data from the retrieved documents.

5. Quality Scoring (Optional):

  • If there is a mechanism within the Assistants API to evaluate the quality of the response (for example, through some form of scoring or feedback system), it can be applied here.
  • The API could potentially adjust future responses based on continuous learning from user interactions and feedback, improving over time.

6. Delivery of Response:

  • The generated response is then delivered to the user.
  • The system logs the interaction, which could be used for future reference or to further train the model on user-specific queries and information needs.

7. Reference Response (Optional):

  • If there is a known good response (a reference response), it could be used to compare the quality of the generated response. This comparison could be manual or automated depending on the system capabilities.

Feedback Loop:

  • User feedback on the response quality could be incorporated into the system, helping to refine the process for future queries.

Here’s a simplified flow diagram for this workflow:

User Query -> Assistant's API -> Preprocessing/Embedding -> Retrieve from Files/Threads -> Generate Response -> Quality Scoring (Optional) -> Deliver Response

This workflow retains the essence of the RAG architecture while leveraging the new functionality offered by the Assistants API, such as the ability to reference external data sources more seamlessly and to maintain ongoing threads that preserve context across multiple interactions.
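The steps above can be sketched with the OpenAI Python SDK (v1.x). This is a sketch under assumptions: the `beta` namespace, the model name, and the run-polling step reflect the Assistants API as launched and may change; `client` is an `openai.OpenAI` instance supplied by the caller.

```python
def run_rag_style_workflow(client, file_path: str, question: str):
    """Sketch of steps 1-6 above; `client` is an openai.OpenAI instance."""
    # Upload a document for retrieval; it is indexed server-side (step 3).
    doc = client.files.create(file=open(file_path, "rb"), purpose="assistants")

    # Create an assistant with the retrieval tool and the file attached.
    assistant = client.beta.assistants.create(
        name="RAG-style assistant",
        instructions="Answer questions using the attached documents.",
        model="gpt-4-1106-preview",  # placeholder model name
        tools=[{"type": "retrieval"}],
        file_ids=[doc.id],
    )

    # Threads hold the conversation; the message carries the query (step 1).
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=question
    )

    # A run triggers retrieval + generation (steps 2-4); poll its status,
    # then read the thread's messages for the delivered response (step 6).
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=assistant.id
    )
    return run
```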


So far, what I’m seeing is that one needs to pass a list of all the files as a parameter. This doesn’t seem like the most optimal approach, since it would mean the embeddings have to be created each time. Does anybody know if we can create the embeddings once and reuse them for subsequent questions?


Looks like for longer docs it will do a vectorized search…

"How it works

The model then decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques:

  1. it either passes the file content in the prompt for short documents, or
  2. performs a vector search for longer documents

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost."
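The two techniques quoted above amount to a size-based decision. The real cutoff and tokenizer are not published, so the threshold and the four-characters-per-token estimate below are made-up placeholders, purely to illustrate the split:

```python
def choose_retrieval_strategy(document_text: str,
                              context_budget_tokens: int = 4000) -> str:
    """Illustrative only: 'short docs in the prompt, long docs via vector search'."""
    approx_tokens = len(document_text) // 4  # rough heuristic, not a real tokenizer
    if approx_tokens <= context_budget_tokens:
        return "pass-content-in-prompt"   # technique 1: short document
    return "vector-search"                # technique 2: long document
```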

So it does indeed embed the docs. I’m still digging, but the only hang-up I’m seeing is the 20-doc limit; honestly, though, you can do a lot with 20 docs that can each be up to 512 MB. I’m not sure yet how it charges for the embedding functionality (if it charges at all!). If it doesn’t charge directly for embedding via this method, it will be an insane game changer for my particular use case.


Looks like it’s pretty pricey.
It uses the selected model for embeddings, and that has a cost.
$0.20/GB/assistant/day adds up fast.

Early hot take: this is great for prototyping, but once production cost is a consideration, other RAG providers are still necessary.


When the information is provided in text or CSV format, we can compress an extensive volume of content (potentially tens of thousands of pages) into a file smaller than 10 megabytes, so the limit of 20 files seems arbitrary for some use cases.
If the system does not charge for embedding, or even if it imposes a one-time fee for the process, it represents a significant paradigm shift. This innovation could make rudimentary RAG systems obsolete.

Having dedicated approximately two months to mastering LlamaIndex, I recognize there is still a vast amount to learn. In contrast, this Assistant system appears to handle complex issues like chunking, embedder fine-tuning, re-ranking, and chunk relationships all at once.

Where can we get more information about the inner algorithms of the Assistant system? I am mainly interested in knowing how much they will charge for the embeddings.


Exactly what I am thinking. The 512 MB limit just means that running some compression on documents lets you fit in more data: say, PDFs converted to plain text, and long texts condensed into dense summaries. Depending on the use case, this is more than enough.

And the cost is worth it, I think. It works out to about $3 per month per assistant, and you don’t have to do your own chunking, embedding, and re-ranking.
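As a back-of-envelope check on that figure, assuming the $0.20/GB/assistant/day rate quoted earlier and a full 512 MB of attached files:

```python
RATE_PER_GB_PER_DAY = 0.20       # quoted retrieval storage rate (assumed)
storage_gb = 512 / 1024          # one maxed-out 512 MB file, in GB
monthly_cost = storage_gb * RATE_PER_GB_PER_DAY * 30
print(f"${monthly_cost:.2f} per assistant per month")  # prints "$3.00 per assistant per month"
```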

Current issues after some initial testing are the lack of traceability (where exactly did you get this data?) and its tendency to present anecdotes as evidence (why is product X bought? because user Z said so). For the first, I see that it reports a number of sources, so there must be some training or feature related to this. For the second, prompting might help, though I haven’t figured it out yet.

It’s an easy way to get locked into the OpenAI ecosystem.


Does anyone know if the 512 MB file size limit comes with a token limit as well? Someone posted that there was a 1.8-million-token-per-file limit. Is that true?


It’s mentioned here:

The maximum file size is 512 MB and no more than 2,000,000 tokens (computed automatically when you attach a file).
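Those two caps (512 MB and 2,000,000 tokens per file) can be pre-checked before uploading. The token estimate below is a rough four-characters-per-token guess; per the docs, the API computes the real count when the file is attached:

```python
MAX_FILE_BYTES = 512 * 1024 * 1024   # 512 MB cap
MAX_FILE_TOKENS = 2_000_000          # per-file token cap

def file_within_limits(size_bytes: int, text_length_chars: int) -> bool:
    approx_tokens = text_length_chars // 4  # crude estimate, not a tokenizer
    return size_bytes <= MAX_FILE_BYTES and approx_tokens <= MAX_FILE_TOKENS
```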


Your post inspired me a lot. Do you have demo code? Thank you so much.
I have a few questions about how to implement the assistant code.
Should the file be attached to the assistant or to the thread? (I want to build an assistant that system administrators can query, but I ran into the difficulties above.)
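As launched, the API accepted `file_ids` in two places, which maps onto this question: files attached to the assistant are available in every thread, while files attached to a message are scoped to that one conversation. A sketch under those assumptions (`client` is an `openai.OpenAI` instance; the model name and IDs are placeholders, and the beta surface may change):

```python
def attach_at_assistant_level(client, file_id: str):
    # Shared knowledge (e.g. runbooks) every sysadmin conversation can use.
    return client.beta.assistants.create(
        name="SysAdmin helper",
        instructions="Answer ops questions from the attached runbooks.",
        model="gpt-4-1106-preview",           # placeholder model name
        tools=[{"type": "retrieval"}],
        file_ids=[file_id],
    )

def attach_at_message_level(client, thread_id: str, file_id: str):
    # A document relevant only to this one thread/conversation.
    return client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content="Summarize the attached incident log.",
        file_ids=[file_id],
    )
```

A rule of thumb under this model: put stable reference material on the assistant, and per-conversation documents on the message.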