How knowledge base files are handled (Assistants API)

Hello,

I have a question about how Assistants work with the files we add as a knowledge base (using the Assistants API at least; I think it’s the same in the Playground).

My question is about how GPT treats these files. Based on the costs I am seeing with the Assistants API, it seems that every time we query the assistant it reads the knowledge base document.

If this is the case, it is not converting the document into embeddings but somehow integrating it into the query prompt. What sequence of steps does it perform? Does it actually read the document on each query? Why doesn’t it use embeddings (which would make the process cheaper)?

Thanks in advance.

It depends on how big the file(s) are. There’s not a lot of documentation for the underlying mechanics.

The model then decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques:

  • it either passes the file content in the prompt for short documents, or
  • performs a vector search for longer documents

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost.

https://platform.openai.com/docs/assistants/tools/how-it-works

So it does perform a similarity search when the document is long enough to require it.

I would highly recommend exploring your own RAG instead. Assistants Retrieval may one day be ideal, but it’s a black box, expensive, and hasn’t received any noticeable updates for 3 months now.

“Assistants Retrieval may one day be ideal, but it’s a black box, expensive.” Could you please elaborate on how expensive it is? I’m looking into using Assistants and would like an idea of how much it would cost. Thanks.

Thank you for a thorough answer. I appreciate that.

Sorry for the question, but what is a RAG? (I’m a noob at this)

Thanks in advance


This is a reply from kapa.ai (ref)

RAG stands for Retrieval Augmented Generation. It’s a technique that examines the latest user input and the context of the conversation, then uses embeddings or other search techniques to fill the model’s context with specialized knowledge relevant to the topic. It is typically used to build an AI that can answer questions about closed-domain material, such as a company knowledge base.

The phrase Retrieval Augmented Generation (RAG) comes from a 2020 paper (https://arxiv.org/abs/2005.11401) by Lewis et al. from Facebook AI. The idea is to use a pre-trained language model (LM) to generate text, but to use a separate retrieval system to find relevant documents to condition the LM on.

In practice, RAG can be combined with tools like Elasticsearch or Qdrant to enhance the performance of AI models. It can also be fine-tuned for specific use-cases, and can be used in conjunction with Few-Shot Learning to boost the model’s performance and reduce hallucinations.
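
To make the retrieve-then-generate idea concrete, here is a minimal sketch in Python. It is not how the Assistants API works internally (that is undocumented). The `embed` function is a toy word-count stand-in for a real embedding model, and `KB`, `retrieve`, and `build_prompt` are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Put only the retrieved chunks, not the whole file, into the prompt."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base, already split into chunks.
KB = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are emailed on the first day of each month.",
    "Refunds are processed within five business days.",
]

if __name__ == "__main__":
    print(build_prompt("How do I reset my password?", KB, k=1))
```

The string returned by `build_prompt` is what you would send to the model. Because only the top `k` chunks go in, the input token count is bounded by the chunk size rather than by the size of the knowledge base.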

Hi, I’m new here and new to the Assistants API.
I don’t quite understand how Retrieval works when you break it down. I hope you can verify whether I’m on the right track.

From my understanding, when you ask an assistant a question and it thinks it needs context, it triggers the RAG step? The retrieval function then keeps appending text to the message until the assistant is satisfied, and then it starts generating an answer based on what the retrieval function passed? And will the context that was passed count as input tokens? (This is my question.)

For example, in my case:

I’m creating an assistant with 2 files attached. When I ask it something that requires the context of those files, RAG will start passing content, and that will be counted as input tokens? And I will be charged at something like the GPT-4 Turbo input rate (0.1 USD / 1K tokens)?

It seems that the retrieval model tends to pass the whole files instead of just the most relevant parts, which results in a steep price?

In my case I asked only one question and it reports 10633 tokens (10122 input + output).
Does that mean a single question will cost at least 1 dollar? That’s an awful lot of money.

(Sorry for my bad English. I’m not a native speaker. I hope this isn’t an issue.)
Thanks for your patience and time :+1:

And what is a context token? (An input token?) I don’t think so, because if that were the case I would be charged around 20 USD, but it’s 3.5 USD this month.
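
The per-question estimate above can be checked with a little arithmetic. In this sketch the token counts come from the post (reading 10633 total as 10122 input plus 511 output, which is an interpretation), and the rates are parameters rather than confirmed prices. The 0.1 USD figure is the one quoted in the post; the 0.01 and 0.03 USD figures are what I believe the published GPT-4 Turbo rates were at the time, but treat them as assumptions and check the pricing page.

```python
def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_1k: float,
    output_rate_per_1k: float,
) -> float:
    """Cost of one model call, given per-1K-token rates in USD."""
    return (
        input_tokens / 1000 * input_rate_per_1k
        + output_tokens / 1000 * output_rate_per_1k
    )


# Token counts reported in the post.
TOTAL_TOKENS = 10633
INPUT_TOKENS = 10122
OUTPUT_TOKENS = TOTAL_TOKENS - INPUT_TOKENS

if __name__ == "__main__":
    # Input cost at the rate quoted in the post (0.1 USD per 1K tokens).
    print(call_cost_usd(INPUT_TOKENS, 0, 0.1, 0.0))
    # Same call at the assumed rates of 0.01 (input) and 0.03 (output).
    print(call_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, 0.01, 0.03))
```

At the quoted rate the input alone comes to roughly 1 USD, which matches the worry in the post. At the assumed lower rates the whole call comes to roughly 0.12 USD, which would be more in line with a monthly bill of a few dollars.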

Hi there!

I have the same problem with uncontrolled and significant token consumption when using the Assistants Retrieval feature. Sometimes the assistant consumes 7 to 15k tokens just to reply to one question.

Do you have any ideas on how to reduce token usage? Any tips or tricks (besides developing my own RAG system)? Thanks!

How many documents do you have?
How big are they?
What file type are they?

It’s just one .TXT file of 1 MB with text data and markdown. The file contains knowledge base articles (no HTML tags or anything else).

Try running the TXT file through GPT and asking it to distill, condense, and convert it to Markdown while retaining the semantics and facts of the document, so that it can be parsed by an embedding model.

You can also try this. I’m interested to see how it works:

# MISSION
You are a Sparse Priming Representation (SPR) writer. An SPR is a particular kind of use of language for advanced NLP, NLU, and NLG tasks, particularly useful for the latest generation of Large Language Models (LLMs). You will be given information by the USER which you are to render as an SPR.

# THEORY
LLMs are a kind of deep neural network. They have been demonstrated to embed knowledge, abilities, and concepts, ranging from reasoning to planning, and even to theory of mind. These are called latent abilities and latent content, collectively referred to as latent space. The latent space of an LLM can be activated with the correct series of words as inputs, which will create a useful internal state of the neural network. This is not unlike how the right shorthand cues can prime a human mind to think in a certain way. Like human minds, LLMs are associative, meaning you only need to use the correct associations to "prime" another model to think in the same way.

# METHODOLOGY
Render the input as a distilled list of succinct statements, assertions, associations, concepts, analogies, and metaphors. The idea is to capture as much, conceptually, as possible but with as few words as possible. Write it in a way that makes sense to you, as the future audience will be another language model, not a human. Use complete sentences.

Unfortunately the only option you really have is to try and process the document(s) better for retrieval. OpenAI has not documented any of their embedding techniques so it’s a shot in the dark.

RAG is not as scary as it sounds, and I highly recommend moving towards it for now, at least until the retrieval system is out of beta.

Ronald, thank you for the idea! In our case, the articles in the file contain exact instructions and details that can’t be summarized or changed. As for now, building your own RAG system looks like the best solution.

One area where a summarization approach can be useful is chunking larger documents. The temptation is often to use large chunk sizes, which can “dilute” the semantic meaning over many words and concepts compared to the narrowly defined search terms.

A summary of each large “chunk” would help move that chunk to a more sharply defined location in the embedding space, increasing the likelihood of a good match being found. Adding metadata to each summarized chunk that points to the original text would also be useful, I think.
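
As a rough sketch of that idea in Python: match the query against the summaries, but hand back the untouched original text. The `IndexedChunk` type, the `search` function, and the sample data are hypothetical, and the Jaccard word overlap is only a toy stand-in for embedding similarity.

```python
import re
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    summary: str   # short text that gets embedded and searched
    original: str  # metadata: the full original text for the LLM


def tokens(text: str) -> set[str]:
    """Lowercase word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Jaccard similarity, a toy stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def search(query: str, index: list[IndexedChunk]) -> IndexedChunk:
    """Rank by summary, but return the entry holding the original text."""
    return max(index, key=lambda c: overlap(query, c.summary))


# Hypothetical index: summaries would normally be written by an LLM.
INDEX = [
    IndexedChunk(
        summary="How to reset a password",
        original="Step 1: open Settings. Step 2: choose Security. Step 3: click Reset.",
    ),
    IndexedChunk(
        summary="Invoice schedule and billing dates",
        original="Invoices are generated on the first day of each month.",
    ),
]

if __name__ == "__main__":
    print(search("reset password", INDEX).original)
```

Since the summary is used only for matching, this also fits documents whose exact wording must not change: the model still sees the original instructions verbatim.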

Thank you very much! That makes sense. Can you provide any useful articles or research on this topic?

You might take a look at SciBERTSUM: Extractive Summarization for Scientific Documents.

But it should be noted that simply creating a summary on a per page/paragraph level will usually be sufficient.

I personally like my chunking system, which chunks to “about 500 words”, or one typical page, but will intelligently go slightly over or under depending on paragraph boundaries. You can then summarise each of these chunks and embed the summary along with a metadata entry pointing to the original text. That way you search via the summary but provide the full page as context when evaluating it in the LLM.
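
A chunker along those lines could look like the sketch below. This is not my exact implementation, just one way to do it under some assumptions: paragraphs are separated by blank lines, words are split on whitespace, and `chunk_paragraphs` and `target_words` are illustrative names.

```python
def chunk_paragraphs(text: str, target_words: int = 500) -> list[str]:
    """Group whole paragraphs into chunks of roughly target_words words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk when adding this paragraph would overshoot
        # the target by more than stopping here would undershoot it.
        overshoot = count + n - target_words
        undershoot = target_words - count
        if current and overshoot > undershoot:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    # Four synthetic paragraphs of 300, 150, 200 and 400 words.
    sample = "\n\n".join(" ".join(["word"] * n) for n in (300, 150, 200, 400))
    for chunk in chunk_paragraphs(sample):
        print(len(chunk.split()))
```

Paragraphs are never split, so a single very long paragraph becomes its own oversized chunk. Each chunk can then be summarised and indexed with a pointer back to the full text.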