I feel that for longer (agentic) process flows, where it is not just about answering a single specific query, looping in the human could be an interesting way to avoid surprises in the final response/output.
Ideally your system should be able to identify when it needs human assistance.
I'm thinking it should be possible to identify whether your RAG results are focused enough or whether the query needs further refinement.
This is the approach I am taking for now.
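As a minimal sketch of what such a check could look like (assuming cosine similarity scores in [0, 1]; the thresholds and the `needs_refinement` helper are purely illustrative, not something anyone in this thread actually shipped):

```python
# Rough sketch of a "focus check" on retrieval results. Assumption: scores are
# cosine similarities in [0, 1], higher = closer; thresholds are illustrative.
def needs_refinement(scores, min_top=0.75, max_spread=0.10):
    """Flag a query for human refinement when the top hits are weak or scattered."""
    top = sorted(scores, reverse=True)[:5]
    spread = top[0] - top[-1]
    return top[0] < min_top or spread > max_spread

# Strong, tight cluster -> proceed automatically
print(needs_refinement([0.82, 0.80, 0.79, 0.78, 0.77, 0.41]))  # False
# Weak and scattered -> ask the user to refine the query
print(needs_refinement([0.62, 0.55, 0.43, 0.40, 0.38]))        # True
```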
That's what I call "vector dilution": when an embedded chunk contains more than one precise idea (what I call an atomic idea, because it doesn't make sense to break it down into smaller pieces), its position in the embedding space is kind of "blurry" or "diluted" across the ideas it contains.
So when you run the retrieval query, which in most cases contains only one idea and is more precise, it is hard to get a high similarity match with the "diluted" vectors of the chunks you're looking for.
When your corpus is small, this is not easy to diagnose. But when the amount of data grows, you end up with tons of chunks spread all over the place in your rankings, and if your cap on how many chunks to include in the prompt is small (which it should be), you miss the important info.
Ideally, your chunks should contain only one idea at a time plus a means to trace them back to their source, so you can pull more context into the prompt if needed (because atomic idea chunks often are not enough to answer complex questions). So your retrieval becomes multi-step: find the chunks that match the query, pull more context from the chunks' source references, build the prompt, and only then answer.
But that's easier said than done (unless you have a robust chunking tool, see some of my previous messages on the subject).
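As an illustration of that multi-step flow, here is a minimal sketch only: the `vector_store` and `doc_store` interfaces and the `parent_id` field are hypothetical placeholders for whatever your stack provides.

```python
# Sketch of multi-step "atomic chunk -> parent context" retrieval.
# Assumptions: vector_store.search returns chunks carrying a parent_id that
# points at the larger source passage, and doc_store.get resolves it.

def retrieve_with_context(query, vector_store, doc_store, k=20):
    # Step 1: match small, single-idea chunks against the query
    hits = vector_store.search(query, top_k=k)

    # Step 2: pull the larger source passages the matching chunks came from
    parents = {hit["parent_id"] for hit in hits}
    context_blocks = [doc_store.get(parent_id) for parent_id in parents]

    # Step 3: build the prompt from the expanded context, then answer
    context = "\n\n".join(context_blocks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"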
But you can still make it work even in this situation.
Here are some ideas:
Don't cut the number of results based on their similarity to the query, but rather on their "usefulness": increase the number of chunks you pull out. Then, to make them fit into your prompt, instead of pushing all the results in, select those that either contain the answer or additional information that helps improve the answer. This way you trim your results in an extra step, but you improve the quality of the answer.
Once the results are pulled out, run them in parallel against a model trained to evaluate their "usefulness", then select only the ones that pass the test (again, I don't have the link handy, but you can search my messages on the forum for more details).
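To make the second idea concrete, here is a minimal sketch, assuming the official `openai` Python client and a simple YES/NO verdict rather than a purpose-trained evaluator; the prompt wording and model choice are just placeholders:

```python
# Parallel "usefulness" filter over retrieved chunks (illustrative sketch).
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def is_useful(query: str, chunk: str) -> bool:
    # One small call per chunk: does this passage help answer the question?
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer YES if the passage contains the answer "
             "to the question or information that improves the answer, otherwise NO."},
            {"role": "user", "content": f"Question: {query}\n\nPassage: {chunk}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def filter_chunks(query: str, chunks: list[str]) -> list[str]:
    # Evaluate all candidate chunks concurrently, keep only the ones that pass
    with ThreadPoolExecutor(max_workers=8) as pool:
        verdicts = list(pool.map(lambda c: is_useful(query, c), chunks))
    return [c for c, keep in zip(chunks, verdicts) if keep]
```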
But with so little info about your app and your data structure schema, it's hard to come up with something truly useful.
Please don't take offense, but to me it sounds like you probably missed some important details when designing your data structures, which causes your vectors to not reflect the information you're trying to find well enough. What's the goal of the retrieval, and what do your chunks look like?
Just a follow-up. In addition to implementing my own Semantic Chunking strategy (Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop) as well as Small-to-Big chunk retrieval for better chunk context (Advanced RAG 01: Small-to-Big Retrieval | by Sophia Yang, Ph.D. | Towards Data Science),
I also deployed your "Deep Dive" strategy. Essentially, I take the top 50 (or even 100) cosine similarity search results and rate each chunk based upon its relationship to the actual question asked. I do one chunk at a time, which ensures the best model response. I then return the highest-rated chunks together as context to the model for a complete answer.
I'm using OpenAI's new text-embedding-3-large embedding model.
Not only is this process faster than I thought it would be (since each API call only returns a single rating number, in my case 0-10), but it's also far less expensive than I imagined (especially with the new gpt-4o-mini and gemini-1.5-flash models).
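Roughly, the rating pass looks like this. A simplified sketch only: the prompt wording, threshold, and cap shown here are illustrative, not my exact implementation, and I'm assuming the official `openai` Python client.

```python
# "Deep Dive" sketch: rate each retrieved chunk 0-10 against the question,
# one chunk per call, then keep the highest-rated ones as context.
from openai import OpenAI

client = OpenAI()

def rate_chunk(question: str, chunk: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rate how relevant the passage is to the "
             "question on a scale of 0-10. Reply with the number only."},
            {"role": "user", "content": f"Question: {question}\n\nPassage: {chunk}"},
        ],
    )
    try:
        return int(resp.choices[0].message.content.strip())
    except ValueError:
        return 0  # treat unparseable replies as irrelevant

def deep_dive(question: str, candidates: list[str],
              min_rating: int = 7, max_chunks: int = 12) -> list[str]:
    rated = [(rate_chunk(question, c), c) for c in candidates]  # one chunk per call
    rated = [rc for rc in rated if rc[0] >= min_rating]         # drop low ratings
    rated.sort(key=lambda rc: rc[0], reverse=True)
    return [c for _, c in rated[:max_chunks]]                   # best chunks become context
```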
This works amazingly well. I actually thought I had conceived the concept of "Deep Dive", but it looks like you beat me to it! Anyway, thank you so much for this contribution. It turned out to be the key to my issues with getting "comprehensive" responses.
Out of curiosity, can I ask how specifically you evaluate the relationship?
My system is very user-defined. So, I kind of play around to develop a System Message which the model will use, along with the original query, to give each content chunk a ranking.
Based on trial and error, I come up with a ranking level threshold which has the best chance of selecting the most relevant chunks.
Note that this System Message could be different depending upon the general type of documents being processed. Your provided example(s) should reflect this.
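To make that trial-and-error step concrete, here is a toy sketch: if you keep a few hand-labeled (rating, actually-relevant) pairs from past runs, you can sweep candidate thresholds and see where precision holds up for your document type. The labels and numbers below are made up for illustration.

```python
# Toy threshold sweep over hand-labeled (rating, actually_relevant) pairs.
labeled = [
    (9, True), (8, True), (7, True), (7, False),
    (6, False), (5, True), (4, False), (2, False),
]

def precision_at(threshold: int) -> float:
    # Of the chunks we would keep at this threshold, how many were truly relevant?
    kept = [relevant for rating, relevant in labeled if rating >= threshold]
    return sum(kept) / len(kept) if kept else 0.0

for t in range(4, 10):
    print(t, round(precision_at(t), 2))
# Pick the lowest threshold that still keeps precision acceptable for your documents.
```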
This is what I was thinking of as an alternative with regard to using mamba2 or better. I feel future systems will not just be MoE, but will also have interoperability between transformer and SSM architectures.
There need to be more programmers like you figuring out these problems so that we can get past AGI soon.
Why zero to ten? 10 has 2 digits and kind of falls outside the 0-9 row.
No logical reason. Habit? As usual, you bring up an excellent point. Thanks!
It's way easier to work with a static number of digits. You never know where the project will end up.
Just in case you think no one is paying attention when you provide these very detailed posts:
- Reasoning Behind āComprehension Levelā, āFilter Resultsā and āDeep Diveā Strategies.
- RAG is failing when the number of documents increase - #24 by sergeliatko
- Ideally, your chunks should contain only one idea at a time plus a means to trace them back to their source, so you can pull more context into the prompt if needed (because atomic idea chunks often are not enough to answer complex questions). So your retrieval becomes multi-step: find the chunks that match the query, pull more context from the chunks' source references, build the prompt, and only then answer.
- Hierarchical | Semantic Chunking (Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop) is designed to create "atomic idea" chunks.
- Comprehension Level (RAG is not really a solution - #97 by SomebodySysop) retrieves chunks adjacent (within a set radius) to the key chunk (the chunk identified by the query) in order to provide more context to the key chunk (see the sketch after this list).
- This idea was initially inspired by this post: Retrieving āAdjacentā Chunks for Better Context - Support - Weaviate Community Forum
- Don't cut the number of results based on their similarity to the query, but rather on their "usefulness": increase the number of chunks you pull out. Then, to make them fit into your prompt, instead of pushing all the results in, select those that either contain the answer or additional information that helps improve the answer. This way you trim your results in an extra step, but you improve the quality of the answer. Once the results are pulled out, run them in parallel against a model trained to evaluate their "usefulness", then select only the ones that pass the test.
- Deep Dive (RAG is failing when the number of documents increase - #26 by SomebodySysop) automatically increases the number of results returned, while Filter Results (Using gpt-4 API to Semantically Chunk Documents - #172 by SomebodySysop) filters the returned chunks based upon their relevance to the question.
- The result of these two combined is answers that are more comprehensive and aren't diluted by unnecessary noise. You may expand your search to 100 or 200 chunks, but each chunk is examined to determine its relevance to the actual question. So, in reality, you end up returning only a dozen or fewer highly relevant chunks to the model, thus dramatically decreasing your prompt size while increasing the quality of your responses.
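For anyone who wants a picture of the Comprehension Level step referenced above, here is a rough sketch of adjacent-chunk retrieval within a radius, assuming each chunk records its document id and its sequential index in that document; this is an illustration, not the actual implementation.

```python
# Sketch of "Comprehension Level" style adjacent-chunk retrieval: pull the
# chunks within a set radius of the key chunk. Assumes each chunk dict carries
# a "doc_id" and a sequential "index" within its source document.

def with_adjacent(key_chunk: dict, all_chunks: list[dict], radius: int = 2) -> list[dict]:
    doc_id, idx = key_chunk["doc_id"], key_chunk["index"]
    window = [
        c for c in all_chunks
        if c["doc_id"] == doc_id and abs(c["index"] - idx) <= radius
    ]
    return sorted(window, key=lambda c: c["index"])  # keep original reading order
```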
This combination of tools has been working fabulously. Thank you, again, for your contributions!
Thanks for the feedback. Pleasure to read that. Yes, sometimes I'm a bit too wordy, but the subject is a tricky one. It's just crazy how all of us went from "how do I complete the phrase?" to data-driven RAG workflow optimisations in a couple of years…