The LlamaIndex code above just uses OpenAI and makes more sense to me, so I might have to dig in there. It looks straightforward, but it does involve a fine-tune (see the comments/approach I have below).
The OPT-IML paper seems focused on creating a framework to evaluate the performance of newly generated instruct models.
Everyone seems worried about attention over large contexts.
I get it, but it seems I think about this differently …
So assume you had several large 2k-4k token contexts retrieved from your RAG, crossed each of them with the query to get new (and very likely smaller) projections out of the model, and then reconciled those smaller projections at the end.
So for example:
{Big Chunk 0 from RAG} x {Query} = {Projected answer 0}
{Big Chunk 1 from RAG} x {Query} = {Projected answer 1}
…
{Big Chunk N from RAG} x {Query} = {Projected answer N}
Here “x {Query}” just means prompting the LLM to draw an answer from that Big Chunk.
Then take a final prompt:
Based on the hypothetical answers here:
{Projected answer 0}
{Projected answer 1}
...
{Projected answer N}
What is the most likely consistent, correct, and honest answer to this question:
{Query}
So all your attention is localized to 2k-4k token chunks (not a crazy amount IMO), and you have some “reconciler” operation at the end. This is only for truly massive amounts of data that you want it to comb through; otherwise you just have the one big chunk and don’t need to reconcile.
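In case it helps, here is a rough Python sketch of the per-chunk projection plus reconciler flow, using the OpenAI chat-completions client. The model name, prompt wording, and helper names are just placeholders, not a real implementation; the chunk list is whatever your retriever hands back.

```python
# Minimal sketch of the per-chunk "projection" + reconciler idea, using the
# OpenAI Python SDK. Prompts and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any chat-completions model works here


def project_answer(chunk: str, query: str) -> str:
    """{Big Chunk i} x {Query} -> {Projected answer i}: draw an answer from one chunk."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{chunk}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content


def reconcile(projected: list[str], query: str) -> str:
    """Final reconciler prompt over the N projected answers."""
    joined = "\n\n".join(f"Hypothetical answer {i}:\n{a}" for i, a in enumerate(projected))
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Based on the hypothetical answers here:\n{joined}\n\n"
                f"What is the most likely consistent, correct, and honest answer "
                f"to this question:\n{query}"
            ),
        }],
    )
    return resp.choices[0].message.content


def answer(query: str, chunks: list[str]) -> str:
    # Each chunk stays in its own 2k-4k token window; attention never spans all of them.
    projections = [project_answer(c, query) for c in chunks]
    if len(projections) == 1:  # one big chunk: no reconciliation needed
        return projections[0]
    return reconcile(projections, query)
```

The point is that attention only ever spans one chunk plus the query, and the reconciler only ever sees the N short projected answers.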
You could also tag previously vetted true answers as [“PROCEED” given {Query}]; if the new query correlates and the answer correlates, you have a high confidence factor and can return the new answer. Later, go back and check whether you still agree it was a good answer, and if not, relabel it as [“HALT” given {Query}].
Alternatively, you can just use this embedding procedure and correlation to pull up the previous PROCEED/HALT tags and pick the most correlated answer with a PROCEED tag, without a follow-up “reconciler” query (saves some latency).
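A sketch of that embedding shortcut might look like this; the shape of the vetted store, the embedding model name, and the 0.9 threshold are all just assumptions for illustration.

```python
# Sketch of the embedding-correlation shortcut: reuse previously vetted
# (query, answer) pairs tagged PROCEED/HALT instead of running the reconciler.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # assumption: any embedding model works


def embed(text: str) -> np.ndarray:
    v = client.embeddings.create(model=EMBED_MODEL, input=text).data[0].embedding
    return np.asarray(v)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cached_answer(query: str, vetted: list[dict], threshold: float = 0.9) -> str | None:
    """vetted items look like {"query_vec": ..., "answer": ..., "tag": "PROCEED"|"HALT"}."""
    q = embed(query)
    proceed = [v for v in vetted if v["tag"] == "PROCEED"]
    if not proceed:
        return None
    best = max(proceed, key=lambda v: cosine(q, v["query_vec"]))
    if cosine(q, best["query_vec"]) >= threshold:
        return best["answer"]  # high-confidence reuse, skip the reconciler
    return None                # fall back to the full chunk/reconcile path
```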
All this labeling (and I’ve done it) happens through a simple script I run periodically on the database: I just hit 0 or 1 at the terminal for each item, the database is instantly updated, and the labels can be used for future queries.
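Nothing fancy; the script is roughly along these lines, assuming (purely for illustration) the answers sit in a SQLite table with a nullable tag column.

```python
# Rough sketch of the periodic 0/1 labeling pass. The table name and schema
# `answers(id, query, answer, tag)` are hypothetical placeholders.
import sqlite3

db = sqlite3.connect("answers.db")

rows = db.execute(
    "SELECT id, query, answer FROM answers WHERE tag IS NULL"
).fetchall()

for row_id, query, answer in rows:
    print(f"\nQ: {query}\nA: {answer}")
    label = input("Good answer? [1=PROCEED / 0=HALT] ").strip()
    tag = "PROCEED" if label == "1" else "HALT"
    db.execute("UPDATE answers SET tag = ? WHERE id = ?", (tag, row_id))
    db.commit()  # commit immediately so the tag is usable for future queries

db.close()
```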
I guess I believe in more of a hands-on approach before delving into the models and fine-tuning them to retrieve alternative data. I believe some would call my approach data-driven more than anything.
Thoughts?