A good way to stay up to date on papers in your area of interest: find some very well-cited papers on your particular topic, use reverse citation search ("Cited by") on Google Scholar, and sort by date.

E.g., I’m interested in RAG, and this paper is one of a few that are very well known and would likely get cited by something I’m interested in.

https://scholar.google.ca/scholar?hl=en&as_sdt=2005&sciodt=0,5&cites=10679876450978666441&scipsc=&q=&scisbd=1

I have quite a few that I cycle through, as folks don’t always remember to cite these root papers.

2 Likes

@curt.kennedy

The latest reference architecture I have floating in my head is basically to use a blend of embeddings and keywords, with extensive use of HyDE to steer the query.

Yeah, getting the LLM to generate keywords is hugely important. But HyDE is just a very narrow start of that, IMHO. The topic around it, I believe, is much vaster.

Dense:

  • Embeddings, take your pick, I would just use ada-002

I used to think so too, until someone introduced me to the MTEB Leaderboard (a Hugging Face Space by mteb), and then everything changed dramatically overnight. In particular, smaller embedding models are very fast and surprisingly powerful.
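To make the small-model option concrete, here's a minimal dense-retrieval sketch assuming the `sentence-transformers` package; `BAAI/bge-small-en-v1.5` is just an example of the kind of small model that ranks well on MTEB, so swap in whatever the leaderboard favours for your task:

```python
# Minimal dense leg: embed chunks with a small open model and rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim, fast on CPU

docs = [
    "RRF combines ranked lists from multiple retrievers.",
    "HyDE embeds a hypothetical answer instead of the raw query.",
    "BM25 is a classic sparse ranking function.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)              # (n_docs, 384)
query_vec = model.encode(["how do I fuse dense and sparse results?"],
                         normalize_embeddings=True)                   # (1, 384)

scores = doc_vecs @ query_vec.T      # cosine similarity, since the vectors are unit-normalized
print(docs[int(np.argmax(scores))])
```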

Sparse:

I’ve messed around a lot here, and I think there’s stuff to do, but I’ve realized that my first step is to master the standards (straight BM25 / TF-IDF / semantic embeddings) before engaging at the edges. We’re all at different stages in different parts.
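For what it's worth, the standard BM25 baseline is only a few lines; this sketch assumes the `rank_bm25` package and naive whitespace tokenization:

```python
# Sparse leg: standard BM25 over whitespace-tokenized chunks.
from rank_bm25 import BM25Okapi

chunks = [
    "Reciprocal rank fusion merges several ranked lists.",
    "ada-002 embeddings are 1536-dimensional.",
    "TF-IDF weights terms by how rare they are across the corpus.",
]
bm25 = BM25Okapi([c.lower().split() for c in chunks])

query = "how does rank fusion work".lower().split()
scores = bm25.get_scores(query)                      # one score per chunk
best = max(range(len(chunks)), key=lambda i: scores[i])
print(chunks[best])
```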

Deepen the search with what I call “HyDE projections” (HyDRA-HyDE ??? :rofl:)

  • Let’s say you have 5 different common views of a subject. Ask the LLM to generate answers from these 5 perspectives (HyDE), i.e., re-frame the question from these additional perspectives. This re-framing is all you really need, I think, over a fine-tune, because you are reshaping the data to align with the query through this steering. A lot of your papers mention fine-tuning as the answer, but I think re-framing from a fixed set of perspectives that you define can be just as powerful. If your subject domain is super rare and unknown to the model, then maybe in that case you need a fine-tune.

Yeah, the possibilities here are near infinite, and I’m sure much will be written about this topic by many very smart people, though it’s tricky to draw conclusions because of the lack of explainability and the ‘prompt engineering’ involved. I am guessing Gemini will do a lot here, but that’s just a hunch.

I don’t think the papers I quoted stress fine-tuning, except maybe a couple. There’s a lot to be done around training / I guess fine-tuning the retriever (e.g., contrastive learning), but IMHO that’s more pre-training / transfer learning than the ‘fine-tuning’ that’s frequently meant in the context of LLMs. Even then, training retrievers is tricky and I have yet to discern the mysteries.
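To make the "fixed set of perspectives" idea concrete, here's a rough sketch of that HyDE-projection step; `llm(prompt)` and the `PERSPECTIVES` list are hypothetical stand-ins for whatever chat model and viewpoints you actually use:

```python
# HyDE projections: generate one hypothetical answer per fixed perspective,
# then use each text (plus the raw query) for a dense and a sparse pull.

PERSPECTIVES = ["a lawyer", "an accountant", "a support engineer",
                "a new customer", "a product manager"]

def hyde_projections(query: str, llm) -> list[str]:
    """Return the original query plus one hypothetical answer per perspective (5 + 1 pulls)."""
    pulls = [query]
    for who in PERSPECTIVES:
        prompt = (f"Answer the following question as {who} would, in 3-4 sentences. "
                  f"Guessing is fine; this text is only used for retrieval.\n\n{query}")
        pulls.append(llm(prompt))
    return pulls
```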

So in this scenario, you take the original query and generate the 5 other queries (5+1), so you have 6 different pulls:

  • 6 embedding (dense) pulls
  • 6 keyword (sparse) pulls

So you have 12 streams to reconcile, and you just use RRF to do this.

Yes, we are as one on this part. Diverse retrievers together are greater than the sum of their parts.

Each stream can be weighted differently by adjusting the k constant in the denominator of RRF; a larger k for a stream down-weights its contribution.
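The weighted fusion itself is tiny to implement; in this sketch each stream gets its own k, and the stream names and example ids are made up:

```python
# Weighted reciprocal rank fusion over the dense/sparse result streams.
from collections import defaultdict

def rrf(streams: dict[str, list[str]], k_per_stream: dict[str, float]) -> list[str]:
    """streams maps stream name -> ranked list of chunk ids (best first)."""
    scores: dict[str, float] = defaultdict(float)
    for name, ranking in streams.items():
        k = k_per_stream.get(name, 60.0)               # 60 is the common default
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: trust the dense pull on the raw query more than one of the sparse HyDE pulls.
fused = rrf({"dense_raw": ["c3", "c1", "c7"], "sparse_hyde_2": ["c1", "c9", "c3"]},
            {"dense_raw": 20.0, "sparse_hyde_2": 90.0})
```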

I’m looking for papers that talk about the different things to be done here in particular, if you run across any.

3 Likes

Thank you for this philosophical beauty:

And thank you for sharing this:

Just spent the last 30 minutes getting lost browsing the leaderboard, very interesting! They even have one for Danish bitext mining! (that includes weird dialects that GPT doesn’t understand)

Could you explain the MTEB leaderboard for the uninitiated?

1 Like

The latest reference architecture I have floating in my head is basically to use a blend of embeddings and keywords, with extensive use of HyDE to steer the query.

Dense:

  • Embeddings, take your pick, I would just use ada-002

Sparse:

Deepen the search with what I call “HyDE projections” (HyDRA-HyDE ??? :rofl:)

  • Let’s say you have 5 different common views of a subject. Ask the LLM to generate answers from these 5 perspectives (HyDE), i.e., re-frame the question from these additional perspectives. This re-framing is all you really need, I think, over a fine-tune, because you are reshaping the data to align with the query through this steering. A lot of your papers mention fine-tuning as the answer, but I think re-framing from a fixed set of perspectives that you define can be just as powerful. If your subject domain is super rare and unknown to the model, then maybe in that case you need a fine-tune.

So in this scenario, you take the original query and generate the 5 other queries (5+1), so you have 6 different pulls:

  • 6 embedding (dense) pulls
  • 6 keyword (sparse) pulls

So you have 12 streams to reconcile, and you just use RRF to do this.

Each stream can be weighted differently by adjusting the k constant in the denominator of RRF; a larger k for a stream down-weights its contribution.

1 Like

I recommend using a chrome extension called “Sprint Reader” to read papers like this.

It’s an RSVP (rapid serial visual presentation) tool.
And no, it’s not my extension, but it is useful.
When you get good at using it, you can read entire books in 10 minutes.

Mentioned because I think it’s an interesting read.

#Edit: Be careful how high you set the WPM.
If it exceeds your monitor’s refresh rate, it will skip words.
So it might seem like 3,600 WPM is stable, but- /shrug

For discussion, let’s use Discussion thread for "Foundational must read GPT/LLM papers" - #9 by Foxabilo

This thread is for links to papers only (OP context around links is fine, and in fact encouraged), so we keep a high signal-to-noise ratio.

I’ll reply there.

1 Like

Could you explain the MTEB leaderboard for the uninitiated?

I always defaulted to ada by OpenAI, but then I discovered there was this competition (the MTEB leaderboard) to make more and better embeddings. Some of the top-scoring ones on that leaderboard are hyper powerful and will probably get even more so.

Embeddings themselves are very, very powerful and are probably revolutionizing things faster than the mega GPTs. Anyone can use them: just encode some query text and call sklearn.metrics.pairwise.cosine_similarity(matrix, encoded_query), where matrix is a stack of vectors representing encoded bits of text from your knowledge base. Highest cosine similarity wins.
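Spelled out, that really is the whole trick (random vectors below stand in for real embeddings):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

matrix = np.random.rand(1000, 1536)        # 1,000 encoded knowledge-base chunks
encoded_query = np.random.rand(1, 1536)    # the encoded user query

sims = cosine_similarity(matrix, encoded_query)   # shape (1000, 1)
best_chunk = int(np.argmax(sims))                 # highest similarity wins
```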

What’s particularly intriguing about embeddings is their uncanny ability to find semantic similarity. That is, to almost reason that two things are talking about the same thing.

Really wild stuff.

Facebook released FAISS, which facilitates a lot of this. Commoditizing their complements, I suspect.

I sometimes visit the leaderboard. But while it’s cool to see small, powerful models, the one issue I have is that they all have a 512-token limitation.

Here you can see ada-002 with its massive 8k-token allowance:

But you make a good point in that the dimensions these models produce are much smaller, and even at 1024 dimensions, you still see a significant search speedup compared to ada’s 1536 dimensions.

The reason I like the larger token allowance is that I want to embed large chunks, say 2k-4k tokens each. This helps with keyword search too, because the rarity index you end up creating on the chunk winds up being stats on the words inside that chunk. So the bigger the chunk, the more significant the stats.

Plus, with smaller chunks, your RAG gets scatterbrained and non-coherent. So GO BIG OR GO HOME :rofl:

The goal is to push big, coherent chunks through the model, and the model essentially acts as a filter to produce the output. So BIG → small in this filtering operation.

I agree the topic is vast, but HyDE is so damn easy and powerful as a keyword generator. So I’m looking at the low-hanging fruit.

The higher-hanging stuff is to completely get rid of vector databases and all these search algorithms, and have your own personal, continuously adjusted AI model that essentially generates your content on the fly from each query. That’s probably where a lot of folks will want to go.

My only concern with that approach is that the AI weights are compressing your information, so unless you run massive models, you will likely notice compression loss artifacts. But this is for larger data sets.

If you have a small business, or a small collection of facts, the 100% AI-based retriever might be the way to go. So it’s a Moore’s Law waiting game, I suppose, for the larger-data-set version to be widely available. But lots of folks could probably get by with the small version right now … would be curious to see how these systems perform. :face_with_monocle:

3 Likes

I’ve found smaller chunks mean greater precision and allow me to blend my various retrievers with finer granularity. I find BM25/TF-IDF (which aren’t limited token-wise) can compete with DPR when the text gets longer: semantic meaning gets flattened, and there are more opportunities for alignment along lexical lines.

Perhaps if my query were particularly long, but I haven’t run into a situation yet where it needs to be longer than a couple of hundred tokens at most. I’m sure there are use cases, but it’s not an issue I’ve encountered or am familiar with.

The higher-hanging stuff is to completely get rid of vector databases and all these search algorithms, and have your own personal, continuously adjusted AI model that essentially generates your content on the fly from each query. That’s probably where a lot of folks will want to go.

Or they want to go in the other direction :slight_smile: I think there is something to be said for human generated content, after all - that’s why all of this works, right?

Tools like GPT-4 that can synthesize it are extremely useful, but it’s my feeling they need to start citing their sources. There is a moral obligation for people to do it; I’m not sure AI should get a free pass. I suspect when Gemini is released it will move fairly hard in this direction.

2 Likes

Personally, I find ada’s longer token limit better for overlapping chunks on large documents, legal texts, company reports, etc. I’m able to include more of each chunk as overlap and catch more of the semantic relevance that is normally lost at the chunk boundaries.
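If anyone wants to play with the overlap idea, here's a rough sketch assuming `tiktoken` (cl100k_base is the ada-002 tokenizer); the 2,000-token window and 400-token overlap are just placeholder numbers:

```python
# Token-window chunker with overlap between consecutive chunks.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_with_overlap(text: str, chunk_tokens: int = 2000, overlap: int = 400) -> list[str]:
    """Split text into ~chunk_tokens pieces, each sharing `overlap` tokens with the previous one."""
    tokens = enc.encode(text)
    chunks, step = [], chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```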

2 Likes

Smaller chunks also mean more vectors. And each vector takes time to process.

From an information-capacity perspective, if I can fit 4,000 tokens into each vector vs. 500, I have 8x more information per vector. So if I had 100,000 such embeddings (which isn’t a huge number), I have 400,000,000 tokens of information for the LLM to shape! This is equivalent to 3,000 different 400-page books!

This is in contrast to the 50,000,000 tokens (375 books) you would have with 500-token chunks. So for the information content to be equal, you would need 800,000 vectors, which is starting to get up there. I’m not sure of the speedup from correlating your shorter vectors, since they’re about 1/3rd the size; if it’s quadratic (worst case), that’s a 9x speedup vs. my 8x more data. So it’s pretty much a wash performance-wise.
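For anyone following along, the back-of-the-envelope math above works out as follows (the ~250 words per page and ~1.33 tokens per word figures are my assumptions, not the post's):

```python
big_chunk, small_chunk, n_vectors = 4_000, 500, 100_000

tokens_big = big_chunk * n_vectors              # 400,000,000 tokens
tokens_small = small_chunk * n_vectors          # 50,000,000 tokens
vectors_for_parity = tokens_big // small_chunk  # 800,000 small-chunk vectors

book_tokens = 400 * 250 * 4 / 3                 # ~133k tokens per 400-page book
print(round(tokens_big / book_tokens))          # ~3,000 books
print(round(tokens_small / book_tokens))        # ~375 books
```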

The other consideration: since I would use RRF, with keywords as the dual stream, I need bigger chunks to reduce the quantization in the keyword representations.

But another consideration is that I do NOT want mismatch between embedding chunks and keyword chunks. I want these chunks to be identical.

You could try smaller chunks for embeddings and larger chunks for keywords, but with this mismatch it gets weird when you compare and try to reconcile which chunk you are going to retrieve. It creates such an imbalance algorithmically that you are going to have to get really creative to put these disparate chunk sizes on the same playing field. Because if you don’t have big chunks, your keyword leg will be crippled, and you might as well drop it and go 100% embeddings.

Your queries from the user will all be small. But this is where HyDE comes in, especially “HyDRA-HyDE”, where you are spinning so many projections off the initial query that you can really create some beefy chunks for the correlation engines to reconcile.

One thought I had about the “million little facts” situation you have proposed is to cluster the facts based on semantics (embeddings), create larger bundles from these correlated facts, and then create one large embedding vector and one keyword document per bundle. This produces fewer vectors overall, and the semantic similarity would (hopefully) keep the AI model following the retriever on-message and coherent.

Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

PS: In the “million little facts” situation where you bundle, if you concatenate the text for the similar vectors in the bundle (say, using the smaller-dimensional embedding models), all you need to do is take the average of the vectors as the new vector representation (maybe scale it to a unit vector too), since they are semantically related (close spatially). And BAM, now you have a fast vector leg (smaller dimensions) and a rich keyword document. It’s the best of both worlds! :scream_cat:
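A rough sketch of that bundling step, using scikit-learn's KMeans purely as an example clusterer (the bundle count is arbitrary):

```python
# Cluster fact embeddings, concatenate text per cluster, and average + re-normalize the vectors.
import numpy as np
from sklearn.cluster import KMeans

def bundle_facts(facts: list[str], vecs: np.ndarray, n_bundles: int = 100):
    """vecs: (n_facts, dim) unit-normalized embeddings, one row per fact."""
    labels = KMeans(n_clusters=n_bundles, n_init=10).fit_predict(vecs)
    bundles = []
    for b in range(n_bundles):
        idx = np.where(labels == b)[0]
        text = "\n".join(facts[i] for i in idx)       # keyword document for the bundle
        centroid = vecs[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid)          # scale back to a unit vector
        bundles.append((text, centroid))              # one vector + one doc per bundle
    return bundles
```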

2 Likes

@curt.kennedy

Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

I’ll have to think about this, it’s very intriguing. It’s the complete opposite of what I’ve seen: big seems to fail as attention drifts very painfully, especially when details really matter (e.g., code generation). Perhaps some type of blending of our approaches is the way to go?

Diverse ideas FTW. Will noodle.

Cool bit of prompt engineering, and in a way apropos

I’ll post some more noisy papers that I run across in this thread. Anyone should feel free to move / repost them to the other thread if they think they’re worth it. I may do so myself if discussion / feedback here warrants it.

In general, the guiding principle, IMHO: it’d be good to make the other thread worthy of Watching for most folks.

2 Likes

So much cool stuff in this Twitter thread. Once you start evaluating models layer by layer, all sorts of fascinating things reveal themselves.

Below is a graph from an ablation study of sorts on a 70B Platypus model. MAP@3 is an eval metric for a particular task, and the graph shows what happens when you only use the first 1-N layers of the model (the head is still attached, of course); the technique is called “early exit”.

[image: MAP@3 vs. number of layers used]
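If you want to try early exit yourself, one blunt way to approximate it with a LLaMA-style Hugging Face model is to truncate the decoder blocks; this is my sketch of the general idea, not the exact setup behind that graph, and the model name is just a stand-in:

```python
# Crude "early exit" ablation: keep only the first n decoder layers, head still attached.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"            # stand-in; the tweet used a 70B Platypus model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

n = 16
model.model.layers = model.model.layers[:n]  # slice of the nn.ModuleList of decoder blocks
model.config.num_hidden_layers = n

inputs = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=5)[0]))
```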

2 Likes

The application I am thinking of is Customer Service / Hybrid-RAG (upper right corner).

So here it’s taking a wall of information, and shaping it to a probable answer or action. The fine-tune is to add tone/vocabulary in the output.

Here the attention and precision requirements aren’t as high as for code generation. So yeah, maybe code generation is inverted, where you take specific lines or chunks and insert them into whatever file is active.

2 Likes

No, I think you’re on to something and I suspect there is an opportunity for blending large and small context approaches if you have the budget for it.

For example, one idea might be to do some sort of review of a smaller precise answer. Sort of like - “Is the answer here consistent with the larger context?”
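Something like this, maybe; `llm(prompt)` is a hypothetical helper around whatever chat model you're using:

```python
def consistent_with_context(answer: str, big_context: str, llm) -> bool:
    """Ask the model whether the small, precise answer agrees with the larger context."""
    prompt = ("Context:\n" + big_context +
              "\n\nProposed answer:\n" + answer +
              "\n\nIs the proposed answer consistent with the context? Reply YES or NO.")
    return llm(prompt).strip().upper().startswith("YES")
```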

1 Like

Edit: I’ve got some great papers I’d like to post, but not sure it makes sense for me to just spam by myself here. I’m trying this on Discord again, this time on someone else’s channel. You can follow here: SciPhi. They are doing some very cool things around RAG, @curt.kennedy.

Will post some more here if some other folks do so, though.

1 Like

No worries mate!
You’re more than welcome to spam us with papers :laughing:

1 Like

Not sure there’s a lot of value in that; diverse perspectives FTW. Will post, though, if others do.

Here I suggest Textbooks Are All You Need, which again focuses on the quality of the training datasets.

Here is another paper about Reinforcement Learning using AI feedback, showing the power of AI evaluation.

Not a paper but very cool nonetheless.

This appears to be from some OpenAI presentation. Anyone have more information?

hat tip discord - SciPhi

1 Like