I sometimes visit the leaderboard. But while it’s cool to see small, powerful models, the one issue I have is that they all have a 512-token limit.
Here you can see ada-002 with its massive 8k-token allowance:
But you make a good point that the vectors these models produce are much smaller, and even at 1024 dimensions you still see a significant search speedup compared to ada’s 1536.
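To put a rough number on that, here’s a back-of-the-envelope sketch of brute-force cosine search at both dimensionalities (the 100k-vector corpus size and top-10 cutoff are arbitrary assumptions; a real ANN index will behave differently):

```python
import time
import numpy as np

# Toy benchmark: brute-force cosine search over a synthetic corpus.
# Corpus size (100k vectors) is just an assumption for illustration.
N, QUERIES = 100_000, 100

for dim in (1536, 1024):
    rng = np.random.default_rng(0)
    corpus = rng.standard_normal((N, dim), dtype=np.float32)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    queries = rng.standard_normal((QUERIES, dim), dtype=np.float32)
    queries /= np.linalg.norm(queries, axis=1, keepdims=True)

    start = time.perf_counter()
    scores = queries @ corpus.T                          # cosine similarity (unit vectors)
    top10 = np.argpartition(-scores, 10, axis=1)[:, :10]  # top-10 per query, unordered
    elapsed = time.perf_counter() - start
    print(f"dim={dim}: {elapsed:.3f}s for {QUERIES} queries")
```

Brute-force scoring is basically one big matrix multiply, so the time scales roughly with the dimension.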
The reason I like the larger token limit is that I want to embed large chunks, say 2k-4k tokens each. This helps with keyword search too, because the rarity index you end up building on a chunk winds up being stats on the words inside that chunk. So the bigger the chunk, the more significant the stats.
Plus, with smaller chunks, your RAG gets scatterbrained and incoherent. So GO BIG OR GO HOME
The goal is to push big, coherent chunks through the model, and the model essentially acts as a filter to produce the output. So BIG → small in this filtering operation.
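Roughly what I have in mind, as a sketch (the paragraph-packing chunker, the ~3k-token target, and the plain term counts are all just assumptions for illustration; tiktoken does the token counting and ada-002 does the BIG → small squeeze):

```python
from collections import Counter

import tiktoken                      # OpenAI tokenizer, used here only to count tokens
from openai import OpenAI            # assumes the openai>=1.0 client

enc = tiktoken.encoding_for_model("text-embedding-ada-002")
client = OpenAI()

def big_chunks(text: str, target_tokens: int = 3000) -> list[str]:
    """Greedily pack paragraphs into roughly 3k-token chunks (the 2k-4k range)."""
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        n = len(enc.encode(para))
        if current and current_len + n > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)        # note: a single oversized paragraph still becomes its own chunk
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def index_chunk(chunk: str) -> dict:
    # Keyword side: term frequencies over the whole chunk; the bigger the
    # chunk, the more words contribute to these stats.
    term_stats = Counter(chunk.lower().split())
    # Embedding side: the model "filters" the BIG chunk down to one small vector.
    vector = client.embeddings.create(
        model="text-embedding-ada-002", input=chunk
    ).data[0].embedding              # 1536 floats, regardless of chunk size
    return {"text": chunk, "terms": term_stats, "embedding": vector}
```

Every chunk, whether it’s 2k or 4k tokens, comes out the other side as the same 1536-number vector, which is exactly the filtering operation I mean.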
I agree the topic is vast, but HyDE (Hypothetical Document Embeddings) is so damn easy and powerful as a keyword generator. So I’m looking at the low-hanging fruit.
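As a sketch of HyDE used purely as a keyword generator (the prompt wording, the gpt-4o-mini model name, and the tiny stopword list are placeholders, not anything canonical):

```python
import re
from collections import Counter

from openai import OpenAI   # assumes the openai>=1.0 client

client = OpenAI()

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "that", "this", "for", "it", "on", "with"}

def hyde_keywords(query: str, top_k: int = 10) -> list[str]:
    """HyDE as a keyword generator: have the model write a hypothetical
    answer, then treat its most frequent content words as search keywords."""
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",     # placeholder model name
        messages=[{"role": "user",
                   "content": f"Write a short passage that plausibly answers: {query}"}],
    ).choices[0].message.content

    # Crude keyword extraction from the hypothetical document.
    words = re.findall(r"[a-z0-9]+", hypothetical.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(top_k)]
```

In practice you’d feed those terms straight into BM25 or whatever rarity index you’ve built over the big chunks.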
The higher-hanging stuff is to completely get rid of vector databases and all these search algorithms, and instead have your own personal, continuously adjusted AI model that essentially generates your content on the fly from each query. That’s probably where a lot of folks will want to go.
My only concern with that approach is that the model weights are compressing your information, so unless you run massive models, you will likely notice lossy-compression artifacts. But that’s a concern for larger data sets.
If you have a small business, or a small collection of facts, the 100% AI-based retriever might be the way to go. So it’s a Moore’s Law waiting game, I suppose, before the larger-data-set version is widely viable. But lots of folks could probably get by with the small version right now … would be curious to see how these systems perform.