For discussion, let’s use Discussion thread for "Foundational must read GPT/LLM papers" - #9 by Foxabilo

This thread is for links to papers only (OP context around the links is fine, and in fact encouraged), so we keep a high signal-to-noise ratio.

I’ll reply there.

1 Like

Could you explain the MTEB leaderboard for the uninitiated?

I always defaulted to ada by OpenAI, but then I discovered there was this competition (the MTEB leaderboard) to make more and better embeddings. Some of the top-scoring models on that leaderboard are extremely powerful and will probably get even more so.

Embeddings themselves are very, very powerful and are probably revolutionizing things faster than the mega GPTs. Anyone can use them: just encode some query text and call sklearn.metrics.pairwise.cosine_similarity(matrix, encoded_query), where matrix is a stack of vectors representing encoded bits of text from your knowledge base. Highest cosine similarity wins.
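To make that concrete, here is a minimal sketch of the lookup. The `embed()` function is just a deterministic stand-in for whatever embedding call you actually use (ada-002, a sentence-transformers model, etc.), and the chunk texts are made up:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def embed(text: str) -> np.ndarray:
    # Stand-in for your real embedding call (ada-002, sentence-transformers, ...):
    # a deterministic random vector keyed on the text, just so this runs end to end.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(1536)

chunks = ["refund policy ...", "shipping times ...", "warranty terms ..."]
matrix = np.vstack([embed(c) for c in chunks])            # (n_chunks, dim)

query_vec = embed("how do I get my money back?").reshape(1, -1)
scores = cosine_similarity(matrix, query_vec).ravel()     # one similarity per chunk
best = int(np.argmax(scores))                             # highest cosine similarity wins
print(chunks[best], float(scores[best]))
```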

What’s particularly intriguing about embeddings is their uncanny ability to find semantic similarity. That is, to almost reason that two things are talking about the same thing.

Really wild stuff.

Facebook released FAISS, which facilitates a lot of this. Commoditizing their complements, I suspect.
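For larger knowledge bases, FAISS does the same nearest-neighbor lookup much faster. A rough sketch with a flat (exact) index, using random vectors as stand-ins for real embeddings; inner product equals cosine similarity here only because the vectors are L2-normalized first:

```python
import numpy as np
import faiss

dim = 1536                                               # e.g. ada-002 dimensions
vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(vectors)                              # unit-normalize in place

index = faiss.IndexFlatIP(dim)                           # exact inner-product search
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)                     # top-5 nearest chunks
```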

I sometimes visit the leaderboard. But while it’s cool to see small, powerful models, the one issue I have is they all have a 512-token limitation.

Here you can see that ada-002 has a massive 8k token allowance.

But you make a good point in that the embeddings these models produce have far fewer dimensions, and even at 1024 dimensions you still see a significant search speedup compared to ada’s 1536 dimensions.

The reason I like the larger token allowance is that I want to embed large chunks, say 2k-4k tokens each. This helps with keyword search too, because the rarity index you end up building on the chunk winds up being statistics on the words inside that chunk. So the bigger the chunk, the more significant the stats.

Plus, with smaller chunks, your RAG gets scatterbrained and incoherent. So GO BIG OR GO HOME :rofl:

The goal is to push big, coherent chunks through the model, and the model essentially acts as a filter to produce the output. So BIG → small in this filtering operation.

I agree the topic is vast, but HyDE is so damn easy and powerful as a keyword generator. So I’m looking at the low-hanging fruit.
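Here is roughly what I mean by HyDE as a keyword generator, sketched with the OpenAI Python client. The model name and prompt wording are illustrative only, it assumes an OPENAI_API_KEY is set, and the commented-out `keyword_index.search(...)` at the end is a hypothetical hook into whatever BM25 / tf-idf index you already have:

```python
from openai import OpenAI

client = OpenAI()

def hypothetical_document(query: str) -> str:
    # Ask the model to write a short passage that *would* answer the query.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Write a short, factual-sounding passage that answers:\n{query}",
        }],
    )
    return resp.choices[0].message.content

# The hypothetical passage is dense with on-topic keywords, so feed it to your
# keyword leg (and/or embed it for the vector leg) instead of the raw query.
doc = hypothetical_document("What is our refund window for online orders?")
# keyword_index.search(doc)   # hypothetical hook into your BM25 / tf-idf index
```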

The higher-hanging fruit is to completely get rid of vector databases and all these search algorithms, and have your own personal, continuously adjusted AI model that essentially generates your content on the fly from each query. That’s probably where a lot of folks will want to go.

My only concern with that approach is that the AI weights are compressing your information, so unless you run massive models, you will likely notice compression loss artifacts. But this is for larger data sets.

If you have a small business, or a small collection of facts, the 100% AI-based retriever might be the way to go. So it’s a Moore’s Law waiting game, I suppose, before the larger-data-set version is widely available. But lots of folks could probably get by with the small version right now … would be curious to see how these systems perform. :face_with_monocle:

3 Likes

I’ve found smaller chunks mean greater precision, and they allow me to blend my various retrievers with finer granularity. I find bm25/tf-idf (which aren’t limited, token-wise) can compete with DPR when the text gets longer. Semantic meaning gets flattened and there are more opportunities for alignment along lexical lines.
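As a concrete (if toy) example of that blending, here is one way to mix a BM25 score with a dense score. The corpus, the stand-in cosine scores, and the 0.5 blend weight are all made up; `rank_bm25` is just one convenient BM25 implementation:

```python
import numpy as np
from rank_bm25 import BM25Okapi

docs = ["the cat sat on the mat", "stock prices fell sharply", "cats are mammals"]
bm25 = BM25Okapi([d.split() for d in docs])

def normalize(x):
    # Min-max normalize so the two score scales are comparable.
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

query = "feline behaviour"
bm25_scores = normalize(bm25.get_scores(query.split()))
dense_scores = normalize(np.array([0.71, 0.12, 0.88]))   # stand-in cosine scores

alpha = 0.5                                   # blend weight, tune per corpus
blended = alpha * dense_scores + (1 - alpha) * bm25_scores
print(docs[int(np.argmax(blended))])
```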

Perhaps if my query were particularly long, but I haven’t run into a situation yet where it needs to be longer than a couple of hundred tokens at most. I’m sure there are use cases, but it’s never been an issue that I’ve encountered or am familiar with.

The higher-hanging fruit is to completely get rid of vector databases and all these search algorithms, and have your own personal, continuously adjusted AI model that essentially generates your content on the fly from each query. That’s probably where a lot of folks will want to go.

Or they want to go in the other direction :slight_smile: I think there is something to be said for human generated content, after all - that’s why all of this works, right?

Tools like GPT-4, which can synthesize it, are extremely useful, but it’s my feeling they need to start citing their sources. There is a moral obligation for people to do it; I’m not sure AI should get a free pass. I suspect when Gemini is released it will move fairly hard in this direction.

2 Likes

Personally I find ada’s long token length better for overlapping chunks on large documents, legal texts, company reports, etc. I’m able to include more of each chunk as overlap and catch more of the semantic relevance that is normally lost at the chunk boundaries.
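For anyone wanting to try the overlap approach, a minimal sliding-window chunker counted in tokens might look like this. The 3,000-token chunks with 500-token overlap are illustrative numbers, not a recommendation:

```python
import tiktoken

def chunk_with_overlap(text: str, chunk_tokens: int = 3000, overlap: int = 500):
    # Tokenize once, then slide a window that steps forward by (chunk - overlap).
    enc = tiktoken.encoding_for_model("text-embedding-ada-002")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```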

2 Likes

Smaller chunks also mean more vectors. And each vector takes time to process.

From an information-capacity perspective, if I can fit 4,000 tokens into each vector vs. 500, I have 8x more information per vector. So if I had 100,000 such embeddings (which isn’t a huge number), I have 400,000,000 tokens of information for the LLM to shape! This is equivalent to 3,000 different 400-page books!

This is in contrast to the 50,000,000 tokens you would have (375 books). So for the information content to be equal, you would need 800,000 vectors, which is starting to get up there. I’m not sure of the exact speedup from correlating your shorter vectors, since they are about a third the size. If it’s quadratic (worst case), you get a 9x speedup vs. my 8x more data per vector. So it’s pretty much a wash, performance-wise.
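A quick back-of-the-envelope check of that arithmetic (the ~133k tokens per 400-page book is the figure implied by the numbers above):

```python
tokens_big, tokens_small = 4_000, 500
vectors_big = 100_000

total_tokens = tokens_big * vectors_big               # 400,000,000 tokens
vectors_needed_small = total_tokens // tokens_small   # 800,000 vectors for parity
books_equivalent = total_tokens // 133_000            # ~3,000 books at ~133k tokens each

print(total_tokens, vectors_needed_small, books_equivalent)
```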

The other consideration: since I would use RRF (reciprocal rank fusion), with keywords as the dual stream, I need bigger chunks to reduce the quantization in the keyword representations.
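For reference, RRF itself is tiny. A minimal sketch, where each retriever contributes 1/(k + rank) per document and k = 60 is the commonly used constant; the chunk IDs are made up:

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    # Each ranking is a list of doc ids, best first; fuse them by summed 1/(k + rank).
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: fuse an embedding ranking with a keyword ranking over the same chunks.
fused = rrf([["chunk_7", "chunk_2", "chunk_9"],     # dense leg
             ["chunk_2", "chunk_7", "chunk_4"]])    # keyword leg
print(fused[0])   # the chunk both legs agree on floats to the top
```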

But another consideration is that I do NOT want mismatch between embedding chunks and keyword chunks. I want these chunks to be identical.

You could try smaller chunks with embeddings and larger chunks for keywords, but now with this mismatch it gets weird when you compare and try to reconcile which chunk you are going to retrieve. It creates such an imbalance algorithmically that you are going to have to get really creative to put these disparate chunk sizes on the same playing field. Because if you don’t have big chunks, your keyword leg will be crippled, and you might as well drop it and go 100% embeddings.

Your queries from the user will all be small. But this is where HyDE comes in, especially “HyDRA-HyDE”, where you spin many projections off the initial query and can really create some beefy chunks for the correlation engines to reconcile.

One thought I had about the “million little facts” situation you proposed is clustering the facts based on semantics (embeddings), creating larger bundles from these correlated facts, and then creating one large embedding vector for each bundle and one keyword document from the same bundle. This produces fewer vectors overall, and the semantic similarity would (hopefully) keep the AI model following the retriever on-message and coherent.

Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

PS: In the “million little facts” situation where you bundle: if you concatenate the text for similar vectors in the bundle, say using the smaller-dimensional embedding models, all you need to do is take the average of the vectors as the new vector representation (maybe rescale it to a unit vector too), since they are semantically related (close spatially). And BAM, now you have a fast vector leg (smaller dimensions) and a rich keyword document. It’s the best of both worlds! :scream_cat:
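A sketch of that bundling step, assuming the fact embeddings already exist. KMeans is just one arbitrary choice of clustering, and the bundle count is made up:

```python
import numpy as np
from sklearn.cluster import KMeans

def bundle_facts(facts, vectors, n_bundles=100):
    # Cluster the fact embeddings, then build one keyword document and one
    # averaged (unit-normalized) vector per cluster.
    vectors = np.asarray(vectors, dtype=float)
    labels = KMeans(n_clusters=n_bundles, n_init=10).fit_predict(vectors)

    bundles = []
    for b in range(n_bundles):
        idx = np.where(labels == b)[0]
        if len(idx) == 0:
            continue
        text = "\n".join(facts[i] for i in idx)     # rich keyword document
        centroid = vectors[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid)        # rescale to a unit vector
        bundles.append({"text": text, "vector": centroid})
    return bundles
```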

2 Likes

@curt.kennedy

Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

I’ll have to think about this, it’s very intriguing. It’s the complete opposite of what I’ve seen: big seems to fail as attention drifts very painfully, especially when details really matter (e.g., code generation). Perhaps some type of blending of our approaches is the way to go?

Diverse ideas FTW. Will noodle.

Cool bit of prompt engineering, and in a way apropos

I’ll post more of the noisier papers I run across in this thread. Anyone should feel free to move / repost them to the other thread if they think it’s worth it. I may do so myself if the discussion / feedback here warrants it.

In general, the guiding principle, IMHO, is that it’d be good to make the other thread worthy of Watching for most folks.

2 Likes

So much cool stuff in this Twitter thread. Once you start evaluating models layer by layer, all sorts of fascinating things reveal themselves.

Below is a graph of an ablation study of sorts on a 70B Platypus model. The map@3 metric is an eval metric on a particular task, and the graph shows what happens when you only use the first 1-N layers of the model (the head is still attached, of course). The technique is called “early exit”.

[image: map@3 vs. number of layers used in the early-exit ablation on the 70B Platypus model]

2 Likes

The application I am thinking of is Customer Service / Hybrid-RAG (upper right corner).

So here it’s taking a wall of information, and shaping it to a probable answer or action. The fine-tune is to add tone/vocabulary in the output.

Here the attention and precision requirements aren’t as high as code generation. So yeah, maybe code generation is inverted, where you take specific lines or chunks and insert them into whatever active file.

2 Likes

No, I think you’re on to something and I suspect there is an opportunity for blending large and small context approaches if you have the budget for it.

For example, one idea might be to do some sort of review of a smaller precise answer. Sort of like - “Is the answer here consistent with the larger context?”
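A minimal sketch of that kind of consistency review, using a second model call as the reviewer. The model name and prompt wording are illustrative only, and it assumes an OPENAI_API_KEY is set:

```python
from openai import OpenAI

client = OpenAI()

def is_consistent(answer: str, context: str) -> bool:
    # Ask a reviewer model whether the small, precise answer agrees with the
    # larger retrieved context; treat anything starting with YES as a pass.
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                "Context:\n" + context +
                "\n\nProposed answer:\n" + answer +
                "\n\nIs the proposed answer consistent with the context? Reply YES or NO."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```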

1 Like

Edit: got some great papers I’d like to post, but not sure it makes sense for me to just spam by myself here. I’m trying this on Discord again, this time on someone else’s channel. You can follow here: SciPhi. They are doing some very cool things around RAG, @curt.kennedy.

Will post some more here if some other folks do so, though.

1 Like

No worries mate!
You’re more than welcome to spam us with papers :laughing:

1 Like

Not sure there’s a lot of value there, diverse perspectives ftw. Will post though if others do.

Here I suggest Textbooks Are All You Need, which again focuses on the quality of the training datasets.

Here is another paper about Reinforcement Learning using AI feedback, showing the power of AI evaluation.

Not a paper but very cool nonetheless.

This appears to be from some OpenAI prez. Anyone have more information?

hat tip discord - SciPhi

1 Like

It is an intriguing line of thought. I had sort of expected more would be done along these lines, but Nvidia seems to have captured the market. I think the problem is you need a TSMC-type fab setup to do something more than a POC. Also, I think the issue is that, algorithm-wise, there is so much to be done and plumbed that you need a flexible substrate like GPUs to leverage, not to mention a research community with access to the hardware.

1 Like

I feel like nobody is talking about low-power neuromorphic solutions.

Look at the human brain … how much power does it take to run our brains?

Now look at LLMs … they take massive amounts of power, with tons of heat generated, to run.

So imagine small, interwoven nano-materials, operating at low power, running circles around the current most advanced LLMs.

So here, they are creating a dense physical set of nanowires, and training it!

Very novel, and probably where the future is headed.

Here is GPT-4-Turbo’s (gpt-4-1106-preview) answer:

Neuromorphic computing has the potential to be transformative for several key reasons:

  1. Energy Efficiency:
    Neuromorphic systems are designed to be highly energy efficient, emulating the low-power operation of the human brain. Traditional computing architectures consume a significant amount of power, especially for tasks such as image or signal processing, and machine learning. Neuromorphic chips can perform these tasks using much less energy, which is critical as we move toward more sustainable computing solutions and battery-powered devices.

  2. Speed and Real-Time Processing:
    Because neuromorphic computers process information in a manner similar to how neurons operate in the brain—using spikes of electricity—they can react to changes in data in real-time. This is particularly valuable for applications requiring immediate responses, such as autonomous vehicles, sensor networks, or robotic control systems.

  3. Parallel Processing Capabilities:
    The brain’s ability to perform massively parallel processing allows for the efficient handling of complex and noisy data. Neuromorphic computing seeks to replicate this, potentially providing much faster processing capabilities for tasks that involve pattern recognition, sensory data processing, and decision-making in unstructured environments.

  4. Adaptability and Learning:
    Neuromorphic chips are designed with plasticity in mind, meaning they can learn and adapt through mechanisms similar to synaptic plasticity in biological neurons. Such systems could continue to learn from their environment even after deployment, improving their performance over time and adapting to new situations without requiring explicit reprogramming.

  5. Edge Computing:
    With the proliferation of the Internet of Things (IoT) and edge devices, there is a growing need to process data at the source rather than sending it to the cloud. Neuromorphic systems could enable efficient on-device processing, leading to quicker responses and reduced data transmission costs while maintaining user privacy.

  6. Robustness to Noise and Faults:
    The brain is remarkably robust in the face of component failure or noisy signals. Neuromorphic computing systems could inherit similar robustness, gracefully dealing with hardware failures or data corruption, which would be highly beneficial for critical applications where reliability is paramount.

  7. Potential to Reshape AI and Machine Learning:
    The approach to artificial intelligence might see a paradigm shift with neuromorphic computing. The energy efficiency and continuous learning capabilities could allow for the deployment of AI in a wider array of environments, notably where power consumption or data privacy concerns currently limit the use of cloud-based AI services.

In summary, neuromorphic computing promises to create a new class of machines that can interact with the physical world in a more intelligent and efficient manner. Such technology has the potential to overcome some of the limitations of traditional computing architectures and support the continued growth of intelligent devices and systems, potentially leading to widespread technological innovation and changes in many sectors, including healthcare, transportation, industry, and consumer electronics.

2 Likes

About ~12 to 20 W, depending on what you’re doing. That’s remarkably efficient, but I don’t think it’s outside the realm of possibility of what may be achieved in our lifetimes :laughing:

Sure, but the one they built in the paper is actually not at the nanometer scale, it’s micrometers, so it’s fairly simple and easy to replicate :laughing:

1 Like

Whoo hoo! Yes!
Is it fair to say the bottom line is: ‘no, AGI isn’t here yet, but if you are willing to work with it, GPT-4 is capable of far more than naysayers realize’?
Haven’t read the whole thing yet, tl;dr, although it does look like a must-read. The key open problem here, IMHO, is how to get past the context-size limitation.
Sure, LLMs are a great aid for solving little short sub-problems, but how do you support a long-term research project? For example: ‘research lithium battery chemistry, find an open problem, and write a credible PhD thesis proposal on it’ (note: proposal, not thesis).
Key issues are how to structure long-term activity and the large amounts of highly structured, relationship-rich external data uncovered along the way. I don’t think embeddings and / or RAG get you very far. But maybe, again ‘if you are willing to work with it’, we can get further than many suspect.

Finally found what I was looking for: EleutherAI on Discord. They have an active community commenting on arXiv papers. Not very well organized, but there’s a lot of chatter. Check it out if you like geeking out over these things.

1 Like