Discussion thread for "Foundational must read GPT/LLM papers"

Smaller chunks also mean more vectors. And each vector takes time to process.

From an information capacity perspective, if I can fit 4,000 tokens into each vector, vs. 500, I have 8x more information per vector. So if I had 100,000 such embeddings (which isn’t a huge number), I have 400,000,000 tokens of information for the LLM to shape! This is equivalent to 3,000 different 400-page books!

This is in contrast to the 50,000,000 tokens you would have with 500-token chunks (375 books). So for the information content to be equal, you would need 800,000 vectors, which is starting to get up there. I’m also not sure of the speedup from correlating your shorter vectors, since they have 1/3rd the dimensions. If the cost is quadratic in dimension (worst case), you get a 9x speedup vs. my 8x more data. So it’s pretty much a wash performance-wise.
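A quick back-of-the-envelope check of the numbers above (the tokens-per-page and book-length figures are my assumptions, chosen to roughly match the counts in the post):

```python
# Back-of-the-envelope math for chunk size vs. vector count.
TOKENS_PER_PAGE = 333                    # assumption: ~333 tokens per page
TOKENS_PER_BOOK = 400 * TOKENS_PER_PAGE  # assumption: ~400-page book

big_chunk, small_chunk = 4_000, 500
n_vectors = 100_000

big_tokens = big_chunk * n_vectors       # 400,000,000 tokens
small_tokens = small_chunk * n_vectors   # 50,000,000 tokens

big_books = big_tokens // TOKENS_PER_BOOK      # ~3,000 books
small_books = small_tokens // TOKENS_PER_BOOK  # ~375 books
parity_vectors = big_tokens // small_chunk     # 800,000 small vectors needed
```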

The other consideration: since I would use RRF, with keywords as the dual stream, I need bigger chunks to reduce quantization in the keyword representations.

But another consideration is that I do NOT want mismatch between embedding chunks and keyword chunks. I want these chunks to be identical.

You could try smaller chunks for embeddings and larger chunks for keywords, but with this mismatch it gets weird when you compare results and try to reconcile which chunk you are going to retrieve. It creates such an algorithmic imbalance that you’d have to get really creative to put these disparate chunk sizes on the same playing field. Because if you don’t have big chunks, your keyword leg will be crippled, and you might as well drop it and go 100% embeddings.
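For reference, Reciprocal Rank Fusion over the two retrieval legs is simple to sketch. This assumes identical chunk IDs across both legs (the whole point above) and uses the commonly used k=60 smoothing constant; the chunk IDs are made up:

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: fuse several ranked lists of chunk IDs.
    Each chunk scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Embedding leg and keyword (BM25-style) leg over the SAME chunk IDs:
embedding_leg = ["c3", "c1", "c7"]
keyword_leg = ["c1", "c9", "c3"]
fused = rrf([embedding_leg, keyword_leg])
# "c1" ranks first: it scores highly on both legs
```

Because both legs rank the same chunk IDs, fusion is a dictionary merge; with mismatched chunk sizes you would first have to map one leg’s chunks onto the other’s, which is exactly the awkwardness described above.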

Your queries from the user will all be small. But this is where HyDE comes in, especially “HyDRA-HyDE”, where you spin many projections off the initial query and can create some beefy chunks for the correlation engines to reconcile.
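A minimal sketch of the HyDE idea, with the LLM and embedder stubbed out (`generate_hypothetical_docs` and `embed` are hypothetical placeholders for your model of choice): generate several hypothetical answers to the short query, embed each, and average them into one beefier query vector.

```python
import math

def generate_hypothetical_docs(query, n=4):
    """Hypothetical stub: in practice, ask an LLM for n plausible
    answer passages to the query (the HyDE step)."""
    return [f"Hypothetical answer {i} to: {query}" for i in range(n)]

def embed(text):
    """Stub embedder; swap in a real embedding model."""
    return [float(ord(c)) for c in text[:8].ljust(8)]

def hyde_query_vector(query):
    # Embed each hypothetical doc, then average into one query vector.
    vecs = [embed(doc) for doc in generate_hypothetical_docs(query)]
    dim = len(vecs[0])
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]  # scale to a unit vector

qvec = hyde_query_vector("How do I reset my router?")
```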

One thought I had about the “Million little facts” situation you proposed is clustering the facts by semantics (embeddings), creating larger bundles from these correlated facts, and then creating one large embedding vector for each bundle and one keyword document from the same bundle. This produces fewer vectors overall, and the semantic similarity would (hopefully) keep the AI model following the retriever on-message and coherent.

Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

PS. In the “Million little facts” situation where you bundle. If you concatenate the text for similar vectors in the bundle, say using the smaller dimensional embedding models, all you need to do is take the average of the vectors as the new vector representation (maybe scale it to unit vector too), since they are semantically related (close spatially), and BAM, now you have a fast vector leg (smaller dimensions) and a rich keyword document. It’s the best of both worlds! :scream_cat:
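A minimal sketch of the bundling trick, under stated assumptions: the greedy cosine-similarity clusterer and the 0.9 threshold are toy stand-ins (in practice you would use a real clustering algorithm and real embeddings). Each bundle keeps a concatenated keyword document plus the unit-normalized mean of its member vectors.

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def bundle(facts, vectors, threshold=0.9):
    """Greedily bundle semantically close facts. Each bundle keeps:
    - one concatenated keyword document (rich keyword leg), and
    - the unit-normalized mean of its member vectors (fast vector leg)."""
    bundles = []
    for fact, vec in zip(facts, vectors):
        for b in bundles:
            if cos(vec, b["centroid"]) >= threshold:
                b["text"] += " " + fact
                b["members"].append(vec)
                # recompute centroid as the running mean of members
                n, dim = len(b["members"]), len(vec)
                b["centroid"] = [sum(v[i] for v in b["members"]) / n
                                 for i in range(dim)]
                break
        else:
            bundles.append({"text": fact, "members": [vec],
                            "centroid": list(vec)})
    # final pass: scale each centroid to a unit vector
    for b in bundles:
        norm = math.sqrt(sum(x * x for x in b["centroid"])) or 1.0
        b["centroid"] = [x / norm for x in b["centroid"]]
    return bundles

# Toy example with made-up 2-D "embeddings":
facts = ["cats purr", "cats meow", "planes fly"]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
bundles = bundle(facts, vectors)
# The two cat facts merge into one bundle; "planes fly" stands alone.
```

Averaging works here precisely because the members are close spatially; averaging unrelated vectors would produce a centroid near nothing in particular.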



> Since I am thinking the LLM acts as a filter, and filters take big things with lots of information and bandwidth, and create smaller things with less bandwidth, I’m still heavily biased towards BIG data in and little data out approach, at least philosophically, based on past experiences and intuition.

I’ll have to think about this, it’s very intriguing. It’s the complete opposite of what I’ve seen: big seems to fail as attention drifts very painfully, especially when details really matter (e.g., code generation). Perhaps some type of blending of our approaches is the way to go?

Diverse ideas FTW. Will noodle.

Cool bit of prompt engineering, and in a way apropos.

I’ll post some more noisy papers that I run across in this thread. Anyone should feel free to move / repost them to the other one if they think it’s worth it. I may do so myself if discussion / feedback here warrants it.

In general, the guiding principle, IMHO, should be to make the other thread worthy of Watching for most folks.


So much cool on this Twitter thread. Once you start evaluating models layer by layer, all sorts of fascinating things reveal themselves.

Below is a graph from an ablation study of sorts on a 70B Platypus model. The map@3 metric is an eval metric on a particular task, and the graph shows performance when you only use layers 1-N of the model (the head is still attached, of course). The technique is called “early exit”.
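A toy illustration of the early-exit idea (this is not the actual eval setup, just a sketch of the mechanism): run only the first N layers of a layer stack, then apply the same head to whatever hidden representation you have so far, and sweep N to trace out the curve.

```python
def make_toy_model(n_layers=8):
    """Toy 'transformer': each layer nudges a scalar hidden state;
    the head maps the final hidden state to a score."""
    layers = [lambda h, i=i: h + 1.0 / (i + 1) for i in range(n_layers)]
    head = lambda h: h * 2.0
    return layers, head

def early_exit(layers, head, x, n):
    """Run only the first n layers, then attach the head."""
    h = x
    for layer in layers[:n]:
        h = layer(h)
    return head(h)

layers, head = make_toy_model()
# Score the same input using 1..8 layers: the "early exit" sweep.
curve = [early_exit(layers, head, 0.0, n) for n in range(1, 9)]
```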



The application I am thinking of is Customer Service / Hybrid-RAG (upper right corner).

So here it’s taking a wall of information, and shaping it to a probable answer or action. The fine-tune is to add tone/vocabulary in the output.

Here the attention and precision requirements aren’t as high as code generation. So yeah, maybe code generation is inverted, where you take specific lines or chunks and insert them into whatever active file.


No, I think you’re on to something and I suspect there is an opportunity for blending large and small context approaches if you have the budget for it.

For example, one idea might be to do some sort of review of a smaller precise answer. Sort of like - “Is the answer here consistent with the larger context?”
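One minimal way to sketch that review step (the prompt wording is a placeholder, and the context and answer strings are hypothetical): hand the larger context and the small precise answer to a second LLM pass and ask for a verdict.

```python
def build_review_prompt(large_context, small_answer):
    """Second-pass prompt: ask whether a precise answer produced from a
    small context is consistent with the larger context."""
    return (
        "You are reviewing an answer for consistency.\n\n"
        f"Larger context:\n{large_context}\n\n"
        f"Proposed answer:\n{small_answer}\n\n"
        "Is the answer consistent with the larger context? "
        "Reply CONSISTENT or INCONSISTENT, then explain briefly."
    )

prompt = build_review_prompt("...retrieved documents...", "...draft answer...")
# Send `prompt` to whichever LLM performs the review pass.
```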

1 Like

Edit: got some great papers I’d like to post, but I’m not sure it makes sense for me to just spam by myself here. I’m trying this on Discord again, this time on someone else’s channel. You can follow here: SciPhi. They are doing some very cool things around RAG @curt.kennedy

Will post some more here if some other folks do so, though.

1 Like

No worries mate!
You’re more than welcome to spam us with papers :laughing:

1 Like

Not sure there’s a lot of value there, diverse perspectives ftw. Will post though if others do.

Here I suggest Textbooks Are All You Need, which again focuses on the quality of the training datasets.

Here is another paper about Reinforcement Learning using AI feedback, showing the power of AI evaluation.

1 Like

Not a paper but very cool nonetheless.

This appears to be from some OpenAI presentation. Anyone have more information?

hat tip discord - SciPhi

1 Like

It is an intriguing line of thought. I had sort of expected more would be done along these lines, but Nvidia seems to have captured the market. I think the problem is you need a TSMC-type fab setup to do something more than a POC. Also, algorithm-wise there is so much still to be done and plumbed that you need a flexible substrate like GPUs to leverage, not to mention a research community with access to the hardware.

1 Like

I feel like nobody is talking about low-power neuromorphic solutions.

Look at the human brain … how much power does it take to run our brains?

Now look at LLMs … it takes massive amounts of power, with tons of heat generated, to run them.

So imagine small, interwoven nano-materials, operating at low power, running circles around the current most advanced LLMs.

So here, they are creating a dense physical set of nanowires, and training it!

Very novel, and probably where the future is headed.

Here is GPT-4-Turbo’s (gpt-4-1106-preview) answer:

Neuromorphic computing has the potential to be transformative for several key reasons:

  1. Energy Efficiency:
    Neuromorphic systems are designed to be highly energy efficient, emulating the low-power operation of the human brain. Traditional computing architectures consume a significant amount of power, especially for tasks such as image or signal processing, and machine learning. Neuromorphic chips can perform these tasks using much less energy, which is critical as we move toward more sustainable computing solutions and battery-powered devices.

  2. Speed and Real-Time Processing:
    Because neuromorphic computers process information in a manner similar to how neurons operate in the brain—using spikes of electricity—they can react to changes in data in real-time. This is particularly valuable for applications requiring immediate responses, such as autonomous vehicles, sensor networks, or robotic control systems.

  3. Parallel Processing Capabilities:
    The brain’s ability to perform massively parallel processing allows for the efficient handling of complex and noisy data. Neuromorphic computing seeks to replicate this, potentially providing much faster processing capabilities for tasks that involve pattern recognition, sensory data processing, and decision-making in unstructured environments.

  4. Adaptability and Learning:
    Neuromorphic chips are designed with plasticity in mind, meaning they can learn and adapt through mechanisms similar to synaptic plasticity in biological neurons. Such systems could continue to learn from their environment even after deployment, improving their performance over time and adapting to new situations without requiring explicit reprogramming.

  5. Edge Computing:
    With the proliferation of the Internet of Things (IoT) and edge devices, there is a growing need to process data at the source rather than sending it to the cloud. Neuromorphic systems could enable efficient on-device processing, leading to quicker responses and reduced data transmission costs while maintaining user privacy.

  6. Robustness to Noise and Faults:
    The brain is remarkably robust in the face of component failure or noisy signals. Neuromorphic computing systems could inherit similar robustness, gracefully dealing with hardware failures or data corruption, which would be highly beneficial for critical applications where reliability is paramount.

  7. Potential to Reshape AI and Machine Learning:
    The approach to artificial intelligence might see a paradigm shift with neuromorphic computing. The energy efficiency and continuous learning capabilities could allow for the deployment of AI in a wider array of environments, notably where power consumption or data privacy concerns currently limit the use of cloud-based AI services.

In summary, neuromorphic computing promises to create a new class of machines that can interact with the physical world in a more intelligent and efficient manner. Such technology has the potential to overcome some of the limitations of traditional computing architectures and support the continued growth of intelligent devices and systems, potentially leading to widespread technological innovation and changes in many sectors, including healthcare, transportation, industry, and consumer electronics.


About 12 to 20 W, depending on what you’re doing. That’s really efficient, but I don’t think it’s outside the realm of possibility of what may be achieved in our lifetimes :laughing:

Sure, but the one they built in the paper is actually not at the nanometer scale, it’s micrometers, so it’s fairly simple and easy to replicate :laughing:

1 Like

Whoo hoo! Yes!
Is it fair to say the bottom line is: ‘No, AGI isn’t here yet, but if you are willing to work with it, GPT-4 is capable of far more than naysayers realize’?
Haven’t read the whole thing yet (tl;dr), although it does look like a must-read. The key open problem here, IMHO, is how to get past the context size limitation.
Sure, LLMs are a great aid for solving little short sub-problems, but how do you support a long-term research project? For example: ‘Research lithium battery chemistry, find an open problem, and write a credible PhD thesis proposal on it.’ (Note: proposal, not thesis.)
Key issues are how to structure long-term activity and the large amounts of highly structured, relationship-rich external data uncovered along the way. I don’t think embeddings and/or RAG get you very far. But maybe, again ‘if you are willing to work with it’, we can get further than many suspect.

Finally found what I was looking for: EleutherAI on Discord. They have an active community commenting on arXiv papers. Not very well organized, but it has a lot of chatter. Check it out if you like geeking out over these things.

1 Like

Here is the blog entry for extracting training data and PII from ChatGPT:


So basically the LLM remembers some of the exact training data. This poses unique security challenges if any of the training data contains sensitive data. I don’t know much about “The Pile”, but I gather it was scraped from the internet, and has PII and other sensitive data.

One could say, “No big deal, I get it, the internet is full of this stuff.” But it’s something to think about when releasing a model to the public … the model holds small exact copies of your training data, and they can be extracted. The training data isn’t all compressed, lossy, and scrambled.

This could be a side-effect of overfitting (or local overfitting). But maybe the model just “burns in” a portion of its training data exactly. It’s not known, which makes it interesting.

A few ways to read this.

  1. Hmm, they really are just stochastic parrots!
  2. Wow - huge opportunities for the future of smaller models like Mistral and ORCA-2 (and o1-AI c34?), larger models are still massively overfitting data!
  3. Advances in training will enable larger models like GPT-n to ‘really’ learn deep physics from the 100x volume of video data, we don’t need to scale an order of magnitude beyond gpt4 in model size to get to the next level in model capability.

1 Like

I think this is true … but …

I remember as a math grad student, talking to my professor and head of graduate studies, who admitted to me that he basically memorized and parroted various proofs and concepts as the basic foundation of his research.

I can see this. You build a foundation by mimicking the past. But you “stand on the shoulder of giants”, tweak things, and now you have NSF grant money flooding in for your new breakthroughs.

At what point are humans not parroting things, especially the past and other foundational knowledge?

Well sure. I personally think the stochastic parrot claim is a problem primarily when people prefix it with ‘just a’.
However, I do think there is an argument that the presence of large chunks of memorized training data does (perhaps?) limit recombination, unless that same data also exists in other forms.
On the other hand, I have memorized SO MANY things over my years (‘I before E except after C or when sounded like A as in neighbor and weigh.’, 'Shall I compare thee to a summer’s day?, … etc etc). What does that say about me?


1 Like

I agree! :rofl: :rofl:

I feel like a parrot too sometimes … so saying “just a” is a bit disingenuous.

While also being parrots, I think what sets us humans apart is the massive amount of connections our brains have to other neurons, or clusters of neurons. We seem to be more networked, and not just limited to 16 hidden layers, or whatever.

This networking may explain our further emergence and advantage over current machine learning constructs.

For example, in graduate-level point-set topology, there was a standard problem: give an example where the subspace topology isn’t homeomorphic to the sub-order topology. Because I was preconditioned to lexicographic orderings in my undergrad years, I came up with a clean and clear example, which blew my professors away! They wanted the result published! :rofl:

But I got there through parroting. I just extrapolated a bit. :man_shrugging:

I feel like we are all parrots :parrot: :parrot: :parrot:

1 Like