Foundational must read GPT/LLM papers

I've recently been working on some RAG (retrieval-augmented generation).

[2307.03172] Lost in the Middle: How Language Models Use Long Contexts
[2004.04906] Dense Passage Retrieval for Open-Domain Question Answering
[2309.09117] Contrastive Decoding Improves Reasoning in Large Language Models
[2209.10063] Generate rather than Retrieve: Large Language Models are Strong Context Generators
[2304.14856] A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning
Curious idea - [2212.02027] Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer
[2004.12832] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Combining Embedding and Keyword Based Search for Improved Performance | by Zachariah Zhang | Medium
[2308.14963] Vector Search with OpenAI Embeddings: Lucene Is All You Need
[2212.09146] Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model
[2305.15294] Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy
[2205.01230] Retrieval-Enhanced Machine Learning

One big takeaway was the power of blending BM25, TF-IDF, and DPR / embedding-based retrieval. Different strategies can be used to combine them, such as reciprocal rank fusion. One always-important, though inevitably challenging, task is defining evaluation criteria for potentially training your retriever.
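For anyone who hasn't seen reciprocal rank fusion, here is a minimal sketch of the idea. Each document gets 1 / (k + rank) from every ranked list it appears in, and the scores are summed. The function name, the toy document ids, and the three example rankings are made up for illustration; k = 60 is just the commonly used default, not something tuned for any dataset.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; treat it as an assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of three retrievers for the same query.
bm25_ranking = ["d1", "d2", "d3"]
tfidf_ranking = ["d2", "d1", "d4"]
dense_ranking = ["d2", "d3", "d1"]

fused = reciprocal_rank_fusion([bm25_ranking, tfidf_ranking, dense_ranking])
print(fused)  # ['d2', 'd1', 'd3', 'd4']
```

The nice property is that it only needs ranks, so you never have to calibrate BM25 scores against cosine similarities.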

Another is that by precomputing things such as TF-IDF vectors or sentence embeddings, you can achieve significant speedups. See the ColBERT paper above for other approaches to this.
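A rough sketch of what the precomputation point means in practice, using a tiny pure-Python TF-IDF index. The document vectors are built and normalized once, and each query then only costs a sparse dot product per document. All function names and the three toy documents are hypothetical, and the smoothed IDF formula is one common variant, not the only one.

```python
import math
from collections import Counter


def _normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def build_tfidf_index(docs):
    """Precompute IDF weights and unit-length TF-IDF vectors once, offline."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        _normalize({t: tf * idf[t] for t, tf in Counter(tokens).items()})
        for tokens in tokenized
    ]
    return idf, vectors


def search(query, idf, vectors, top_k=3):
    """Query time: vectorize only the query, then reuse the stored vectors."""
    counts = Counter(query.lower().split())
    q = _normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scores = [
        (sum(w * vec.get(t, 0.0) for t, w in q.items()), i)
        for i, vec in enumerate(vectors)
    ]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))[:top_k]
    return [i for score, i in ranked if score > 0]


docs = [
    "dense passage retrieval for question answering",
    "late interaction over bert",
    "lucene vector search",
]
idf, vectors = build_tfidf_index(docs)  # done once, can be cached to disk
print(search("vector search", idf, vectors))  # [2]
```

The same pattern applies to sentence embeddings: embed the corpus once, store the vectors, and only embed the query at request time.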

Two papers above, while not particularly ‘foundational’, I think capture some key constraints of RAG quite well: “Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model” and the curious paper “Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer”. The former makes it quickly apparent that there is a clear tension between the two main architectural components, and the latter, as a potential and novel resolution, helps bring color to the discussion.



Amazing read! I am trying out the prompt examples myself as explained in the experiment, and it gives slightly different answers, but it definitely proves their point that ChatGPT 4 is thinking, or at least looks like it is thinking, like a human. Very impressive indeed! Thanks for sharing.

Well-cited paper. A lot of assumptions are made about reduced, high-quality datasets being superior based on this paper. I haven’t seen much in the way of follow-up evidence, however. One possibility is that this paper got lucky and overfit on the evals.

I played around with the Phi-1.5 LLM, and tbh, it didn’t seem particularly superior. Intuitively the concept makes sense: high-quality training data leads to higher-quality output. But I’m a bit surprised we haven’t seen much more along these lines.

There was this paper in the reverse google cite, but it doesn’t really state superiority, more just a method for instruction mining.

It’s an interesting problem. Assuming there is a way to evaluate good data, the next step would be synthetic RAG of sorts used to train an LLM. Some method of back propagating that would be useful, though not sure how to do it through a retriever.

Something like this:


I came across a paper that says LLMs can do time series forecasting. Is that really possible given the architecture they are built on? The paper is here


“Foundational?” Maybe not, but I suspect this obvious (in hindsight) technique for fine-tuning will be universally adopted as the word spreads.


Interesting paper from Google, talks about challenges in generalizing to tasks outside pretrained data.

That said:

“Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of (x, f(x)) pairs rather than natural language.”

The question then becomes whether there is intelligence embedded in word embeddings that LLMs are somehow utilizing to generalize better.

From the conclusion:

An important question is understanding how the observations we make here carry over to tokenized models and to questions represented in natural language. We attempted an experiment to train a tokenized model for the one-dimensional examples presented in Section 4 by binning the scalar values into buckets, and treating the bucket indices as tokens for the input to a transformer-based language model. We trained this model for 5M epochs with a cross-entropy loss as typically used in language models, but were unable to significantly decrease the loss. Understanding the challenges to training such a model and evaluating whether this framing has different model selection or out-of-distribution generalization properties is important future work.
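To make the "binning into buckets" step in that quote concrete, here is a minimal sketch of how scalar (x, f(x)) pairs could be turned into token ids. The paper excerpt does not give the value range, the number of buckets, or the function used, so the range [-1, 1], the 8 buckets, and f(x) = x² below are all assumptions for illustration only.

```python
def make_bucketizer(low, high, num_buckets):
    """Return a function mapping a scalar in [low, high] to a bucket index."""
    width = (high - low) / num_buckets

    def to_token(x):
        idx = int((x - low) // width)
        # Clamp so values at or beyond the edges fall into the first/last bucket.
        return max(0, min(num_buckets - 1, idx))

    return to_token


# Assumed setup: values in [-1, 1] split into 8 equal-width buckets.
to_token = make_bucketizer(-1.0, 1.0, 8)

# Toy (x, f(x)) pairs with f(x) = x**2, flattened into one token sequence.
pairs = [(-1.0, 1.0), (0.0, 0.0), (0.5, 0.25)]
tokens = [t for x, y in pairs for t in (to_token(x), to_token(y))]
print(tokens)  # [0, 7, 4, 4, 6, 5]
```

With a coarse grid like this, many distinct values collapse onto the same token, which may be one reason such a setup is hard to train with cross-entropy.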


Online dynamical learning and sequence memory with neuromorphic nanowire networks

Nanowire Networks (NWNs) belong to an emerging class of neuromorphic systems that exploit the unique physical properties of nanostructured materials. In addition to their neural network-like physical structure, NWNs also exhibit resistive memory switching in response to electrical inputs due to synapse-like changes in conductance at nanowire-nanowire cross-point junctions. Previous studies have demonstrated how the neuromorphic dynamics generated by NWNs can be harnessed for temporal learning tasks. This study extends these findings further by demonstrating online learning from spatiotemporal dynamical features using image classification and sequence memory recall tasks implemented on an NWN device. Applied to the MNIST handwritten digit classification task, online dynamical learning with the NWN device achieves an overall accuracy of 93.4%. Additionally, we find a correlation between the classification accuracy of individual digit classes and mutual information. The sequence memory task reveals how memory patterns embedded in the dynamical features enable online learning and recall of a spatiotemporal sequence pattern. Overall, these results provide proof-of-concept of online learning from spatiotemporal dynamics using NWNs and further elucidate how memory can enhance learning.


Very very exciting stuff … if you can ignore the not so subtle fear mongering (covid, meth … fun.)


search youtube:
llm summaries echohive
for arxiv LLM papers summarized every hour


Fun weekend. But life goes on …

I think this follows from the stochastic parrot camp. LLMs can generate correct outputs easily enough (they’ve been trained on them) but reasoning capability required for actually finding mistakes is perhaps another level.


I guess this now can be somewhat foundational.

Nonetheless, one major effect of ChatGPT’s release was to spark a sense of urgency inside major tech companies. To avoid falling behind OpenAI amid the wave of customer enthusiasm about chatbots, competitors sought to accelerate or circumvent internal safety and ethics review processes, with Google creating a fast-track “green lane” to allow products to be released more quickly…

A different approach to signaling in the private sector comes from Anthropic, one of OpenAI’s primary competitors. Anthropic’s desire to be perceived as a company that values safety shines through across its communications, beginning from its tagline: “an AI safety and research company.”

I can see why that would upset Sam, though not sure it was worth trying to boot Toner over it. Certainly considering the outcome. Too bad they couldn’t find a mediator type board member to join.

Thread won’t let me post more, soooo:

Leaderboard here - GAIA Leaderboard - a Hugging Face Space by gaia-benchmark

The top score is 14% for GPT versus 92% for humans.


Extracting training data from LLMs by exploiting model divergence. Kinda scary.


I don’t see a paper out yet, but Microsoft announced Phi-2 yesterday.

For a 2.7B-parameter model it gets some very impressive results in synthetic benchmarks.

Between Phi-2, Magicoder, and Mixtral, I think the groundwork has been laid to see a pretty shockingly good coding model capable of being run on a beefy consumer card.

Essentially, I am imagining a foundation model like Phi-2 being fine-tuned for instruction and code as was done in Magicoder, but used as a core model in a mixture-of-experts model a la Mixtral. Basically, “PhiCoder 8x3B.” Depending on the number of active parameters during inference, the proportion of shared parameters, and the quantization level such a model could be squeezed into an RTX (3/4)090.

I suspect (hope?) at the rate things are progressing, we’ll see open source models fitting into 24GB VRAM out-performing GPT-4 in coding tasks sometime during 2024.
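A back-of-the-envelope sketch of the "squeeze it into 24GB" arithmetic. It only counts weight memory and ignores the KV cache, activations, and quantization overhead, so treat the numbers as a lower bound. The function name and the 30% shared-parameter fraction are assumptions for illustration, not figures from any released model.

```python
def moe_weight_gib(expert_params_b, num_experts, shared_fraction, bits_per_weight):
    """Estimate weight memory (GiB) for a mixture-of-experts model.

    expert_params_b: parameters of one dense base model, in billions.
    shared_fraction: share of those parameters reused by all experts
                     (e.g. attention layers); an assumed value here.
    """
    shared = expert_params_b * shared_fraction
    per_expert = expert_params_b - shared
    total_params_b = shared + num_experts * per_expert
    total_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical "PhiCoder 8x3B": 2.7B base, 8 experts, 30% shared weights.
for bits in (16, 8, 4):
    gib = moe_weight_gib(2.7, 8, 0.3, bits)
    verdict = "fits" if gib < 24 else "does not fit"
    print(f"{bits}-bit: {gib:.1f} GiB of weights, {verdict} in 24 GiB")
```

Under these assumptions the weights only come in under 24 GiB once quantized to 8 bits or fewer, which is why the quantization level matters so much in the estimate.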




We shall browbeat the ASI into submission via cutesy graphics!


Great, but … details? :stuck_out_tongue:

I can just imagine the convo:

“we need a plan to make AI safe”

“how about something that tells us when AI isn’t safe”

“great idea!”

Still, a good start. Keep going!!

There was a recent paper from DeepMind.

Couple of interesting rebuttals:

and more detailed:

In truth, I found the paper OP very interesting. Not so much for what it did, but how it didn’t accomplish its goals. Some of the greatest minds in the world are working on this problem - solving novel and compelling math problems via AI - and the above is how they ended up doing it. That to me is very interesting news.

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

I wish they had said something about how they did entity identification.


Looks like a game changer for RLHF and possibly superintelligence alignment.


Hi, I’m maintaining a table of papers related to LLM agents.
Please share your ideas and any other interesting papers.

I’m strongly interested in how human intelligence can be simulated with LLMs.

Target fields: Robotics, Reasoning, Agent, Reinforcement learning, Prompt engineering, CoT, ICL, Multimodal LLM (LiDAR), Instruction tuning, VQA, Data generation, Driving, Feedback, VLM, PEFT, RLHF
