Foundational must read GPT/LLM papers

I’m updating a list of papers on LLMs for AGI, making it more readable, sortable, and understandable.

A funny market on Manifold :slight_smile:

image

There’s a bunch of AI folks on Manifold trying to get LLMs to predict the future, which is pretty interesting. Hopefully some folks will post more LLM papers to the market above.

Great paper, hat tip @N2U

https://arxiv.org/html/2405.00332v1

GSM1k is a new test set designed to closely mimic GSM8k, to help check whether models are overfitting. And guess what…

2 Likes

Thank you! :laughing:

On the same note, I find this article to be a slightly more humorous take on exactly the same issue:

(I’ll extend the hat-tipping to @Diet, who originally alerted me to this joke article.)

Ngl, I’m considering just making my own extended version of the most common benchmarks, like they did with GSM-1k, and not publishing anything so I can actually use the results, because most available models do suffer a bit from “benchmark pollution” :sweat_smile:

1 Like

Choosing which benchmarks to do this with is up for discussion, though.

There are a lot of simple benchmarks that models do extremely well on, so it could be interesting to see if there’s any regression when using new, unseen questions. On the other hand, these benchmarks also have very competitive leaderboards, meaning Goodhart’s law applies:

What are your thoughts on this?

2 Likes

Better & Faster Large Language Models via Multi-token Prediction

2 Likes

A fun current multi-token experiment: use the JSON response format, and extend your response’s max_tokens one token at a time. Watch history be altered in the logprobs.

There’s no good modern embeddings benchmark: one for information retrieval that demonstrates state-of-the-art document chunking for AI context placement on non-public documents typical of RAG or search workloads.

Unlike a math problem, where an answer is simply right or wrong, a model’s input-to-chunk scoring is subjective: it first needs an ideal ranking of thousands of chunks against every stimulus, with AI providing insights a human might not have considered. Even formulating a target for clickworkers, acting as human embeddings models, to meet is a challenge, and they would have to be domain experts, not refugee-camp teens.

1 Like

That sounds like a very interesting experiment, I might try that!

I’m guessing you’ve already tried, is there any interesting behavior worth noting?

Yeah, this is a problem. The approach I’m currently using for benchmarking retrieval isn’t perfect either: I take each chunk and generate a question that can be answered by the text within it, then consider only the top-ranked result. If the retrieved context isn’t the one used to generate the question, I count it as “wrong”, even though it may contain insights that answer the question.
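A minimal sketch of that evaluation loop, with a toy bag-of-words scorer standing in for a real embeddings model (all names here are illustrative, not from any library):

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real setup would call an embeddings model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Counter returns 0 for missing keys, so the dot product just works.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top1_accuracy(chunks: list[str], questions: list[str]) -> float:
    """questions[i] was generated from chunks[i]; a question scores 1 only if
    the top-ranked chunk is its own source chunk (the strict criterion above)."""
    chunk_vecs = [embed(c) for c in chunks]
    correct = 0
    for i, q in enumerate(questions):
        qv = embed(q)
        best = max(range(len(chunks)), key=lambda j: cosine(qv, chunk_vecs[j]))
        correct += best == i
    return correct / len(questions)

chunks = ["the eiffel tower is in paris", "pandas eat bamboo in china"]
questions = ["where is the eiffel tower", "what do pandas eat"]
print(top1_accuracy(chunks, questions))  # 1.0 on this toy pair
```

As noted, the strict top-1 criterion undercounts: a retrieved chunk other than the source may still answer the question, so this is a lower bound on useful retrieval.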

It’s harder to come up with a presentation format than to write the code.

Here max_tokens is shown in parentheses, and each token is enclosed between a backslash and a forward slash, for a series of trials.

(4)
\/\
/\ /\ /\{
/
----------------------
(5)
\/\
/\ /\ /\ {
/\   /
----------------------
(6)
\/\
/\ /\ /\ {
/\   /\ "/
----------------------
(7)
\/\
/\ /\ /\ {
/\   /\ "/\name/
----------------------
(8)
\/\
/\ /\ /\{
/\   /\ "/\Chat/\G/
----------------------
(9)
\/\
/\ /\ /\ {
/\   /\ "/\name/\":/\ "/
----------------------

The generation can look back, changing `name` into `ChatG` and then into `name": "`.

1 Like

Thank you!

Those were some interesting results!

I think the takeaway is essentially that the results of any study that doesn’t list the max_tokens value used for generation should be taken with a grain/mountain of salt :thinking:

‘Overfitted’. Hmm. I’d like to unpack this a bit.
First, Great paper IMHO.
Now, let’s drop all our preconceptions/prejudices around that word.
Let’s just say:

‘results on a dataset of questions that may have been in the training data are higher than results on a dataset of similar questions less likely to have been in the training data.’

That is what this measures, yes?

Now, given that my favorite (subjective rating) models for advanced reasoning tasks are Claude Opus, GPT-4, Mixtral-8x22B-Instruct (and llama3-70B-Instruct, but that isn’t here), and maybe phi-3 among small models, what should my takeaway be? As a model-trainer? As a user?

Not clear to me… when in doubt choose to be extreme? :slight_smile:
Except I wish our ‘benchmarks’ were stochastic generators, not static datasets…

1 Like

Yes, that seems to be the correct interpretation.

From a user perspective, I’d say it’s just “benchmarks aren’t a definitive measure of performance and should be taken with a grain of salt,” but from a model training perspective, I’d say that it shows that larger models are better at generalizing what they’ve learned. :laughing:

It seems to be a misapplication of nomenclature.

An overfitted model can’t infer because of its adherence to formulaic token generation, having been rewarded more deeply during training on an entailment pattern than desired.

These subject models that might have benchmark contamination are not overfitted unless they cannot deviate well from the known inputs to training. It would take comparing with and without the contamination to know if that is the case, where in fact it is likely the case that reasoning is improved, just not as much as in repeating back the exact token sequences.

A model is not overfitted just because it can recite Shakespeare yet not imagine Shakespeare to 100%.
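To make that distinction concrete, here is a purely illustrative toy: a "model" that has memorized its training pairs looks perfect on a contaminated benchmark, yet cannot deviate from known inputs at all, and that inability to deviate is what "overfitted" actually names.

```python
# Toy memorizing "model": a lookup table over its training pairs.
train = {"2+2": "4", "3+5": "8"}

def memorizer(prompt: str) -> str:
    # Perfect on anything seen during training, useless otherwise.
    return train.get(prompt, "???")

print(memorizer("2+2"))  # "4"   -> aces a contaminated benchmark
print(memorizer("2+3"))  # "???" -> cannot generalize at all
```

A model that recites benchmark answers but still solves novel variants (just at a lower rate) is contaminated, not overfitted in this sense.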

2 Likes

In-Context Learning with Long-Context Models: An In-Depth Exploration

Is there any reason to fine-tune anymore? Just RAG your fine-tuning dataset to use as ICL!

https://arxiv.org/pdf/2405.00200

1 Like

Models where achieving the desired results can cost $1.00+ of in-context input are one reason to fine-tune…

Ah yes. Forgot not everyone has 4 x RTX 6000 under their desk…

But seriously, in a follow-up comment she says you can re-use attention blocks. I know, not on existing cloud APIs.

More usefully, I’ve found that 3–4 examples do wonders for most prompts…

1 Like

Very cool paper out of Stanford/Google

Given input length n, previous works have shown that constant depth transformers with finite precision poly(n) embedding size can only solve problems in TC0 without CoT.

This is fascinating to me atm because 4o seems to be almost at 3.5 level of ‘without CoT’ capability. Did they reduce the depth in order to make it cheaper and faster?

https://twitter.com/JoshPurtell/status/1790102029773246861

What other capabilities will be degraded because of this?

1 Like

As a statistician I have a huge problem with the benchmark as used and reported.

Performing 3 trials, with success defined as passing at least 1, doesn’t give enough statistical power to support the claims being made.

As a simple example, say one model’s true rate of success at a given level is 40%; then there’s a 21.6% chance that model will fail all three trials at that level. If another model has a true rate of success of 20% at that same level, there’s a 48.8% chance the second model will pass at that level. All else being equal, that gives us about an 11% chance of ranking the second model above the first.
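The arithmetic above can be checked directly for the pass-at-least-1-of-3 protocol:

```python
# Probabilities under a "pass at least 1 of 3 trials" success criterion.

def p_fail_all(true_rate: float, trials: int = 3) -> float:
    """Chance the model fails every trial at a level."""
    return (1 - true_rate) ** trials

def p_pass_any(true_rate: float, trials: int = 3) -> float:
    """Chance the model passes at least one trial at a level."""
    return 1 - p_fail_all(true_rate, trials)

strong_fails = p_fail_all(0.40)   # 0.6^3 = 0.216  -> 21.6%
weak_passes = p_pass_any(0.20)    # 1 - 0.8^3 = 0.488 -> 48.8%
flip = strong_fails * weak_passes # ~0.105: weaker model ranked above stronger

print(strong_fails, weak_passes, flip)
```

With more trials per level, the flip probability shrinks quickly, which is the case for the 20–50 trial protocol.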

It would be much more informative and valuable to perform 20–50 trials at each level, continuing until a model has zero successes for two consecutive levels.

I’ll look at it a bit later or possibly tomorrow and possibly re-run their benchmark with more replications.

1 Like

I imagine he re-ran the test a few times, so I doubt that’s an issue.

One thing I didn’t like was breaking at failure, though, so I tried changing that. Results mostly hold up for gpt-4o, though the success at 30 is interesting.

{10: {'prcntg_trials_passed': 0.67}, 15: {'prcntg_trials_passed': 0.67}, 20: {'prcntg_trials_passed': 1.0}, 25: {'prcntg_trials_passed': 0.0}, 30: {'prcntg_trials_passed': 1.0}, 50: {'prcntg_trials_passed': 0.0}, 75: {'prcntg_trials_passed': 0.0}, 85: {'prcntg_trials_passed': 0.0}}

gpt-4
{10: {'prcntg_trials_passed': 1.0}, 15: {'prcntg_trials_passed': 1.0}, 20: {'prcntg_trials_passed': 1.0}, 25: {'prcntg_trials_passed': 1.0}, 30: {'prcntg_trials_passed': 1.0}, 50: {'prcntg_trials_passed': 1.0}, 75: {'prcntg_trials_passed': 0.33}, 85: {'prcntg_trials_passed': 0.67}}

gpt-4-turbo
{10: {'prcntg_trials_passed': 1.0}, 15: {'prcntg_trials_passed': 1.0}, 20: {'prcntg_trials_passed': 1.0}, 25: {'prcntg_trials_passed': 0.33}, 30: {'prcntg_trials_passed': 0.67}, 50: {'prcntg_trials_passed': 0.67}, 75: {'prcntg_trials_passed': 0.33}, 85: {'prcntg_trials_passed': 0.33}}

Hmm, I tried changing the seed as well: gpt-4o gets different results, but the same overall performance:

{10: {'prcntg_trials_passed': 0.67}, 15: {'prcntg_trials_passed': 0.67}, 20: {'prcntg_trials_passed': 1.0}, 25: {'prcntg_trials_passed': 0.0}, 30: {'prcntg_trials_passed': 0.67}, 50: {'prcntg_trials_passed': 0.0}, 75: {'prcntg_trials_passed': 0.0}, 85: {'prcntg_trials_passed': 0.0}}

gpt-4-turbo
{10: {'prcntg_trials_passed': 1.0}, 15: {'prcntg_trials_passed': 1.0}, 20: {'prcntg_trials_passed': 0.67}, 25: {'prcntg_trials_passed': 0.67}, 30: {'prcntg_trials_passed': 0.67}, 50: {'prcntg_trials_passed': 1.0}, 75: {'prcntg_trials_passed': 0.33}, 85: {'prcntg_trials_passed': 0.33}}

gpt-4
{10: {'prcntg_trials_passed': 1.0}, 15: {'prcntg_trials_passed': 1.0}, 20: {'prcntg_trials_passed': 1.0}, 25: {'prcntg_trials_passed': 1.0}, 30: {'prcntg_trials_passed': 1.0}, 50: {'prcntg_trials_passed': 1.0}, 75: {'prcntg_trials_passed': 0.67}, 85: {'prcntg_trials_passed': 0.67}}

And for fun I fiddled with prompt placement and changed the names to AAA,BBB,CCC,DDD … same results.

1 Like
