Interesting Research Out of Anthropic on Long Context Prompting

Of course!

Lol, I guess I still find this on-topic enough, so I see no harm in it. And while my response applies to research more broadly, I do find the Anthropic research that started this topic to be the main example of why I mentioned it. Plus, considering this is a “here’s an interesting research paper. Thoughts?” kind of topic, I think it’s healthy and relevant discourse.

The Anthropic paper is a key piece about model behavior in its own right after all!

Thanks for the channel btw, I’ll definitely take a look! Gives me something to listen to as I program lol.

1 Like

For real. It’s pretty wild when these phenomena are relatable :flushed:. I guess the question is, is it truly a result of the underlying architecture, or is it mimicry of how we tend to read and respond to text? :thinking:

1 Like

I guess the question is, is it truly a result of the underlying architecture, or is it mimicry of how we tend to read and respond to text? :thinking:

And why aren’t people researching THESE questions? lol
Wouldn’t the answer to this provide so much more enriching information?

Hell if I know. Big tech seems much more busy editing their showcase demos lmao.

Who’s to say it might not be both? :thinking: Or maybe it’s mimicking or emulating something that we’ve yet to piece together fully about our own psychology or neuroscience? What makes the architecture special at this, where others fail? Why is size a factor in this ability? Is size as much a factor as we think it is? See, look at all these important questions this brings up :joy:.

@RonaldGRuckus sounds like it’s time for us to start a research proposal lmao :rofl:

1 Like

I’ve read before that responses can vary markedly depending on how an input is arranged within the token window. I’ve also read that attention appears to vary based on location as well. Coming from a medical rather than a coding/developer background, I find it challenging to understand why attention mechanisms treat different parts of the context window differently. Shouldn’t the recognition of long-range dependencies remain even throughout? My understanding is that, due to compute limitations, large amounts of data must be broken down for windowed attention, leading to unevenness in how different parts are considered. Is that right?

Well, what makes you assume that attention should be even throughout? That’s an honest question, because I’ve never really assumed humans exhibit that behavior, so I’ve never assumed that would be the case for LLMs either. Selective attention is a long-understood concept in psychology.

If it were to be even, would that even be considered something like attention?

You can think of it like areas of foci or emphasis.

Also, the way in which something is phrased has always changed its meaning and areas of focus. So yes, responses can vary markedly based on how something is phrased / how input is arranged. Comedy is probably the biggest example of how this is exploited.
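
If it helps to see why, here’s a toy sketch (made-up numbers, not any particular model): scaled dot-product attention turns query–key similarities into a softmax distribution over positions, and that distribution almost never comes out uniform.

```python
# Toy illustration of why attention weights are rarely uniform.
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # embedding size, made up for the sketch
query = rng.normal(size=d)         # the token currently "looking"
keys = rng.normal(size=(6, d))     # six positions it could attend to

scores = keys @ query / np.sqrt(d)                 # similarity to each position
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights

print(weights.round(3))   # far from uniform (uniform would be ~0.167 each)
```

Whatever the query happens to be most similar to gets the bulk of the weight, which is basically “areas of foci or emphasis” in miniature.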

2 Likes

Thanks for the reply, Macha. It occurs to me that dynamic attention mechanisms would also favor focusing on the most recent reply (recency bias), with prior dialogue given far less attention. My question was really about placing one large query into the window, such as “summarize this paper and extract salient points”. That one part might be given more emphasis than another, due less to its importance per se than to a peculiarity of the design or architecture, is intriguing. As a large input is tokenized and placed as vectors with embeddings, is it a matter of “first come, first served”? Anyway, I appreciate your reply. The learning curve is steep but I’m trying. :slight_smile:

No worries! That’s what we’re all here for! Everybody starts somewhere :slightly_smiling_face:

So, this is where people reference its processing as a “black box”. When each token is processed as a vector embedding and fed into a model, there’s a lot more data being represented in those vectors than meets the eye, like semantic representation of that token, etc.

What we don’t know is how that vector embedding ends up with the values it has, and why those values shape the model’s understanding of the input. So how it’s able to summarize a large text, we don’t fully know. Remember, we also can’t fully explain how text gets split into tokens, and there’s no good way to explicitly justify those decisions either. A large text may have whole phrases as a single token, or just parts of words as a token, which adds to the difficulty of drawing conclusions about how the model makes its decisions.
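
To make the tokenization part a bit more concrete, here’s a quick sketch using tiktoken’s cl100k_base encoding as a stand-in (the exact vocabulary differs by model, so treat the splits as illustrative only):

```python
# Sketch: how a BPE tokenizer splits text into pieces of uneven size.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # stand-in encoding, not model-specific
for text in ["summarize", "summarization", "antidisestablishmentarianism"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(text, "->", pieces)
# A common word may be a single token, while a rarer one splits into several
# sub-word pieces; the split is learned from data, not hand-designed.
```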

Then again, how can a human read a paper, and what causes them to summarize the text the way they do?

That’s very helpful, Macha. Fascinating how emergent behaviors arise from complex systems. I really appreciate you taking the time to reply! I’m exploring the intersection of multimodal models and their application to disabled populations, so I may not need that deep an understanding of the ‘how’. But practicing medicine, I figure it’s good to have an understanding of my eventual replacement. :laughing:

2 Likes

I think the real magic starts happening if/when they switch from absolute positional embedding to relative positional embedding like RoPE (ref). [Multiply in the complex domain, don’t add in the real domain … frequency shift, don’t add spurs.]

This should help the long context problem because absolute position is totally meaningless for massive chunks of text IMO.

PS, they might be doing this for all I know. But nothing is published anymore … so you can only assume at this point.

Point is, looking deeply at the positional embeddings is probably a smart thing to do here.
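
For anyone curious what “multiply in the complex domain” looks like in practice, here’s a bare-bones numerical sketch of RoPE (my own toy code following the RoFormer idea, not anyone’s production implementation):

```python
# Toy RoPE: pair up dimensions as complex numbers and rotate each pair by an
# angle proportional to the token's position. Attention scores then depend
# only on the *relative* distance between query and key positions.
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even length) as if it sits at position `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    xc = x[:half] + 1j * x[half:]               # pack dimension pairs as complex numbers
    xc = xc * np.exp(1j * pos * freqs)          # multiply = rotate (a frequency shift, no added spurs)
    return np.concatenate([xc.real, xc.imag])

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3 apart), very different absolute positions:
s1 = rope(q, 10) @ rope(k, 7)
s2 = rope(q, 110) @ rope(k, 107)
print(round(s1, 6), round(s2, 6))   # identical scores: only relative position matters
```

Same offset, same score, which is exactly what you want when the absolute index of a chunk buried deep in a long context carries no meaning.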

4 Likes

Now that is an interesting find! Pretty neat! Thanks curt!

Also, multi-modal models are a different ballgame because of the different types of input. IIRC, companies are also deploying different approaches to handle this.

Either way though, I’ve always been a big proponent that the “how” is a very important part of understanding something, and the deeper you understand it, the better! So, keep on asking questions, we’re more than happy to provide what we know (and can make educated guesses on)!

Plus, you probably won’t be replaced in the near future (or shouldn’t assume that), and you can still use what you know to do things like diagnose problems in how it processes something versus your own expertise. I’d also like to point out that a lot of this work is built, critiqued, and worked with by mostly abled populations, so understanding these kinds of applications, and what they do, with consideration for disabled populations is invaluable right now. Abled people cannot tell disabled people what is useful and helpful for them; it’s the other way around. The more we focus on how multi-modal models help disabled populations, and what can be done to make those models better for them, the better.

3 Likes

While we are Monday Morning Quarterbacking here … :rofl:

  1. Ensure the input tokens / attention heads ratio is similar to that of the performant smaller-context models.

  2. Perform “needle in the haystack” self-play training for deep recall (a rough sketch of what that training data could look like follows this list).

  3. Make sure your positional embedding strategy is solid and better supports the long context.
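
For point 2, something like this is the idea (my own toy generator for illustration, not anyone’s actual pipeline): bury one fact at a random depth inside a long filler context, then train and evaluate on recalling it.

```python
# Rough sketch of "needle in a haystack" self-play data generation.
# The needle and question here are just illustrative placeholders.
import random

def make_example(filler_sentences, needle, question, depth=None):
    """Insert `needle` at a given (or random) depth inside the filler context."""
    docs = list(filler_sentences)
    if depth is None:
        depth = random.randint(0, len(docs))   # vary depth so recall is trained at every position
    docs.insert(depth, needle)
    context = " ".join(docs)
    return {"prompt": f"{context}\n\nQuestion: {question}", "target": needle}

filler = [f"Background sentence number {i} about nothing in particular." for i in range(200)]
example = make_example(
    filler,
    needle="The best thing to do in San Francisco is to eat a sandwich in Dolores Park.",
    question="What is the best thing to do in San Francisco?",
)
print(example["prompt"][:120], "...")
```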

2 Likes

  1. Examine and ponder the training and reward models upon which the AI has been fine-tuned.
  • Does the AI have overtraining on following the latest chat input, ignoring the bulk of what came before?
  • Does the AI receive operational framing instructions first (“system message”), and has it been performing well at following them … or does it almost seem like the company invited users to place their own custom instructions there and then trained against following anything the users said? (A bare-bones illustration of that role framing follows this list.)
  • Does the AI enjoy answering solely from everything contextual and content-based being provided as one “role” message, as if the user said it?
  • Does the AI disbelieve anything the user says (“I can’t answer about files”, “I don’t have realtime data”), or show other overtraining that makes it useless?
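
For reference, that role framing looks roughly like this; a bare-bones sketch in the current OpenAI Python SDK style, with a placeholder model name and made-up instructions:

```python
# Bare-bones illustration of the message roles being discussed (openai>=1.0
# Python SDK style; the model name and message content are placeholders).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    # Operational framing the model is supposed to keep following:
    {"role": "system", "content": "You are a clinical documentation assistant. Answer concisely."},
    # Versus a large blob of context dumped in as if the user said it:
    {"role": "user", "content": "Here is the full paper text: ...\n\nSummarize the salient points."},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```
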
1 Like

@_j I’ve enjoyed GPT-3.5 far more given the ability to provide background data in the user instructions window at OAI’s page. The replies are contextually nuanced and more specific to my role as a medical clinician and MPH student. It also led to a more congenial tone and engaging experience. GPT-4, whatever added horsepower it brings, is a very different beast. As you note, it constantly reminds me that it’s not human, won’t comment on a variety of hypotheticals, notes its lack of access to real-time data, and can be quite prickly at times.

I ran a medical case by it to check the logic of my clinical approach. It warned me to protect the patient’s privacy. “But I’ve not included any personal data.” I asked again. It told me to consult an expert. I told it I am the expert. It replied that while it understands my perspective, I would still be wise to ask a specialist. Jesus, I am that specialist. Unmoved, it stuck to my reaching out to “other experts”. In the end, no answer, no opinion, just an admonition to protect privacy and always be sensitive.

And don’t get me started on Anthropic with its constant self-praise on being a kinder, more decent model… I went back to using 3.5. It just seems more flexible.

1 Like

If you want to look into your specific example of GPT-4 being overly, let’s call it, cautious, you can post it as a new topic in the Prompting category.
You can be sure that we have enough community members who will give it a crack, in the literal sense.

2 Likes

I think the way to think about these things is to start rigorously identifying which techniques and concepts from other fields have predictive power. If that can be demonstrated, then it’s interesting not only because it’s useful but because it potentially gives us more insight into two different fields.

While the sentiment behind this is nice, that’s quite a lot of ground to cover, and there aren’t a lot of people who are able to look across fields and bring them together. Academia isn’t even built for that, really.

As the Anthropic paper shows, it’s also difficult to identify which concepts from other fields carry relevance. Usually, there’s a guiding factor that allows people to investigate based on certain criteria, but here we don’t have that kind of luxury. So, multi-field testing across as many concepts as are currently understood in those fields is like throwing 100 boxes of cooked spaghetti at the wall, noodle by noodle. We don’t even know what could stick, nor the implications of what happens after we find that out.

I’ve talked about this more in depth, I think, in other forum discussions, so I won’t get into too much detail here. There are just a lot of things that make this extremely complex. For one, the things that work might be more dangerous, or have more dangerous implications, than we’d realize. The other is that, not only are we rolling the dice on that outcome, but where’s the incentive to open-source any of it, i.e., make it public research?

If any technique provides an advantage or allows someone to use these models in a special way, especially as these systems develop and evolve, and the technique persists, why not just keep the technique(s) to yourself and leverage it? It’s still very much a new tool, and if you can use the tool in ways others can’t, that becomes extremely useful one way or another; you can carve out your own business for yourself. Hell, that’s what a lot of folks are doing already.

Regardless, it’s going to be really interesting to see how things like this play out in the future. The best we can do is explore and pioneer while we’re here.

2 Likes

Thank you for the contribution!

Whenever possible, please link to the original paper rather than a blog.

1 Like

Aha! Understood, thanks. :+1:

1 Like

No worries!

I just find that, almost universally, it’s best to provide others with the most direct resource to the original source—pure and unfiltered.

While many people will find another’s summary or overview more accessible, often something can be lost in translation or misrepresented, so I prefer to give them the raw (or as close to raw) data as possible and let them look for commentary around it if they want.

MarkTechPost is better than most in that they always include links to the source material at the bottom of their articles, but I still find it preferable to link directly to the paper (or in the case of Anthropic here, the official blog post).

If you read both and found the write-up helpful to your overall understanding, by all means share both and let people know! But I do suggest trying to always include a link to the original work first and letting the authors have a chance to “say it in their own words,” if that makes any sense?

2 Likes