Contextual prioritization with GPT-3.5

Hey friends,

I’m making an app that juggles lots of contexts before combining them into the prompt string. I’m finding I have to be judicious about the order of my contexts to keep the important ones at the front of “memory.”

For instance, if I were to pass this as a context:

“-User info: I am a bear. I eat honey. I steal picnic baskets. I believe in conspiracy theories, and that Elvis is still alive. My name is Mortimer.”

If I ask for my name, the AI might reply with it, or it might say “as an AI, I don’t have access to that information”. To vastly increase the likelihood that my most important context is remembered (my name is Mortimer), I put it at the front of the context, like so:

“-User info: My name is Mortimer. I am a bear. I eat honey. I steal picnic baskets. I believe in conspiracy theories, and that Elvis is still alive.”

GPT-4 processes short contextual payloads pretty well, but GPT-3.5 seems to require this kind of prioritization.

My question is, how can I more affirmatively prioritize contexts without a ballooning payload?
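
For concreteness, here’s roughly how I’m ordering things at the moment, as a simplified sketch (the priority numbers and the `build_context` helper are just for illustration, not my real code):

```python
# Simplified sketch of front-loading the most important fact.
# Priority values and the helper name are illustrative only.

user_facts = [
    (0, "My name is Mortimer."),  # most important -> goes first
    (1, "I am a bear."),
    (2, "I eat honey."),
    (3, "I steal picnic baskets."),
    (4, "I believe in conspiracy theories, and that Elvis is still alive."),
]

def build_context(facts):
    """Sort facts by priority so the most important ones sit at the front."""
    ordered = sorted(facts, key=lambda pair: pair[0])
    return "-User info: " + " ".join(text for _, text in ordered)

print(build_context(user_facts))
# -User info: My name is Mortimer. I am a bear. I eat honey. ...
```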

Hi and welcome to the Developer Forum!

This is where NLP methods and lessons come into their own. It seems to be no coincidence that models started performing so similarly to humans right about the time attention was introduced to transformer models; if not exactly mimicking human language processing, then certainly doing something very similar to it.

With that in mind, what part of a conversation with another person do you remember most? I’d argue the stuff right at the start and the stuff right at the end, with a bias towards the end; the stuff in the middle… not so much. LLMs seem to have a similar tendency, although generally with more accuracy, as their recall is usually much better than a human’s.

So, I’m assuming your examples are stylised and not representative of the actual prompts being used, as I would expect everything from a 10-word prompt to be handled correctly. If that were several thousand tokens’ worth of prompt, then yes, the middle would be less utilised, unless you make use of NLP “tricks”. YOU WILL PROBABLY REMEMBER THIS BIT!!! Right? Well, you’re programmed to, and as the model has read just about everything… so is it. Use the same methods you would use to get a person to pay attention. That’s pretty much it! You can also use ###marker blocks### to draw “attention” to a location.
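
To illustrate the idea with a rough sketch (the delimiters and wording below aren’t a fixed syntax, just one way of drawing attention to the part that matters most):

```python
# Illustrative only: emphasis plus ###marker blocks### around the
# highest-priority context, with the rest left as ordinary prose.
prompt = (
    "### IMPORTANT USER INFO ###\n"
    "My name is Mortimer.\n"
    "### END IMPORTANT USER INFO ###\n"
    "\n"
    "Other user info: I am a bear. I eat honey. I steal picnic baskets. "
    "I believe in conspiracy theories, and that Elvis is still alive."
)
```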

Thank you, that’s very helpful. You’re right, I’m not loading these contexts naturally, but silently and programmatically. When contexts are included they are generally quite important, but even among the contexts there are priorities.

I didn’t know about marker blocks, though, that is super helpful! I’ll have to do some research into that. :slight_smile:

1 Like

If the model has seen a common method used to draw the reader’s attention to a specific point, it will probably have encoded that into its network. The more """COMMON""" the method, the \\MORE/// it ----WILL---- have encoded it; 1) this and - This have more meaning than just this. It’s a bringing together of linguistics and language with computation, with an emphasis on the linguistics rather than the code. //although {CODE METHODS} are quite good for attention.

1 Like

Hah, I’m unlikely to get that deep in the weeds… I’m more into tricking than training! Right now I just categorize my contexts within the prompt string:

-User info: (one string)
-Assistant Info: (one string)
-Contexts: (several strings combined into a string)
-Vital Statistics: (several strings with largely numerical data combined into a string)

Then I combine these into the prompt string, plus the message to the chatbot.

I could probably just have optional marker block insertions into the context strings… :thinking:
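
Something like this is what I have in mind, as a rough sketch (the `mark` helper and the example strings are made up for illustration, not my actual code):

```python
# Sketch: combine the categorized contexts into one prompt string, with
# optional ###marker blocks### around the sections flagged as important.

def mark(label, text, important=False):
    """Wrap a section in marker blocks when it is flagged as important."""
    if important:
        return f"### {label} ###\n{text}\n### END {label} ###"
    return f"-{label}: {text}"

sections = [
    mark("User info", "My name is Mortimer. I am a bear.", important=True),
    mark("Assistant Info", "You are a helpful picnic-security consultant."),
    mark("Contexts", " ".join(["I eat honey.", "I steal picnic baskets."])),
    mark("Vital Statistics", "Height: 2.1 m. Weight: 320 kg."),
]

user_message = "What is my name?"
prompt = "\n".join(sections) + "\n\nUser message: " + user_message
print(prompt)
```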

Is there a place on the forum for discussing the syntactical cues you are talking about? This conversation has sparked a lot of questions that probably are tangential or off-topic.

With regard to the impact of ordering and “tricks”, I would argue that the sequential nature of the input into an LLM isn’t analogous to the sequence of sounds, words, and ideas a human processes. If I understand the mental model for cognitive processing correctly, the information the relevant parts of the brain receive during a conversation is constantly undergoing a chunking and abstracting process, and that process is fundamentally dependent on a sequence of time steps. How effectively information can be used after the brain has processed it is strongly associated with how well those two steps went. The reasoning, if I’m presenting it correctly, is that working memory is highly limited (a concept that comes up a lot when talking about how many numbers one can “hold” in their head), but working memory is very adaptable to the form of the information. For example, it is much harder to remember a random sequence of twenty characters after only a short review than it is to remember a sequence of five random four-letter names after only a short review.

To circle back to the sequential arrangement: one could argue that, of the five names, it would be easier to recall the first and last names than any of the three names in the middle. However, the case of five names is a very specific structure that isn’t comparable to all situations. The five names, arranged in a sequence, are five pieces of information that have very similar forms, are chunked in similar ways, and are abstracted in similar ways. So, consider the following sequence: “bear?”, bzzzzz, “RUN!”, GetFactoryAccessor(), :expressionless:. This is a totally valid sequence of natural language to read, maybe even to process multi-modally, and I doubt it would be strictly easier to remember “bear?” or :expressionless:, or that more or less attention would be given to any element in the sequence solely based on its order.

When thinking about how an LLM might process the same information, or how it might process larger bodies of text, it is important to assess the possible causality of input/output results with the understanding that the “attention” mechanism, in the transformer sense, is not influenced by or dependent on time. The first reason is that the model processes units of information in parallel, and any sequential relevance has to be precomputed before the transformer can make any inferences from the information. Once positional embeddings are calculated for a given input, the information isn’t processed in any particular order. The second reason is that the order in which these parallel processes complete (if I’m understanding the decoder layer correctly) doesn’t have an effect on the order of the output.

I would argue that a better analog for describing the difference in results for different orderings of the same information is the process of matching two fingerprints. Going back to the list-of-names example, you could take each name and annotate it with a list of positional information like “comes before Tim”, “is in a list”, “is the head of a sequence”, “is separated from Jackie by one name”, “is in a sub-set with Jules, Jackie, and Rees”, “all names come earlier than Parsha”, which gives the locations of Tim, Jackie, and Parsha but not of Rees and Jules. If this is the information one has available for predicting what name comes after Tim, it is reasonable to focus on Jackie being likely to come next and Parsha being unlikely to come next. However, if the list of names generated so far is Rees, Jules, Jackie, Parsha, and Tim, it would be reasonable to focus on different positional information and conclude that Tim is at the beginning of a new list, or even that Parsha comes next because all names come earlier than Parsha.
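
To make the “position is precomputed, attention runs in parallel” point concrete, here is a toy NumPy sketch. It is nowhere near a real model (random weights, no masking, single head), but it shows that order only enters through the positional encodings added up front, after which the attention scores for every position are computed in one matrix product rather than step by step:

```python
import numpy as np

# Toy illustration: position is injected as an embedding before attention,
# then attention over all positions happens in a single matrix product.
seq_len, d_model = 5, 8          # e.g. five "names" in a list
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))

# Sinusoidal positional encodings (the standard transformer recipe).
positions = np.arange(seq_len)[:, None]
dims = np.arange(d_model)[None, :]
angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
pos_encoding = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

x = token_embeddings + pos_encoding   # order information now lives inside x

# Single-head self-attention: every position attends to every other position
# at once; there is no loop over "time steps".
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V                  # computed in parallel for all positions
```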

When the model receives your user info and your question about the name assignment (“my name is”) in the user info, it could be that, during its training, it saw a lot of similar sentences written such that a name assignment like “my name is” comes at the beginning. If that is the case, it could be that those examples were written with the name at the beginning because there was a cultural norm of putting important information at the beginning of a list, and one could reasonably imagine that such a norm would be a natural accommodation for what people remember most easily. However, the model could instead be matching an even more common pattern, which I’m making up, where names given at the beginning of sentences are most often the correct answer to “What is my name?”, while names that come at the end of sentences refer to someone separate from the author of the sentence. If that happened to be the case, it would make sense why the model’s output didn’t try to give you information about what it inferred was another person.

So, to try and land the plane with something practical, here is an alternative way to get the desired output without reordering the list (although it probably seems impractical). The question you are using to ask about your name (which you didn’t provide, so I’m presuming a lot :sweat_smile:) could be made more specific, like, “What is my name, which is listed in the user info?”
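
In code, that might look something like this (a sketch only; I’m guessing at your setup, the model name is a placeholder, and the client call follows the current openai Python library, so adjust for whatever version you’re using):

```python
from openai import OpenAI

client = OpenAI()

# The user info keeps its original order; only the question is made specific.
user_info = (
    "-User info: I am a bear. I eat honey. I steal picnic baskets. "
    "I believe in conspiracy theories, and that Elvis is still alive. "
    "My name is Mortimer."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder
    messages=[
        {"role": "system", "content": user_info},
        {"role": "user", "content": "What is my name, which is listed in the user info?"},
    ],
)
print(response.choices[0].message.content)
```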

(please let me know if that made sense or not. I’m happy to edit it until it does make sense.)

1 Like

This is awesome. It is the foundation of a concept used for persona definitions called PList Style. Here is a very, very detailed description of how syntax patterns similar to yours can guide the model’s output. The wiki article specifically reviews Ali:Chat + PList Style.

Happy for the forum to be a place of learning and discussion on topics that impact developers. Primarily this is for discussing OpenAI projects and services, but I’m totally fine with topics that relate to them and add value, as I think this does.

Humans are able to remember effectively infinite stories with a high degree of accuracy; it’s the basis of the “mega memory” tricks people can learn for card-deck memorisation and the digits of Pi. While it’s true that unrelated-item memory is quite small, this “story” mode is vast. So while I agree with your point that ChatGPT is not doing the same thing as a continually cycling and re-evaluating brain, it’s spooky how closely it seems to mimic it, and its failure modes too.