Location of RAG context within system prompt

Hey Everyone.

The traditional RAG approach when working with OpenAI models tends to be this:

[
  {
    "role": "system",
    "content": "General instructions... ...use the following information only to answer the user's question: [INJECTED_CONTEXT]"
  },
  {
    "role": "user",
    "content": "user question"
  },
  {
    "role": "assistant",
    "content": "lovely RAG answer"
  }
]

This is how I’m doing it currently: as each new question comes in, I just update the system prompt (i.e. array[0].content) to inject the new context.
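A minimal Python sketch of that update step, assuming the conversation is held as a plain list of message dicts. `SYSTEM_TEMPLATE` and `with_refreshed_context` are hypothetical names used only for illustration; retrieval and the API call itself are left out.

```python
import copy

# Hypothetical template mirroring the system prompt shown above.
SYSTEM_TEMPLATE = (
    "General instructions... use the following information only "
    "to answer the user's question: {context}"
)

def with_refreshed_context(messages, context):
    """Return a copy of the conversation whose first (system) message carries the new context."""
    updated = copy.deepcopy(messages)
    # Overwrite array[0].content; every other turn is left untouched.
    updated[0] = {"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)}
    return updated

history = [
    {"role": "system", "content": SYSTEM_TEMPLATE.format(context="old context")},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
refreshed = with_refreshed_context(history, "chunks retrieved for the second question")
```

Working on a copy keeps the stored history unchanged, so the same history can be re-sent with different retrieved context if needed.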

It works well; however, I was wondering if anyone had tried the following and whether they got better results:

So… imagine the conversation has gone on for a bit. Instead of updating the initial system prompt at the top of the array, you just add another one containing the retrieved context.

For example:

[
  {
    "role": "system",
    "content": "General instructions..."
  },
  {
    "role": "user",
    "content": "user question"
  },
  {
    "role": "assistant",
    "content": "assistant response"
  },
  {
    "role": "user",
    "content": "user question"
  },
  {
    "role": "system",
    "content": "Use the following information only to answer the user's question: [INJECTED_CONTEXT]"
  },
  {
    "role": "assistant",
    "content": "lovely RAG answer"
  }
]

My theory is that this potentially leads to better grounding of the answers in the context. The downside is that I have to remove the previous context-carrying system messages from the array as the user asks new questions (or else I’d end up with multiple contexts, which would confuse the LLM).
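A rough sketch of that bookkeeping in Python, assuming the latest user question is already the last element of the list and that context messages can be recognised by a fixed prefix. `CONTEXT_PREFIX` and `inject_trailing_context` are hypothetical names; a real implementation might tag context messages some other way.

```python
# Hypothetical marker used to recognise context-carrying system messages.
CONTEXT_PREFIX = "Use the following information only to answer the user's question: "

def inject_trailing_context(messages, context):
    """Drop stale context messages, then append a fresh one after the latest user turn."""
    kept = [
        m for i, m in enumerate(messages)
        # Always keep the general instructions at index 0.
        if i == 0 or not (m["role"] == "system" and m["content"].startswith(CONTEXT_PREFIX))
    ]
    kept.append({"role": "system", "content": CONTEXT_PREFIX + context})
    return kept

history = [
    {"role": "system", "content": "General instructions..."},
    {"role": "user", "content": "first question"},
    {"role": "system", "content": CONTEXT_PREFIX + "context A"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
messages = inject_trailing_context(history, "context B")
```

Only one retrieved context is ever present in the array that gets sent, and it always sits directly after the most recent user question.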

Has anyone tried it and did you notice a difference?

I think you’re on the right-ish track.

I’ve taken to putting a system message as the last message after the user query. I don’t put much into it, apart from schema instructions.

I see it like this: the very last handful of tokens dictate what the model “focuses” on, and what information the model should pull out of its context. And when you steer that focus, you can pull pretty much anything out of anywhere in the context as long as it doesn’t conflict with the model’s training data.

You may have seen these haystack experiments:


https://arxiv.org/html/2404.08865v1

Anecdotally, I’d say it’s more important to ensure that the context is short, clean, and relevant. The positioning of reference information isn’t that important, unless you’re trying to break the model’s training (e.g. get it to not do markdown or not act like a chatbot).

So yes: I think including soft information at the top (bot role and that stuff) is a good idea, because that information is less likely to be actively, purposefully recalled. But I wouldn’t waste that real estate on contextual information.

Language that initiates behavior should be the last thing the bot sees. This is also critical real estate.

But contextual information, especially if it’s unlikely to be overridden by training data, can probably be put anywhere.

Thanks for the useful reply.

Yes, I’ve seen those haystack results before, and since I’m not using massive context windows I’m not overly worried about the positioning of content.

It’s more a question of what the model chooses to focus on in terms of instruction. I agree with your point about the models typically putting more weight on the last tokens as the instruction. With that in mind, assuming that grounding the answers in the retrieved knowledge was the MOST important thing for your application, would it make sense to include the retrieved context just after the user query?

I guess I’m only looking for anecdotal evidence right now and yours is very helpful.

If someone ends up doing some actual research on it then even better.

Maybe… depends on how long the output is going to be.

You know that the output becomes part of the input during generation, so output tokens will push the end of your input towards the middle.

That’s why I mentioned that the end of the input is a good location for schema instructions (CoT is a schema in my mind). Once you get the ball rolling the model can feed off its own examples, so it’s ok that the primer gets lost in the middle.

That’s why I think your strategy only makes sense if your expected output is extremely short. Does that make sense?
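The point about output tokens pushing the end of the input toward the middle can be made concrete with a bit of arithmetic. This is only an illustration: the token counts are made up, and `relative_position` is a hypothetical helper, not a measure of what the model actually attends to.

```python
def relative_position(prompt_tokens, generated_tokens):
    """Where the end of the prompt sits in the full sequence (1.0 = very end)."""
    return prompt_tokens / (prompt_tokens + generated_tokens)

# Hypothetical 1,000-token prompt with the injected message at its end.
at_start = relative_position(1000, 0)          # nothing generated yet
after_short = relative_position(1000, 50)      # short answer
after_long = relative_position(1000, 1000)     # long answer
```

With a short answer the injected message stays close to the end of the sequence throughout generation; with an answer as long as the prompt, it ends up at the midpoint.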

Yeah that’s a good point. I think you’re right, for short answers it’ll possibly make a difference.

Ultimately I’ll have to go away and try it and report back.

Any updates on potential test cases?

Hi guys, personally I find results are better when the context is injected directly into the user’s message, with some adjustments so that, in its local context, the user message is formatted more like a system message.
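One way that could look in Python is sketched below. The exact formatting is not specified in the post, so the section labels and the `build_user_message` helper are assumptions made for illustration only.

```python
def build_user_message(question, context):
    """Fold the retrieved context into the user turn, phrased like an instruction."""
    content = (
        "Use the following information only to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )
    return {"role": "user", "content": content}

messages = [
    {"role": "system", "content": "General instructions..."},
    build_user_message("user question", "retrieved chunk text"),
]
```

With this layout the system prompt never changes, and each user turn carries its own context, so nothing has to be removed from the array on later questions.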