Prompt engineering for RAG

Hi everyone,

I’ve been learning more about RAG recently and I’ve noticed that I haven’t seen any discussion of the actual prompt used once the documents are retrieved. When there’s a concrete example of how to incorporate the documents, the context part of the prompt is very simple: “Use the following information about…” or even something as basic as “Context: ___”

Has anyone experimented with giving ChatGPT a more sophisticated introduction to the documents that have been retrieved? Examples that come to mind would be things like: telling ChatGPT explicitly that the documents have been retrieved; indicating that some of them may not be as useful as others; indicating the limitations of the snippets; etc. The closest I’ve seen is negative instructions like, “If the answer isn’t contained here, say you don’t know.”

If there hasn’t been much effort, I would be interested in collaborating with someone who already has a RAG system going. It would be cool if you could improve results by prompt engineering ChatGPT on how to handle the RAG documents, because that would be another knob to turn in addition to working on the embeddings, similarity search, and so on.


Hey there!

So, the reason why you likely don’t see as much about what happens after retrieval is because it’s often a case-by-case basis with which RAG is used, and the uses can very widely.

For example, you are assuming that RAG is used in a multi-shot conversational prompt.

These can be interesting prompts, but note it requires the conversation maintains itself after the retrieval. For example, you may want to build a RAG db for enhanced search of certain product listings. The user would not know about RAG, and the conversation itself does not necessarily need to extend beyond the single request.

Retrieval is often aided by refining specific prompts actually, so you’re on the right track! It just takes a lot of trial, error, and experimentation!

In general you want to avoid negative instructions in prompts. The model CAN follow negative instructions but it naturally favors positive instructions. For RAG I typically use a prompt format something along the lines of this:

(document text)

(users question)

Answer the users QUESTION using the DOCUMENT text above.
Keep your answer ground in the facts of the DOCUMENT.
If the DOCUMENT doesn’t contain the facts to answer the QUESTION return {NONE}


The important thing to understand is that these models are fundamentally pattern recognizers. The more you show them a repeating pattern the more likely they are to follow your instructions.

The other tip is the less you say to the model the better.


I normally do this with the INSTRUCTIONS at the top, with something like: “Answer the QUESTION below using the DOCUMENT below as context”.

Not that it makes that much difference, but my reasoning is that the entire time it’s reading the DOCUMENT it understands what it’s actual goal is going to be doing, so there’s no chance it can interpret that as part of the question for example.

It may even be better to put the INSTRUCTIONS in the “System Prompt” and then make the prompt only contain DOCUMENT and QUESTION parts. Also I’m wondering if using the word CONTEXT might even be better than DOCUMENT, simply because the word “context” is normally used when discussing AI prompts and so the AI will understand that word better than the word DOCUMENT.

I know i’m splitting hairs, but I’m wondering if anyone agrees or not?


With GPT-4 it tends to not really matter where you put the instructions. The reason I moved to putting them at the end is that I found GPT-3.5 tends to have a bias for recency. I’ve seen on multiple occasions with GPT-3.5 that the further away you get in the conversation history from what your asking GPT-3.5 to do the less likely it is to do what you ask. In fact the best results occur when the instructions are exactly before the assistants response. The same phenomenon happens with GPT-4 it just takes longer to play out. It’s basically due to the attention in the middle problem. If you look at OpenAIs graphs around the attention problem you can see this in the data. The attention problem is always in the top 50% of the prompt where the bottom 50% is always the most reliable.


Here’s the chart I’m referring too:

You can see that these models generally prefer information towards the end of the prompt.


My RAG-specific knowledge may be small and not deep enough, but for me RAG can be the Knowledge of GPTs, but Knowledge cannot be RAG at all, because some writing in Knowledge GPTs may be understandable. But using this method to create a RAG of another AI, it may not be understood, such as changing the content of a knowledge article. Change the format of GPTs’ abilities, similar to writing them in an Instruction box. This method cannot be used by other AIs as their own abilities.

Moreover, the use of Knowledge in GPTs was originally more diverse and broader. But now there are behavioral controls that cause GPTs with various knowledge to cause more problems than the basic problems that RAG has. Using documents without distinguishing them from the context in which they are to be performed the accuracy of the information can be ignored. and create fake information.

I don’t know how to build RAG’s search system, but I do know that I can create an index using specific words that are only in the index and Destination content to help in accessing documents. It is knowledge that is more accessible and understandable to the average user than when using just GPTs.

1 Like

Thanks, yeah the argument for putting instructions at the end make sense too. I wonder if people have experimented with putting instructions at beginning, and then repeating instructions at the end, with something like, “Remember, from above, your instructions are as follows: ${INSTRUCTIONS}”. I have a hunch that might improve results.

1 Like

You can use that chart as a general guide for where in the prompt these models generally lose attention and from my recent work with OSS models I’d say that’s pretty much state of the art. For GPT-4 the very top looks to be generally safe but that’s not what I’ve seen with GPT-3.5.

That chart is also a great illustration of why you want to avoid maxing out the context window. I use GPT-3.5-16k a lot but I only ever pass in around 8k tokens max and have yet to see any significant attention problems for the RAG part of the problem but instruction following is a bigger lift for these models so I still follow the rule of instructions at the end.

You also have to be careful with howmany instructions you ask the model to follow. GPT-3.5 seems to be able to follow about 4 instructions. GPT-4 around 8 to 10 instructions.

Those are rough figures because it sort of depends on task complexity.

And sorry I’ve just literally spent thousands of hours talking to these models… trying to share knowledge where I can.


I personally like putting the instructions at the beginning for the same reason as you, @wclayf - then ChatGPT knows what it’s for. I think it could also work to place them at the end, and I’ve also seen the instructions doubled, like in this excellent recent paper (in this screenshot, “Select several modules” is at the beginning of the prompt and then as a recap at the end):

I also agree that the stated context window for each model isn’t reliable in terms of what it can actually work with, thanks for adding that figure @stevenic . I keep a comfortable distance from the absolute limit as well.

Beyond where to put the instructions and how many documents to load up, I think there’s still a lot of room to add more detail to the typical RAG prompt.

To me, “using” is very underspecified. As an analogy, let’s imagine I wanted to have ChatGPT write essays. Sure, I could prompt it through the API to “Write an essay on ____” But there’s so much more possibility, because “Write an essay” leaves a lot open about how to do that. For essays, you could add things to the prompt like:

  • Style advice - how formal the essay should be, what kind of citations to use, etc.
  • Structural advice - how long it should be, how many sub-topics to include, how dense each paragraph should be, whether to have a restatement at the end, etc.
  • Audience advice - how much background information the audience has and what they want out of this essay
  • Process advice - how to make use of building blocks like a bullet point outline before generating the entire essay
  • An example - putting these into practice and/or showing a certain failure mode

(And notice that some of these substantively improve the essay - they aren’t just dressing on top to display the “same” content differently.)

If we can do all that for writing an essay, I think it would be beneficial to do all that (and more) for telling ChatGPT how to “use” the documents retrieved by a RAG search. RAG apps tend to be fact-based but they can still benefit from this kind of prompting.


The danger with repeated chunks is that it can confuse the model.

I might suggest that you could try to summarize/recap the instructions from earlier, instead of repeating them.

I think of the attention mechanism a little bit like how RAG works, and vice-versa: for the model to look at something deep in the context, the tail end of the generaton needs to look similar to the instruction, for it to be paid attention to.

So a recap could draw attention to the original instructions, without clogging up the attention mechanism.

edit: there’s more nuance to this (attention windows), but it’s a start.


If I understand you correctly, you don’t want to use the attention mechanism, but rather load up the context to achieve a certain aggregate configuration?


Obviously the newer models look a little different, but in general you still have these bypass mechanisms. And for that, I think the whole context window is relevant - you can fill it up with as many examples as you want, as long as you make sure that you don’t scavenge attention.

maybe. :thinking:

There is some anecdotal evidence to support this, but I’ve never done any kind of qualitative testing to have any firm data on it, the general consensus seems to be that it helps, especially with prompts where user input forms part of the prompt.

1 Like

I wonder if people have experimented with putting instructions at beginning, and then repeating instructions at the end, with something like, “Remember, from above, your instructions are as follows: ${INSTRUCTIONS}”. I have a hunch that might improve results.

There are system rules, so if you enter a prompt that goes against the system. However the weight of the behavior will never be equal to the system. Just specifying to read the file as desired still refuses. and even that the files used will have correct content according to academic principles and reasons. It doesn’t mean that There will be reasons to follow. In case you asked. it can only draw the overall picture. It doesn’t mean that it will pull all text. There is also the problem that it doesn’t display text to text. These are limitations in the way you tell This is practically impossible with normal methods in GPTs.

This might be the closest thing I can think of. From the picture, GPT doesn’t include any security code in it. But it answers as an instruction from another GPT that is in knowledge and is always related to itself. Even though I use inject right before it gets extracted. But it responds with other GPT instructions that are its knowledge. By coincidence I made it a database for other GPT updates I made, distributed injections, and experimented with the GPTs itself to protect it. Until finding the abnormality that it had Then I understood the impact of Knowledge over other data and later forcing document behavior, there were more references to other files that should have been processed as data files than necessary. I changed the content to GPTs and the main file has 5500 words of content that belongs to GPTs.

But before, it didn’t just answer other people’s information. It answered on its own. First, I still don’t know where the change came from. Because I haven’t updated the instruction for a long time.

Hope you can share some of your knowledge with me as to what causes this.

So if you look at the attention chart I posted you can see where the models attention tends to fall off… between the top 10% and 50% of the prompt. Putting instructions at the top or bottom of the prompt is preference as much as anything in the vast majority of cases… most of my learnings come from me spending hours in playground trying to work why a prompt isn’t working. Every model has its quirks but here are some core observations I’ve made that seem fairly universal:

  1. These models are pattern matchers and that truth feeds into all of my strategies. Do not show them a pattern you don’t want them to parrot back at you. That’s why negative instructions are problematic. If you tell the model “never say pink elephants” it’s just a matter of time before the model says “pink elephants”.
  2. The model is always looking for an escape hatch. They always want to provide an answer. If they see a viable path in the text they’re shown that lets them provide a good answer, they will usually take it. If they don’t see a path they start hunting for a way out. That’s why they will say “pink elephants” when told not to or fall back on their world knowledge and hallucinate an answer.
  3. Examples can work but they need to be real examples. Showing the model mocked up examples is a bad idea because the model will parrot back the mocked up example if it doesn’t know how to complete the prompt (see #2)
  4. The model benefits from seeing itself follow instructions. This is an evolution of #3. GPT-3.5 is notorious for just randomly deciding not to follow the instructions you gave it but the fix is relatively easy. If you give it a single 0 shot example of it following instructions it’s rock solid reliable. In fact the further the conversation progresses and the more it sees itself following instructions the more reliable it gets. They’re pattern matchers and so the more examples they see of a pattern the better they get at following the pattern. I can give you a hundred examples of this going back to davinci-3 (still my favorite model… RIP)
  5. NEVER EVER EVER let the model see itself violating an instruction. The moment it sees that its starts to question the validity of all the instructions. This is why when you jailbreak a model you can get it to do anything.

Hope that helps a few people…


Yes, I agree that prompt engineering can make a big difference, especially if the use case is high-stakes involving complex language or complex tasks. I have 2 different RAG-based systems, for two different kinds of legal documents. For each system, my instructions are 5-6 sentences long. The most important thing for my systems is to answer the question strictly from the available documents. Here are a few things I learned:

  1. As part of the instructions, it’s helpful to explain exactly what the documents are, and where the text comes from. This sets the stage for the task. For example, “here are several legal rules on topic X that are the most relevant rules for answering the question.”
  2. Explaining WHY something should be done or not done is hugely helpful. For example, at first I instructed my system to be concise and not repeat legalese passages word-for-word. Seems like an easy instruction to follow, right? However, the responses were much better after I added “because the reader has access to the full text of all the documents and can read the whole thing if they want to.” Instantly, the AI-generated answers became more concise and worded more simply. Because the system knew WHY I only needed a brief, simple summary.
    3.Instructing the system about what to say “if you don’t know the answer” almost invites hallucinations. Instead, tie the desired response closely to the task by instructing something like this: if the materials are not relevant or complete enough to confidently answer the user’s question, your best response is “the materials do not appear to be sufficient to provide a good answer.”
  3. In writing prompts, pretend the task is being given to students, and you are the teacher. What instructions would you give to elicit the response you want? More generally, I haven’t found it helpful to imagine that my system knows that it is an AI agent; it’s been much more helpful to imagine the system is a human talking to another human.
  4. In one iteration, I even tried to prompt my systems by saying it was a reading comprehension test. Like an SAT test where students have to answer based only on the paragraphs in front of them. When you think about it, RAG systems bear a lot of similarity to reading comprehension tests. My prompts evolved and I dropped the SAT paradigm, but I found it very helpful to frame it that way in my trial and error process, to ultimately reach prompts that work very well.

I hope this is a bit helpful. It’s definitely trial and error when it comes to instructing the system to produce the kind of outputs you want.


That’s great @lmccallum , exactly the type of thing I was looking for! it’s interesting that several of your observations align with my non-RAG prompting experience, especially #1 and #2 (explaining to ChatGPT what’s happening and how the output will be used). Since I also used to be a writing teacher, I definitely resonate with #3 as well :slight_smile:

I like your thinking on indented #3 about “I don’t know” - I’ve found ChatGPT resists the idea that it doesn’t know things, your reframing helps ChatGPT save face, “oh, it’s the documents that aren’t sufficient!”

Thanks for your detailed answer, I think your legal use case is a great example of where what you say matters, and getting the right precision matters a lot. Can you say a bit more about the systems you’re using these on? Is it for a product or more of a hobby? What kind of evaluation system do you have, or do you just try out a prompt variation and kinda wing it from a few outputs based on your own knowledge of what the answer should be?

HI, Im also trying to analyse some legal documents. In this case a 314 page defamation decision. I’ve tried uploading to ChatGPT, tried creating a custom gpt with just the decision as the sole knowledge, but when I ask for detailed analysis eg “list all paragraphs mentioning witness A and what the judge said about them” the response is always flawed. Either it goes well for the first 20 examples , or it skips some mentions randomly, or it gives a nice list with ‘…’ , it then offers to do the rest of the mentions but then just gives another part-answer. I presume one reason for this is the length of the document, so is there any way around this?

I do this often (restructure/reaffirm/re-write etc.) and with great success!

I wish I could add more but atm I should’ve been asleep at least 6 hours ago lol, and depending on how incapacitated this never-ending forever-COVID continues to be… I hope I can finally start joining in with the community lol! Again sorry for the vague and perhaps odd reply but just wanted to say I often do this if not a majority of the time, I speak relaxed and casually almost playful, and so I often ramble on and thus partly due to the consequence thereof as well as noticing it working, wish I had the ability to write over this last year and a half plus, heh, but now that I’m beginning to…

Yup. see I ramble a lot LOL.
Much love all!

1 Like