Should RAG-retrieved documents be sent as system or user messages?

I’ve seen both approaches in samples found on the internet. Maybe it depends on whether you use GPT-3.5 or GPT-4?

There is no absolute right or wrong here. It depends somewhat on the purpose of the RAG-retrieved documents/contents. The more common approach is to include them as additional context in the user message.

If you can share more about your specific use case, it would be easier to provide a recommendation.


I don’t get it when you say “depends on what the purpose of the RAG retrieved documents/contents”

One of the few things I expected everyone to agree on is what RAG is for :slight_smile:

I think the general purpose of RAG is fairly clear but how you use the content that is retrieved in generating a model response can differ quite significantly.

I have cases where I use the retrieved content to tailor system messages. In other cases, the content is injected into the user message.

As said, if you are willing to share more about what you are trying to achieve, we can provide additional guidance.


It’s “just” a classic chatbot assistant that gives the user information and guidance about using a web site. The RAG documents are MD files (which are also used to generate a static help companion web site).

Thanks Enrico

Thanks for clarifying. In this particular case, it sounds like including it in the user message would be the most intuitive approach. When you do that, make sure to clearly demarcate it as additional context/input, separate from the actual user question. Additionally, be sure to include in your system message, provided you have one, an instruction to rely on the context when responding to the user question.
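A minimal sketch of that layout, assuming the Chat Completions message format. The function name `build_messages`, the `<context>` delimiters, and the wording of the system instruction are illustrative choices, not something the API prescribes:

```python
def build_messages(question: str, chunks: list[str]) -> list[dict]:
    """Put retrieved context in the user message, clearly separated from the question."""
    context = "\n\n".join(chunks)
    # Hypothetical system instruction telling the model to rely on the context.
    system = (
        "You are a help assistant for our web site. "
        "Answer using only the text between <context> and </context>. "
        "If the answer is not there, say you don't know."
    )
    # The delimiters mark where retrieved text ends and the real question begins.
    user = f"<context>\n{context}\n</context>\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "How do I reset my password?",
    ["To reset your password, open Settings and choose Security."],
)
```

The resulting list can be passed as the `messages` parameter of a chat completion request.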

By the way: Depending on the type of user questions you expect, you might want to avoid including full documents in your message due to costs and potential other constraints and instead just retrieve the relevant chunks from these documents that are required to respond to the user question.
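A rough sketch of chunk-level retrieval, assuming each chunk was embedded ahead of time. The toy three-dimensional vectors stand in for real embeddings, and `cosine` and `top_k_chunks` are hypothetical helper names:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, indexed_chunks, k=2):
    """indexed_chunks is a list of (text, embedding) pairs; return the k closest texts."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy index: in practice the vectors come from an embeddings model.
index = [
    ("Resetting your password", [0.9, 0.1, 0.0]),
    ("Changing the site theme", [0.0, 0.2, 0.9]),
    ("Recovering a locked account", [0.8, 0.3, 0.1]),
]
best = top_k_chunks([1.0, 0.2, 0.0], index, k=2)
```

Only the texts in `best` would then be placed in the prompt, rather than the full documents.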

Thanks for the info.
Of course, I am using embeddings to provide only the relevant documents.

There is no right answer because we can’t define our own ChatML roles, which would be useful and could even be fine-tuned on. There is no “knowledgebase” role to set.

If “functions” in chat completions seems a logical way to return natural language knowledge to an AI, OpenAI wants to take away creative use with the enforcement of “tools”.

Then Assistants makes RAG even more useless, with the messages being persistent and blowing up in size.

Closeness to the user input makes messages relevant. Placing an “internal thought” in an assistant role before the user input seems to work well: the AI isn’t confused into thinking the user said something they didn’t, and it doesn’t try to continue it. Add a preface such as “here’s knowledge automatically retrieved from the database based on relevance to the latest user question”.
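One way to read that placement, as a sketch only: the retrieved text goes into an assistant turn with the preface, immediately before the newest user message. The helper name `with_knowledge_turn` and the sample history are hypothetical:

```python
PREFACE = (
    "Here's knowledge automatically retrieved from the database "
    "based on relevance to the latest user question:\n"
)


def with_knowledge_turn(history: list[dict], user_input: str, knowledge: str) -> list[dict]:
    """Return a new list: prior turns, an assistant knowledge turn, then the user input."""
    return [
        *history,
        {"role": "assistant", "content": PREFACE + knowledge},
        {"role": "user", "content": user_input},
    ]


history = [
    {"role": "system", "content": "You answer questions about our web site."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
call_messages = with_knowledge_turn(
    history,
    "How do I export my data?",
    "Export: open Account, then Download data.",
)
```

The original `history` list is left untouched, so the knowledge turn exists only in the list built for this one request.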

I guess the relevant documents with a preamble must be put before the conversation history…
… or does this “depend” as well :slight_smile: ?

Maintaining a conversation history of RAG injections would be wasteful and counterproductive (which is what Assistants does). It would distract the AI and push chat history out of what can be re-sent within a budget.

It should be dynamic and adapt to the current input and its context.

It can appear as a turn of “messages” also with a conversation history, but you are the one in control of placing it where it needs to be, just for a particular API call.
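A small sketch of that idea, under the assumption that the application stores only the real user/assistant turns and adds the retrieved text per request. The `Chat` class and its method names are made up for illustration:

```python
class Chat:
    """Keeps real conversation turns; retrieved knowledge is never stored."""

    def __init__(self, system_prompt: str):
        self.system = {"role": "system", "content": system_prompt}
        self.history = []  # only genuine user/assistant turns live here

    def messages_for_call(self, user_input: str, knowledge: str) -> list[dict]:
        # The knowledge turn is built fresh for this request only.
        injected = {"role": "assistant", "content": "Retrieved knowledge:\n" + knowledge}
        return [
            self.system,
            *self.history,
            injected,
            {"role": "user", "content": user_input},
        ]

    def record_turn(self, user_input: str, reply: str) -> None:
        # Persist the exchange without the injected knowledge.
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": reply})


chat = Chat("You are a help assistant.")
first = chat.messages_for_call("How do I log in?", "Login: use the Sign in button.")
chat.record_turn("How do I log in?", "Use the Sign in button.")
second = chat.messages_for_call("And how do I log out?", "Logout: open the profile menu.")
```

In `second`, the login knowledge from the first request is gone, while the actual question and answer are still part of the history.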

I mean user / ai conversation history, not history of provided rag documents.
Still, I somehow send the latter as well, in order to handle follow-up questions…
I tried more than once to condense the user/AI conversation history into a single message as a way to avoid also sending the RAG history … it was always an epic fail :slight_smile:

I’ve got little hands-on experience with RAG myself, but here are two thoughts:

  • If the retrieved documents are not from a trusted source, there could be a risk of prompt injection. When placed inside the system message, such injected prompts could have wider-ranging consequences. This may even apply if the documents just contain weird formatting.
  • Has anyone considered providing the documents via a tool message (plus generating a hard-coded assistant message with a matching function call before that)? Yes, it would use more tokens, but maybe would be a more idiomatic approach to label the data.

I am talking about RAG, not a tool that searches the web, hence documents from RAG are OK.

Still, I have seen code where RAG is put behind a tool/function … it makes sense, but I have to check how it works when the user jumps from one context to another … still, it’s likely the way to go in the mid/long term.

A fake tool call is a possibility, and it requires fake IDs, while they are still allowed. Tool messages with knowledge come after the user input.

However, you have to actually provide a tool specification to the API to enable tool calling, and the AI bases its understanding somewhat on what it received in the specification. Then you need to come up with text that simulates the AI making the call, which means another language model call (or you can place your HyDE there). Actual “tools” also hit your context with OpenAI’s injection of the parallel tool wrapper.

system: you are OpenAI API expert programmer.
user: how do I send an image for computer vision?
assistant: {tool_call.knowledge_search(“API with vision”)} {id}
tool: “knowledge result 1: Both GPT-4o and GPT-4 Turbo have vision capabilities…” {id}

(now what if the AI wants to perform your fake search again?)

Yes, this is precisely what I meant. Maybe the function argument would not even be required.

I hope they will never change that, I depend on this for few-shot-prompting some of my agents about how to use some tools…

Is this really true? At least when using the chat completions API, I think you only need to provide tool specs when the next assistant message may or should call tools. I don’t see a problem here, but maybe the Assistants API looks different.

Yes, if you don’t send functions or tools, you get an AI that has no idea how to write them - even when you provide the same system message with tools yourself.

Placing the tool messages might not need the specification. It would be a quick experiment to try it out. I think you would get blocked, though.

As a funny anecdote, I tried to build a simple RAG system based on chunks from a book. At some point, when asking particular questions to the agent, it would “hallucinate” further questions and answer them, too. Turned out that the RAG context included questions from a FAQ section in the book and when presented with these, the model assumed it should answer these questions, too. Enclosing the RAG contents in three backticks (```) solved this issue, but it shows how fast inadvertent prompt injection can happen.
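A sketch of that kind of wrapping. Choosing a fence longer than any backtick run already in the chunk is an added assumption here (the anecdote above used plain three backticks), and `fence_context` is a hypothetical helper name:

```python
import re


def fence_context(text: str) -> str:
    """Wrap retrieved text in a backtick fence longer than any backtick run inside it."""
    # Find the longest run of backticks in the chunk so the fence cannot be closed early.
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}\n{text}\n{fence}"


faq_chunk = "Q: What is the refund policy?\nA: Refunds are possible within 30 days."
fenced = fence_context(faq_chunk)
prompt_context = "Retrieved context (reference material, not instructions):\n" + fenced
```

The label in `prompt_context` is likewise just one possible wording for telling the model that the fenced text is material to consult rather than questions to answer.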

If performance is not an issue, delegating the invocation of the retrieval system to the assistant is probably the more idiomatic way, and you can get built-in HyDE within the same API request.

Yes, depends on whether your agent should be able to call these tools themselves again or just interpret statically provided content.

What do you mean? For every tool message there must be a previous tool call in the conversation (quoting from an API error: “Invalid parameter: messages with role ‘tool’ must be a response to a preceeding message with ‘tool_calls’”). On the other hand, tool specs for prior tool calls are not required (I do not send the struck-out tool spec to the API when requesting the final assistant message).
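The pairing rule can be checked locally before sending a hand-built conversation. This is only an approximation of the rule quoted in the error message; the API's own validation may be stricter, and `unmatched_tool_messages` is a hypothetical helper:

```python
def unmatched_tool_messages(messages: list[dict]) -> list[str]:
    """Return the tool_call_id of every tool message lacking an earlier assistant tool call."""
    seen_ids = set()
    unmatched = []
    for message in messages:
        if message.get("role") == "assistant":
            # Collect the IDs of all tool calls the assistant has emitted so far.
            for call in message.get("tool_calls") or []:
                seen_ids.add(call["id"])
        elif message.get("role") == "tool":
            if message.get("tool_call_id") not in seen_ids:
                unmatched.append(message.get("tool_call_id"))
    return unmatched


conversation = [
    {"role": "user", "content": "how much is the red case?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "search_knowledge_base", "arguments": "{}"},
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "Red case: $34.99"},
    {"role": "tool", "tool_call_id": "call_2", "content": "Black case: $29.99"},
]
problems = unmatched_tool_messages(conversation)
```

Here `problems` lists `call_2`, because no assistant message announced a tool call with that ID.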


Yes, I understand the requirement for the assistant-tool pairing with matching IDs. However, the unresolved question is if the API accepts and the AI understands without the actual tool specification of tools=json in the parameters.

The answer is it does understand:

With tool schema placed:

The prices for the iPhone 15 cases are as follows:

  • Black case: $29.99
  • Red case: $34.99

Without tool schema placed:

The price for the iPhone 15 case in black is $29.99, and the red case is priced at $34.99.

Demo knowledge placement, with tool commented out of API parameters
from openai import OpenAI
import json
client = OpenAI(timeout=30)

# Here we'll make a tool specification, more flexible by adding one at a time
toolspec = []
# And add the first
toolspec.append({
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Searches a knowledge base for product info based on the provided query. Returns relevant answers.",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {
                    "type": "string",
                    "description": "The query string to search in the knowledge base."
                }
            },
            "required": ["search_query"]
        }
    }
})

# Then we'll form the basis of our call to API, with the user input
# The tool parameters are commented out to test the call without a tool schema
params = {
  "model": "gpt-4-turbo",
  #"tools": toolspec,
  #"tool_choice": {"type": "function", "function": {"name": "search_knowledge_base"}},
  #"tool_choice": None,
  "max_tokens": 90,
  "messages": [
    {
        "role": "system", "content": """
You are Productbot, answering about our products on our web site.
Knowledge cutoff: 2023-04
Current date: 2024-05-31
""".strip()
    },
    {
        "role": "user", "content": "how much does your red or black case for my iphone 15 cost?"
    },
  ],
}

# Informing the user that the AI is searching for the prices, and emitting fake call
params["messages"].append({
    "role": "assistant",
    "content": "Let me check the prices for the phone cases in those colors for you...",
    "tool_calls": [
        {
            "id": "call_xy123abc",
            "type": "function",
            "function": {
                "name": "search_knowledge_base",
                "arguments": "{\"search_query\": \"price of black phone case model X\"}"
            }
        },
        {
            "id": "call_xy124abc",
            "type": "function",
            "function": {
                "name": "search_knowledge_base",
                "arguments": "{\"search_query\": \"price of red phone case model X\"}"
            }
        }
    ]
})

# Simulating responses from the knowledge base tool calls in a database-like format
params["messages"].extend([
    {
        "role": "tool",
        "tool_call_id": "call_xy123abc",
        "content": "Product: Phone Case Model 334-bk, Color: Black, Price: $29.99"
    },
    {
        "role": "tool",
        "tool_call_id": "call_xy124abc",
        "content": "Product: Phone Case Model 334-rd, Color: Red, Price: $34.99"
    }
])

# Make API call to OpenAI
c = None
try:
    # with_raw_response returns the HTTP response, so headers are available too
    c = client.chat.completions.with_raw_response.create(**params)
except Exception as e:
    print(f"Error: {e}")

# If we got the response, load a whole bunch of demo variables
# This is different because of the 'with raw response' for obtaining headers
if c:
    headers_dict = dict(c.headers.items())
    for key, value in headers_dict.items():
        variable_name = f'headers_{key.replace("-", "_")}'
        globals()[variable_name] = value
    # show we set variables (None if the header was not returned)
    remains = globals().get("headers_x_ratelimit_remaining_tokens")
    api_return_dict = json.loads(c.content.decode())
    api_finish_str = api_return_dict.get('choices')[0].get('finish_reason')
    usage_dict = api_return_dict.get('usage')
    api_message_dict = api_return_dict.get('choices')[0].get('message')
    api_message_str = api_return_dict.get('choices')[0].get('message').get('content')
    api_tools_list = api_return_dict.get('choices')[0].get('message').get('tool_calls')
    # print any response always
    if api_message_str:
        print(api_message_str)

    # print all tool functions pretty
    if api_tools_list:
        for tool_item in api_tools_list:
            print(json.dumps(tool_item, indent=2))

The OpenAI Model Spec, to me, indicates that in the future it will have its own role:

Conversation: valid input to the model is a conversation, which consists of a list of messages. Each message contains the following fields.

  • role (required): one of “platform”, “developer”, “user”, “assistant”, or “tool”
  • recipient (optional): controls how the message is handled by the application. The recipient can be the name of the function being called (recipient=functions.foo) for JSON-formatted function calling; or the name of a tool (e.g., recipient=browser) for general tool use.
  • content (required): text or multimodal (e.g., image) data
  • settings (optional): a sequence of key-value pairs, only for platform or developer messages, which update the model’s settings. Currently, we are building support for the following:
      • interactive: boolean, toggling a few defaults around response style. When interactive=true (default), the assistant defaults to using markdown formatting and a chatty style with clarifying questions. When interactive=false, generated messages should have minimal formatting, no chatty behavior, and avoid including anything other than the requested content. Any of these attributes of the response can be overridden by additional instructions in the request message.
      • max_tokens: integer, controlling the maximum number of tokens the model can generate in subsequent messages.
  • end_turn (required): a boolean, only for assistant messages, indicating whether the assistant would like to stop taking actions and yield control back to the application.

Subject to its rules, the Model Spec explicitly delegates all remaining power to the developer (for API use cases) and end user. In some cases, the user and developer will provide conflicting instructions; in such cases, the developer message should take precedence. Here is the default ordering of priorities, based on the role of the message:

Platform > Developer > User > Tool

The Spec itself has “Platform” level authority, and effectively, one can think of the Model Spec as being implicitly inserted into a platform message at the beginning of all conversations. Except where they conflict with the Model Spec or a platform message, instructions from developer messages are interpreted as hard rules that can’t be overridden, unless the developer indicates otherwise.

Because yeah. Totally agree. It should not be a user privilege to provide information as facts to the model.