How to structure system prompt, RAG context, and user input for multi-turn RAG-based chatbots using OpenAI Chat Completions

Hi everyone,

I’m building a multi-turn RAG-based chatbot using the OpenAI chat.completions API, and I’m trying to determine the best strategy for handling the system prompt in a long-running conversation.

I use a detailed system prompt (roughly 400 tokens) that defines the assistant’s tone, formatting requirements (e.g., citation style, markdown usage), and response rules. For each user turn, I also inject:

  • Retrieved context from a vector store,
  • The user’s actual question.

My core question is:

When structuring the messages array, should I:

Option A:
Send the system prompt once at the start, and then continue appending user/assistant messages?

messages = [
    {"role": "system", "content": "You are a helpful AI, a legal search assistant with citation, markdown, tone, and style constraints ... [full system prompt]"},

    {"role": "user", "content": "Context:\n[Relevant law summaries]\n\nQuery:\nWhat are the constitutional limits on searches?"},
    {"role": "assistant", "content": "...[response with citations][1][2].."},

    {"role": "user", "content": "Context:\n[New retrieved excerpts]\n\nQuery:\nHow does the exclusionary rule apply in cases involving good faith exceptions under the Fourth Amendment?"},
    {"role": "assistant", "content": "...[response][3]."}
]

Or Option B:
Resend the system prompt before each user query, like this:

messages = [
    {"role": "system", "content": "You are a helpful AI, a legal search assistant with citation, markdown, tone, and style constraints ... [full system prompt]"},
    {"role": "user", "content": "Context:\n[Relevant law summaries]\n\nQuery:\nWhat are the constitutional limits on searches?"},
    {"role": "assistant", "content": "...[response with citations][1][2].."},

    {"role": "system", "content": "You are a helpful AI, a legal search assistant with citation, markdown, tone, and style constraints ... [full system prompt]"},
    {"role": "user", "content": "Context:\n[New retrieved excerpts]\n\nQuery:\nHow does the exclusionary rule apply in cases involving good faith exceptions under the Fourth Amendment?"},
    {"role": "assistant", "content": "...[response][3]."}
]

I’m particularly concerned with:

  • Best practices around context window usage and efficiency,
  • Whether repeating the long system prompt is necessary for maintaining behavior consistency,
  • And how the model handles system role memory across turns.

Any insights, links to cookbook examples, or recommendations would be greatly appreciated! Thanks in advance.


Usually a single system prompt at the beginning should be enough.
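For illustration, here is a minimal sketch of this approach (Option A) as a per-turn loop, assuming the official openai Python SDK; retrieve_context and the model name are placeholders for your own retrieval code and model choice:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a legal search assistant with citation, markdown, tone, and style constraints ... [full system prompt]"

# One system message at the start; only user/assistant turns are appended after it.
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def ask(question):
    context = retrieve_context(question)  # placeholder for your vector-store lookup
    messages.append({"role": "user", "content": f"Context:\n{context}\n\nQuery:\n{question}"})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the reply so later turns can build on it
    return answer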

There are many RAG articles in the cookbook, but depending on your needs it is also possible to upload a PDF directly and ask questions about it.

You can start with this one, which teaches how to build a vector store from multiple PDFs.

Most of these can be tested in the playground, so I suggest experimenting with some practical data: every RAG setup behaves differently depending on how well the data is pre-processed (or whether it is just raw documents), whether users keep very long or short conversations, and so on.


Did you find a solution for this? I’m stuck on the same problem: I want to know how the system prompt should be written for RAG applications. In a production environment we need to manage context for each user and stay within the token limit as well.

Chat Completions doesn’t persist state on the server: whatever you want the model to “remember” (system rules, prior turns, retrievals) must be included in the messages you send on each call. If you want true server-side conversation state, use the Responses or Assistants APIs. See the OpenAI documentation on Chat Completions.
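For the server-side route, here is a small sketch assuming the current openai Python SDK, where the Responses API chains turns with previous_response_id; the prompt text and model name are placeholders:

from openai import OpenAI

client = OpenAI()

SYSTEM_RULES = "You are a legal search assistant ... [citation, markdown, tone rules]"

# First turn: instructions act like the system prompt; input carries context + query.
first = client.responses.create(
    model="gpt-4o",
    instructions=SYSTEM_RULES,
    input="Context:\n[retrieved excerpts]\n\nQuery:\nWhat are the constitutional limits on searches?",
)

# Follow-up turn: previous_response_id tells the server to carry the prior conversation,
# so you only send what is new (instructions are passed again per request here).
second = client.responses.create(
    model="gpt-4o",
    instructions=SYSTEM_RULES,
    previous_response_id=first.id,
    input="Context:\n[new retrieved excerpts]\n\nQuery:\nHow does the exclusionary rule apply to good-faith exceptions?",
)
print(second.output_text)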

Recommended message ordering (per API call)

  1. system — short, authoritative instructions (tone, citation rules, hard constraints). Keep this concise.

  2. assistant — brief scratchpad or last assistant reply if you want the model to continue a prior answer.

  3. retrieved context — inject RAG results as a single controlled block (labelled, numbered, with provenance). Put this as a user-role “Context:” message or as an assistant-role “Retrieved evidence:” message — either works, but be consistent.

  4. conversation history (last N turns) — include only recent turns or a compressed summary for long chats. Don’t re-send the entire transcript every time (one way to trim is shown in the sketch after this list).

  5. user — the current user query.
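Putting the ordering above together, here is a rough sketch of per-call assembly for Chat Completions. Everything named here (SYSTEM_PROMPT, MAX_HISTORY_TURNS, the retriever, the model) is a placeholder to swap for your own pieces, not part of any API:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a legal search assistant. Cite sources as [n]. Use markdown. ..."
MAX_HISTORY_TURNS = 6  # keep only the last N user/assistant pairs (pick your own policy)

def build_messages(history, retrieved_chunks, user_query):
    # 1. Short, authoritative system message.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # 3. Retrieved context as one labelled, numbered block with provenance.
    context_block = "Retrieved evidence:\n" + "\n".join(
        f"[{i + 1}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(retrieved_chunks)
    )
    messages.append({"role": "user", "content": context_block})
    # 4. Only the most recent turns, not the whole transcript.
    messages.extend(history[-2 * MAX_HISTORY_TURNS:])
    # 5. The current user query goes last.
    messages.append({"role": "user", "content": user_query})
    return messages

def chat_turn(history, retriever, user_query):
    chunks = retriever(user_query)  # your vector-store search (placeholder)
    messages = build_messages(history, chunks, user_query)
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = response.choices[0].message.content
    history.append({"role": "user", "content": user_query})
    history.append({"role": "assistant", "content": answer})
    return answer

One design choice worth noting: the stored history keeps only the clean question/answer pairs, so stale retrieved chunks don’t pile up in the context window; each call rebuilds a fresh context block from the current retrieval.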