How to pass conversation history back to the API

Hello, I feel this is a silly question, so please forgive me if it is -

I am building a chatbot, and using a redis cache to store previous messages.

I am currently “passing it back” to the API by:

  • manually concatenating previous messages into a formatted string like:
previous_convo = """
CHAT HISTORY
--------
USER
message1

ASSISTANT
message2

USER
message3

ASSISTANT
message4
"""

and then doing a:

prompt = f"""
Respond to the user's current message: 

{previous_convo}

CURRENT QUESTION: 
{current_question}
"""

And then using the completions API like this:

from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
  ]
)

Note: there is a single “user” message containing the entire chat history in the string itself

I do this for every message in the conversation! So every call to the LLM has a single dict for the user message.


Option 2 I thought of would be something like using a list of dicts of previous messages:

previous_messages = [
    {"role": "user", "content": message1},
    {"role": "assistant", "content": message2},
    {"role": "user", "content": message3},
    {"role": "assistant", "content": message4},
]

and then using the API like this:

from openai import OpenAI
client = OpenAI()

system_prompt = {"role": "system", "content": "You are a helpful assistant."}
messages = [system_prompt]
messages.extend(previous_messages)
messages.append({"role": "user", "content": "message5 content"})

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=messages,
)

So in this case the full request would look like:

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[ 
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "message 1 content."},
{"role": "assistant", "content": "message 2 content"},
{"role": "user", "content": "message 3 content"},
{"role": "assistant", "content": "message 4 content."},
{"role": "user", "content": "message 5 content."}
 ],
)

Context length issues aside (assume short conversations):

Question: how should I be “communicating” the previous messages back to the LLM?


You are correct to re-think your original idea.

The messages format is exactly that: a mechanism to pass the previous interactions between user and assistant, placed before the most recent user input.

The final form you show is correct.

When the conversation length grows significantly, you can (and must) start discarding older messages, with room for innovation in still managing context so the AI appears to have high-quality memory.
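As a minimal sketch of that discard step (the turn cap and helper name here are arbitrary illustrative choices, not a prescription):

```python
MAX_TURNS = 10  # arbitrary cap on user/assistant pairs kept verbatim

def trim_history(messages):
    """Keep the system message plus only the most recent turns."""
    system, history = messages[0], messages[1:]
    # Each turn is one user + one assistant message, so keep 2 * MAX_TURNS
    return [system] + history[-MAX_TURNS * 2:]

# Build a toy 30-message history to show the trim
messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(30):
    role = "user" if i % 2 == 0 else "assistant"
    messages.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(messages)
assert len(trimmed) == 21  # 1 system message + 20 most recent messages
```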


@_j

Hm - so how do I prompt the LLM with instructions to reference the chat history, or parts of it? In the system message?

something like

system_msg = "based on the chat history, summarize the current message into a standalone message"

messages = [ 
{"role": "system", "content": system_msg},
{"role": "user", "content": "When does the sun come up in the summer?"},
{"role": "assistant", "content": "usually 6am"},
{"role": "user", "content": "CURRENT MESSAGE: What about next season?"},
{"role": "user", "content": "SUMMARY: "}
 ]

Is this correct? Two user messages back to back?

EDIT: In my small tests, both options 1 and 2 seem to “work” well enough for these small toy examples

The AI will always consult the conversation history for content. You don’t need to tell it anything about its conversation ability or have it rewrite things.

{"role": "system", "content": "You are ExploreAI, an outdoorsman's expert assistant."},
{"role": "user", "content": "What's the tallest mountain in the world?"},
{"role": "assistant", "content": "The tallest mountain in the world is Mount Everest, which reaches approximately 29,029 feet above sea level."},
{"role": "user", "content": "How does its height compare to the second tallest?"},
{"role": "assistant", "content": "Mount Everest is about 778 feet taller than K2, the second tallest mountain, which stands at approximately 28,251 feet above sea level."},
{"role": "user", "content": "Are they both part of the same mountain range?"},
{"role": "assistant", "content": "Yes, both Mount Everest and K2 are part of the Himalayan mountain range, which extends across five countries in Asia."},
{"role": "user", "content": "Which countries do they lie in?"},
{"role": "assistant", "content": "Mount Everest is located on the border between Nepal and China, in the Mahalangur Himal sub-range of the Himalayas. K2 is located on the China-Pakistan border, in the Baltistan region of the Karakoram range."},
{"role": "user", "content": "Is one harder to climb than the other?"}

You can see this simulated chat (which I had AI create…) has questions that require the previous context. Each question would be hard to answer without seeing the previous flow, and yet becomes easy when the history is given.

About the only case where you’d want to tell the AI “you are a conversation AI that will be given a history of past chat to maintain an illusion of memory” is when it would assist the AI in planning things - like it knows it can engage in a long interview and doesn’t have just one turn to achieve its goals.

One technique to save some expense is to take several turns of the oldest chat and have another AI call just make them into a summary. Then we know the topic is still mountains instead of just a big discard. The AI doesn’t really need to know exactly what it wrote before (it still has the skills to write that again), unless you are specifically discussing prior AI compositions for refinement or code for improvement.
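That oldest-turns summarization might be sketched like this; the helper names, model choice, and word limit are my assumptions, and the summarizer is passed in as a function so it can be stubbed or swapped:

```python
def summarize_oldest(messages, summarize_fn, keep_recent=6):
    """Collapse the oldest turns into one summary message, keeping recent turns verbatim."""
    system, history = messages[0], messages[1:]
    if len(history) <= keep_recent:
        return messages  # nothing old enough to summarize yet
    old, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in old)
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize_fn(transcript)}",
    }
    return [system, summary_msg] + recent

def openai_summarize(transcript):
    """One extra, cheap model call that turns the old turns into a short summary."""
    from openai import OpenAI  # imported here so the sketch loads without the package
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize this chat excerpt in under 100 words, keeping the topic and key facts."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content
```

In production you would call `summarize_oldest(messages, openai_summarize)` whenever the history exceeds your budget.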

Sorry - what if I want to explicitly ask the LLM to rephrase the final question in a conversation (for RAGing)? In your example, the next response from the LLM should be:

“Is mount Everest harder to climb than K2?”

I need to somehow instruct it to “look at previous messages” to rephrase the conversation

@_j - observe how W&B does it for their RAG bot: wandbot/src/wandbot/chat/query_enhancer.py at feat/v1.1 · wandb/wandbot · GitHub

that’s more in line with my first idea!

It sounds like you propose another AI language model call and expense to simply rewrite the user input to resolve prior references and make a standalone question.

For example, having my same AI make that a standalone question, we get: “Is Mount Everest or K2 considered more challenging to climb, especially in terms of the technical aspects of mountaineering and the overall risks involved?”

The AI even embellished the question beyond the user input.

That is not necessary, however. That would double the expenses of running your AI.

It is also problematic. What if I was having the AI work on 300 lines of Python code, with revisions and different tasks on my code base being performed? A standalone-question conversion could require the AI to resynthesize all the parts and rewrite all that code so it could be understood as if in a new session, a context larger than the AI model would be willing to write as output.

Instead, you can just run embeddings on several recent conversation turns. The embeddings AI can understand conversational flow, and semantics will be drawn towards the conclusion and the next output. It is an order of magnitude less expensive, and faster.
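A sketch of embedding recent turns directly; the turn count and the embedding model name are my assumptions, not stated above:

```python
def format_recent(messages, recent_turns=3):
    """Flatten the last few user/assistant turns into one text block for embedding."""
    recent = [m for m in messages if m["role"] != "system"][-recent_turns * 2:]
    return "\n".join(f"{m['role']}: {m['content']}" for m in recent)

def query_embedding(messages):
    """Embed recent conversation turns directly as the retrieval query."""
    from openai import OpenAI  # imported here so the sketch loads without the package
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model choice
        input=format_recent(messages),
    )
    return resp.data[0].embedding
```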


If you were still going to invest in doubling the output cost, you can do better. You’d also take the opportunity to have the separate AI, that is instructed on this singular task, write a hypothetical answer (using just the knowledge it has available). This answer, also included in the embeddings input you send, can be more similar to documentation than just a question alone, even if it contains none of the actual data that would be retrieved from knowledge.

Then your chatbot messages for this specific task would be like:

system: You are an automated data preparation AI. You take the last and final user input in a chat session given to you, and resolve all pronouns, anaphoric and cataphoric reference, and include the required prior context about the topic to now make that a standalone question, suitable for beginning a new chat session. Your response:

  1. produce the new high-quality standalone user question;
  2. produce the best type of brief assistant answer that you can write, which needs to be only hypothetical, answered as well as you can, but MUST be answered and never denied, as it is for augmentation and not for the user to read.

user: Here’s prior chat. Rewrite the final question:
(plain text user/assistant exchanges)
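Put together as a single call, that could look roughly like the following; the function names are mine and the system prompt is a condensed paraphrase of the one above:

```python
REWRITE_SYSTEM = (
    "You are an automated data preparation AI. Resolve all pronouns and prior "
    "references in the final user input to make it a standalone question, then "
    "write a brief hypothetical assistant answer. The answer is only used to "
    "augment an embeddings query and is never shown to the user."
)

def flatten_chat(messages):
    """Render the user/assistant turns as the plain-text exchange the prompt expects."""
    return "\n".join(
        f"{m['role']}: {m['content']}" for m in messages if m["role"] != "system"
    )

def standalone_with_hypothetical(messages):
    from openai import OpenAI  # imported here so the sketch loads without the package
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": REWRITE_SYSTEM},
            {"role": "user", "content": "Here's prior chat. Rewrite the final question:\n" + flatten_chat(messages)},
        ],
    )
    # The question plus hypothetical answer becomes the embeddings input
    return resp.choices[0].message.content
```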

All this rewriting delays the user in getting their first token of actual language back from their input.

You should use the ChatML approach, with the User/Assistant alternating streams.

But the old school completion approaches would probably work too.

Each model has its own input style, some have the completion approach, similar to many open source models and GPT-3 models of yore.

But the push is for ChatML style User/Assistant, as everything is structured in the JSON, without the concatenated style from the completion models.

Not surprised both styles work for the newer models.


If you “export” the chat history from your standard ChatGPT account (all those conversations stored on the left), you will see in the JSON output IDs for sessions and questions (and re-questions, etc.). That indicates to me that the session is being stored somewhere as the chat takes place. I don’t keep any history in my chat sessions, and they seem to respond the same way as when the first models came out (back when I did throw the whole history back up).

This is not quite correct. I’ve been using this method since I started, and for my particular use case (maintaining the general context of a conversation) it has worked almost flawlessly. Also, while I use gpt-4-turbo for the final responses, I use gpt-3.5-turbo to generate the standalone questions, so my cost to maintain context is actually less than sending the entire chat history in each prompt.

Does it unnecessarily embellish sometimes? Yes.

In a case like this, you are absolutely correct. My use of the “Standalone” question is simply to maintain conversational “context”. If I need to maintain conversational “details”, this is not an ideal solution.


from openai import OpenAI
client = OpenAI()

system_instruction = "You are a helpful assistant. Based on the chat history, please rephrase the final user's question into a standalone question."

messages = [
    {"role": "system", "content": system_instruction},
    {"role": "user", "content": "When does the sun come up in the summer?"},
    {"role": "assistant", "content": "Usually around 6am."},
    {"role": "user", "content": "What about next season?"}
]

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
)

The rephrased question is then available as completion.choices[0].message.content (the current SDK returns an object with attribute access, not a dict):

print(completion.choices[0].message.content)

This is a case where the “messages” work against you; you’ll have better chat summary and input consolidation if you supply the data to be acted on within a single “user” input, with the “chat” formatted for understanding within that single input and task.

After having a chat with gpt-4, we switch to your model and prompt. It doesn’t offer enough that we can delete all prior input…or make a better text for embeddings.

I tried to improve on this across all gpt-3.5 models, with a more task-related system prompt and by prefacing the final question with instructions. Results were poor, ranging from just repeating the question to answering it outright (the most recent gpt-3.5-turbo-0125 especially should not be trusted to comprehend a context). Only GPT-4-0613 could follow an instruction to contemplate and rephrase to standalone when the history was still in individual chat messages:

How long should the yudane premix, which involves mixing flour with boiling water as part of a Japanese breadmaking process, rest before being incorporated into the dough?

vs GPT-4-0314 one message:

In the Yudane method for making soft bread, how long should the premixed slurry of flour and boiling water rest before incorporating yeast and other ingredients?

The only 3.5 now capable of not just repeating is gpt-3.5-turbo-0301. -0613 could have followed such instructions, but that quality was broken months after release. gpt-3.5-turbo-0613 with system+user instruction + user data:

How long should the yudane premix rest before being incorporated into the dough?

gpt-3.5-turbo-0301 - more talented, but less predictable output, shown below.

Maybe adding a directive or guidance as part of your system’s instructions can help direct the GPT-3.5 model’s behavior, especially to avoid mere repetition of the question and to ensure it understands the context better. I always use a contentAvoidanceDirective.

system_instruction = """
You are a helpful assistant. Based on the chat history provided,
please rephrase the final user's question into a standalone question.
Make sure to incorporate the context of the discussion into your response
without repeating the question verbatim.
"""

messages = [
    {"role": "system", "content": system_instruction},
    {"role": "user", "content": "When does the sun come up in the summer?"},
    {"role": "assistant", "content": "Usually around 6am."},
    {"role": "user", "content": "What about next season?"}
]

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
)

or

here’s another example of how I might add it to my prompts:
contentAvoidanceDirective = "Please rephrase the final question based on the chat history, without repeating it verbatim."
finalMessage = "What about next season?"  # An example of a final message that needs rephrasing.
cleanedContent = """
When does the sun come up in the summer?

Usually around 6am.
"""

# Combine the directive with the final message and the cleaned content
user_input = f"{contentAvoidanceDirective}\n\nChat History:\n{cleanedContent}\nCURRENT QUESTION:\n{finalMessage}"

messages = [
    {"role": "system", "content": "You are a helpful assistant asked to interact based on the conversation's context."},
    {"role": "user", "content": user_input}
]

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
)

You can check my open source client, which passes all the needed history (and has many more features): GitHub - msveshnikov/allchat

I use a prompt builder that takes the token window size into account, then assigns 30% of that size to the “interactive chat history”, 30% to RAG from a vector index of all interactions, and the final 30% to local storage of text files that the LLM writes to disk if I tell it to. This method gives the illusion of a deep memory, and the rolling history prevents goldfishing.
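A rough sketch of that 30/30/30 split; the reserve fraction, the 4-chars-per-token estimate (a real build would use an actual tokenizer such as tiktoken), and the helper names are all my assumptions:

```python
def budget(window=8192, reserve=0.10):
    """Split the context window: 30% interactive history, 30% RAG results,
    30% notes the LLM wrote to disk, leaving ~10% for the response."""
    usable = int(window * (1 - reserve))
    share = int(usable * 0.30)
    return {"history": share, "rag": share, "notes": share}

def rough_tokens(text):
    # crude ~4-chars-per-token proxy; swap in a tokenizer for exact counts
    return max(1, len(text) // 4)

def fit(messages, limit):
    """Drop the oldest messages until the rough token count fits the limit."""
    out = list(messages)
    while out and sum(rough_tokens(m["content"]) for m in out) > limit:
        out.pop(0)  # rolling history: the oldest turn falls out first
    return out
```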