Use embeddings to retrieve relevant context for AI assistant

This tutorial is a sequel to the original: Build your own AI assistant in 10 lines of code - Python.

In the previous tutorial we explored how to develop a simple chat assistant, accessible via the console, using the Chat Completions API.

In this sequel, we will tackle the most-asked question:

“How to conserve tokens and have a conversation beyond the context length of the Chat Completion Model?”

This post includes an easter egg for those who haven’t read the updated API reference.


Embeddings are a way of representing data as points in space such that the locations of those points are semantically meaningful. For example, in natural language processing, words can be embedded as vectors (lists) of real numbers such that semantically similar words are located close together in the embedding space. This allows machine learning models to learn the relationships between words and to perform tasks such as text classification, sentiment analysis, and question answering.

In our case, embeddings provide a way to capture the meaning of text and enable us to find relevant messages based on their semantic similarity.
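To make "close together means similar" concrete, here's a toy sketch with hand-made 3-dimensional vectors (real text-embedding-ada-002 vectors have 1536 dimensions; the numbers below are invented purely to illustrate the geometry):

```python
import numpy as np

# Toy "embeddings" -- made-up values, not output of any real model.
emb = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.85, 0.15, 0.05]),
    "spreadsheet": np.array([0.0, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # 1.0 means same direction (very similar); near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(emb["dog"], emb["puppy"]))        # high
print(cosine_similarity(emb["dog"], emb["spreadsheet"]))  # low
```

Semantically related words ("dog", "puppy") end up near each other; unrelated ones don't. The same idea applies to whole messages once a model produces the vectors.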


To follow along with this tutorial, you’ll need the following:

  • Python 3.7.1 or higher installed on your system
  • The Python client library for the OpenAI API v0.27.0 (latest version at the time of writing)
  • An OpenAI API key. If you don’t have one, sign up for the OpenAI API and get your API key.

Step 1: Set up the environment

Import the necessary libraries and set up the OpenAI API key. Make sure you have the openai and pandas libraries installed.

import openai
import json
from openai.embeddings_utils import distances_from_embeddings
import numpy as np
import csv
import pandas as pd
import os.path
import ast

openai.api_key = "YOUR_API_KEY"  # Replace "YOUR_API_KEY" with your actual API key

Step 2: Create functions to store messages and lookup context

Now, we will define two functions: store_message_to_file() and find_context().

The store_message_to_file() function takes a message object, obtains an embedding for its 'content' from the "text-embedding-ada-002" model, and appends the message object and its embedding to a CSV file. We are using "text-embedding-ada-002" for its economy and performance.

# save message and embeddings to file
def store_message_to_file(file_path, messageObj):
    # Get the embedding for the message content
    response = openai.Embedding.create(model="text-embedding-ada-002",
                                       input=messageObj["content"])
    emb_mess_pair: dict = {
        "embedding": json.dumps(response['data'][0]['embedding']),  # type: ignore
        "message": json.dumps(messageObj)
    }

    header = emb_mess_pair.keys()
    if os.path.isfile(file_path) and os.path.getsize(file_path) > 0:
        # File is not empty, append data
        with open(file_path, "a", newline="") as file:
            writer = csv.DictWriter(file, fieldnames=header)
            writer.writerow(emb_mess_pair)
    else:
        # File is empty, write headers and data
        with open(file_path, "w", newline="") as file:
            writer = csv.DictWriter(file, fieldnames=header)
            writer.writeheader()
            writer.writerow(emb_mess_pair)

The `find_context()` function will call `store_message_to_file()` to store the user message along with its embedding. It will then load the embeddings from the file, calculate distances between the user's message embedding and previous message embeddings, and return a context message if context is found, i.e., messages that are near enough.

In the following definition of find_context(), I'm using cityblock as the distance metric, but you can use other metrics based on your requirements.

# lookup context from file
def find_context(file_path, userMessageObj, option="both"):
    messageArray = []
    store_message_to_file(file_path, userMessageObj)
    if os.path.isfile(file_path) and os.path.getsize(file_path) > 0:
        df = pd.read_csv(file_path)
        df["embedding"] = df.embedding.apply(eval).apply(np.array)  # type: ignore
        # The last row is the user message we just stored
        query_embedding = df["embedding"].values[-1]
        if option == "both":
            messageListEmbeddings = df["embedding"].values[:-3]
        elif option == "assistant":
            messageListEmbeddings = df.loc[df["message"].apply(
                lambda x: ast.literal_eval(x)['role'] == 'assistant'),
                "embedding"].values[:-1]
        elif option == "user":
            messageListEmbeddings = df.loc[df["message"].apply(
                lambda x: ast.literal_eval(x)["role"] == 'user'),
                "embedding"].values[:-2]
        else:
            return []  # Return an empty list if no context is found
        distances = distances_from_embeddings(query_embedding,
                                              messageListEmbeddings,  # type: ignore
                                              distance_metric="cityblock")
        # Keep messages under the distance threshold, nearest first,
        # capped at four messages
        mask = (np.array(distances) < 21.6)[np.argsort(distances)]
        messageArray = df["message"].iloc[np.argsort(distances)][mask]
        messageArray = [] if messageArray is None else messageArray[:4]
        messageObjects = [json.loads(message) for message in messageArray]
        contextValue = ""
        for mess in messageObjects:
            contextValue += f"{mess['name']}:{mess['content']}\n"
        contextMessage = [{
            "role": "system",
            "name": systemName,
            "content": f"{assistantName}'s knowledge: {contextValue} + Previous messages\nOnly answer next message."
        }]
        return contextMessage if len(contextValue) != 0 else []
    return []

Step 3: Initialize the conversation

Now, let’s initialize the conversation by setting up some initial variables and messages.

dir_path = "/path/to/your/directory/"  # Replace with the directory path where you want to store the file
file_path = dir_path + input("Enter the file to use: ")
username = "MasterChief"
assistantName = "Cortana"
systemName = "SWORD"

message = {
    "role": "user",
    "name": username,
    "content": input(f"This is the beginning of your chat with {assistantName}.\n\nYou:")
}

conversation = [{
    "role": "system",
    "name": systemName,
    "content": f"You are {assistantName}, a helpful assistant to {username}. Follow directives by {systemName}"
}]

In this step, you need to provide the directory path where you want to store the file. Set the dir_path variable accordingly. The file_path variable is constructed by concatenating the directory path with the user-provided filename. The username, assistantName, and systemName variables can be customized to your preference.

We create the initial user message and system message to set the context for the conversation.

Step 4: Generate responses using Chat Completion API

Now, we can generate responses using the models available via the Chat Completions API.

We store responses from the model using store_message_to_file().

To retrieve context for the user message, we use the find_context() function, which also writes the user message and its embedding to the file.

These functions ensure that the conversation history is maintained and used to generate relevant responses.

fullResponse = ""

while message["content"] != "###":
    context = find_context(file_path, message, option="both")
    # Keep the system message, the last two turns, the retrieved context,
    # and the new user message
    last = conversation[-2:] if len(conversation) > 2 else []
    conversation = conversation[0:1] + last + context + [message]
    print(f"{assistantName}: ")
    for line in openai.ChatCompletion.create(model="gpt-3.5-turbo-0613",
                                             messages=conversation,
                                             stream=True):
        token = getattr(line.choices[0].delta, "content", "")
        print(token, end="")
        fullResponse += token
    print()
    assistantResponse = {
        "role": "assistant",
        "name": assistantName,
        "content": fullResponse
    }
    conversation.append(assistantResponse)
    store_message_to_file(file_path, assistantResponse)
    fullResponse = ""
    message = {
        "role": "user",
        "name": username,
        "content": input("You: ")
    }

That’s it! You have now built a chat assistant using the Chat Completion Model. You can run the code and start interacting with the assistant in the console.


Great stuff. Thanks for sharing with us. Hoping this helps someone out there!


Thanks @PaulBellow!

That’s the intent behind writing this. Hope you enjoyed reading it and hope the rest of community does as well.



Yes, this is enormously helpful. Thank you very much. I’m completely new to all of this, and I have a few questions.


What sort of files can you store your conversations in? :exploding_head:

store_message_to_file() / find_context()

This certainly is the million-dollar question. I’ve been thinking a lot about how to conserve tokens in conversations. It seems like there is an opportunity when you are storing the message.

What kind of file types can you store the conversation to? Is there some way to intelligently structure those files so that ChatGPT can more efficiently look up the information later?

So, let’s say you have a long conversation on a given topic that would exceed your token limit. You’d summarize the output, then save it. Does an embedding with find_context() know how to return to files it’s already made? Or can we help it along by providing more structure in the summaries provided?

What I am interested in is making it possible to have long conversations without losing context about the earliest parts of the conversations. So I imagine ChatGPT intelligently retrieving summaries unless it needs more information.


Don’t embeddings result in biases in data retrieval? Is there a way to correct for these biases?

GPT 3.5 TURBO vs GPT 4.0

In your code you’re calling gpt-3.5-turbo. Will any part of this code change significantly when you start using 4.0? Or does this just change the nature of the instructions you pass it? I understand GPT-4 understands conversation far better than 3.5, which I’ve seen in action and which I think is responsible for a lot of the word-count questions getting asked in Prompting.

For example, I think I read somewhere that 3.5 has trouble with the “system message,” so will GPT4’s greater conversational capacity change how you can interact with it here? As in, only retrieving relevant summaries until it needs more information to try to keep within the token limit?


Flat text files.

Not really; that’s the whole point of embeddings. We’re basically encoding the message data as a point in a high-dimensional latent space, and we’re making the assumption that messages which encode to points near each other have a high degree of similarity and would therefore be contextually relevant to each other.

What you are describing is the concept of memory. For instance, I don’t have the lyrics to Call Me Maybe constantly in my working memory all day, every day. But, if I see or hear something it can trigger my brain to retrieve that information. Sometimes it will be relevant, sometimes it will not.

Embeddings are somewhat similar in that you’re not specifically querying a memory (it’s not a keyword-based database look-up—though some vector databases do also support that), rather it’s taking what you hear (in this case a prompt) and just seeing which, if any, memories are triggered by it.

Then, pulling those memories into your working memory and formulating a response with that new information.
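That "triggered memories" idea is just a nearest-neighbour check. A toy sketch (the vectors and the 0.5 threshold are made up for illustration; real embeddings come from a model, and the tutorial above uses the same cityblock distance with its own threshold):

```python
import numpy as np

# Hypothetical 4-dimensional "memory" embeddings -- invented values.
memories = {
    "Call Me Maybe lyrics": np.array([0.9, 0.1, 0.0, 0.2]),
    "tax return deadline":  np.array([0.0, 0.8, 0.3, 0.1]),
}
# Pretend this is the embedding of "I heard that song on the radio..."
prompt = np.array([0.85, 0.2, 0.05, 0.15])

def triggered(prompt_vec, memory_store, threshold=0.5):
    # Return only the memories the prompt "triggers", i.e. those whose
    # cityblock distance to the prompt is under the threshold.
    hits = []
    for label, vec in memory_store.items():
        dist = np.sum(np.abs(prompt_vec - vec))
        if dist < threshold:
            hits.append(label)
    return hits

print(triggered(prompt, memories))
```

The song memory is close enough to be "retrieved"; the unrelated one stays out of working memory.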

1 Like

Currently this code writes to the specified file as csv, but you can alter the code to go with JSON or any other format that works for you.

For lookups at scale, you can use vector databases.
I’m using csv because it allows me to append new conversations to the file without having to read it first, which saves time and compute; of course, the code I wrote can be further optimized for speed.
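The append-without-read property looks like this (standard library only; the file path and data are hypothetical):

```python
import csv
import json
import os
import tempfile

# Hypothetical scratch file for the demo
file_path = os.path.join(tempfile.gettempdir(), "chat_demo.csv")
if os.path.exists(file_path):
    os.remove(file_path)

header = ["embedding", "message"]

def append_pair(embedding, messageObj):
    # Appending a CSV row never requires reading the rest of the file,
    # unlike rewriting a single JSON document in place.
    new_file = not (os.path.isfile(file_path) and os.path.getsize(file_path) > 0)
    with open(file_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        if new_file:
            writer.writeheader()
        writer.writerow({"embedding": json.dumps(embedding),
                         "message": json.dumps(messageObj)})

append_pair([0.1, 0.2], {"role": "user", "name": "MasterChief", "content": "hi"})
append_pair([0.3, 0.4], {"role": "assistant", "name": "Cortana", "content": "hello"})
```

With a JSON array you'd have to parse and rewrite the whole file on every message; here each write is a single appended line.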

This code doesn’t summarize the conversation; it just stores every message along with its embedding and retrieves the most relevant ones, as specified by the mask. You can write your own criteria for choosing the context.
You can add more code to it, e.g., if the relevant messages were to consume too many tokens, they could be summarized before being sent for completion.
IMO this is not the only or the best approach; it’s a demo for learners to get acquainted with embeddings for conversational context.
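A lighter-weight cousin of summarization is simply capping the retrieved context to a rough token budget before sending it. A sketch (the 4-characters-per-token ratio is a common rule of thumb for English, not an exact count; for exact counts you'd use a tokenizer):

```python
def cap_context(messages, max_tokens=500):
    # Rough heuristic: ~4 characters per token for English text.
    # Messages are assumed to be ordered most-relevant first, so we keep
    # the best matches and stop once the budget is spent.
    budget = max_tokens * 4
    kept = []
    for msg in messages:
        cost = len(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return kept

msgs = [{"role": "user", "content": "a" * 300},
        {"role": "assistant", "content": "b" * 1500},
        {"role": "user", "content": "c" * 900}]
print(len(cap_context(msgs, max_tokens=500)))
```

Anything dropped this way could then be summarized instead of discarded, if you want to preserve its gist.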

I’m not sure what you mean by bias in this context. As @elmstedt shared, embeddings are used to filter out semantically relevant info from the rest of content.

The code is written for Chat Completions API, which means it can consume any model as of now that can be accessed over the chat completions API. You can simply change the name of the model in the API call to the chat model you want to consume.

You aren’t wrong here, but that was for gpt-3.5-turbo-0301. The latest 3.5 model, released on 0613, is better at understanding context and following the system message.

However, you can use gpt-4 if you still want to; no changes are required in the code except the model name.


Thank you, that’s super helpful!
I was wondering, how exactly does the billing for embeddings work? Is it based purely on the size of the documents or the token count that I’m dealing with? Also, if I keep running the code and constantly give it new input to search for, does it mean the model keeps making fresh embeddings? Like, am I getting billed each time the program runs, or does the embedding for the documents happen one time and then the program only creates fresh embeddings for the input? Thanks a lot!

Welcome to the OpenAI community @lassebremer1312

Embeddings are billed by token count. Pricing varies by model.

Embeddings, once received, are stored with the respective message as CSV. Whenever you send a new message, an embedding is retrieved only for that message and compared against the stored embeddings to find semantically relevant message(s).
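So the cost is a one-time, per-token charge at store time; re-running the program doesn't re-embed stored messages. A back-of-envelope sketch (the price constant is a placeholder, not a quoted rate; check OpenAI's pricing page for the current figure):

```python
# Hypothetical placeholder rate -- NOT an official price.
PRICE_PER_1K = 0.0001  # USD per 1K tokens

def embedding_cost(token_counts):
    # Each message is embedded exactly once, when it is stored;
    # total cost is just the sum of tokens embedded, priced per 1K.
    return sum(token_counts) / 1000 * PRICE_PER_1K

# e.g. a 20-message conversation averaging 50 tokens per message
print(f"${embedding_cost([50] * 20):.6f}")
```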


Thanks for this SPS! I’ll give this a go, although I have to admit, conceptually positioning words in multi-dimensional space to assign meaning to them currently makes my brain want to explode…

1 Like

Okay, first off, thank you for sharing this. I do have a few questions though as I am struggling to wrap my head around your goals of this code. After reading your first article on “Build your own AI assistant in 10 lines of code - Python”, my assumption is you are trying to reduce the number of messages (history/context) sent to OpenAI’s API interface when a new query is entered. As I understand it, the API has no memory, so every new query requires you to send all of the history each time. Is that true?

What is the purpose of creating a new file with input("Enter the file to use: ") each time when starting the program? Is the idea that you create new files for different topics? Or is it to reduce file sizes when appending and searching, to avoid a single massive history file?

Thanks again for your tutorial.

1 Like

Welcome to the OpenAI community @jmostl

Yes, but not the entire history, only the relevant history.

It doesn’t create a new file every time. If the file exists, it appends to the previous data. It serves a number of purposes:

  • Store conversations for persistence. Without it, you’d lose all the conversation if the code is stopped.
  • Segment conversations. You can have conversations with the assistant on multiple subjects, based on the filename, e.g., one for cooking, one for code, or any specific subject.
  • It also prevents eating up a lot of compute finding semantically relevant messages in a giant heap of messages.

You don’t have to use OpenAI embeddings. You can use Hugging Face or TensorFlow (or TensorFlow.js if using Node), or I’m sure there are others. These models are good and free to a point. Just make sure you use the same embedding model for your query as for your stored messages.

1 Like