Assistants - Embeddings and Vector Stores

In my C# .NET 8 program I have an e-mail assistant that fetches my e-mails into JSON.
The JSON looks like this:

[
  {
    "Id": "123@test.de",
    "From": "TestUser1 <testuser1@test.de>",
    "To": "TestUser2 <testuser2@test.de>",
    "Date": "16.05.2024 - 08:41:02",
    "Subject": "Test Subject 1",
    "Body": "This is Test 1"
  },
  {
    "Id": "321@test.de",
    "From": "TestUser2 <testuser2@test.de>",
    "To": "TestUser1 <testuser1@test.de>",
    "Date": "16.05.2024 - 08:46:24",
    "Subject": "Test Subject 2",
    "Body": "This is Test 2"
  },
  ...
]

I would like to create a vector store that my assistant can use, but I'm unsure how to proceed.

Let me briefly explain the steps as I have thought of them:

  1. build an embedding from each e-mail
  2. upload each embedding as file (to get the file ids needed for step 3)
  3. create a vector store with the corresponding file ids

Is that the correct way to do it?

My goal is that my assistant can handle prompts like

Summarize the recent correspondence between User X and User Y

Now to my questions:

Are the embeddings even required? Or can I just upload the JSON file as a whole to the vector store?

Once the vector store is initialized, can the assistant access it independently?

Please enlighten me; I have the feeling I have made a mistake in my thinking.

No, if you’re planning to use OpenAI Assistants, you don’t need to do the embeddings manually. As per the OpenAI documentation:

Once a file is added to a vector store, it’s automatically parsed, chunked, and embedded, made ready to be searched. Vector stores can be used across assistants and threads, simplifying file management and billing.

Once you add your files to a vector store, your assistant can search it directly. Follow this video on OpenAI Assistants V2 if you like. Let me know if you have any questions.
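To make that concrete, here is a minimal sketch of the flow using the OpenAI Python SDK. Treat the method names (`beta.vector_stores.create`, `beta.vector_stores.files.upload_and_poll`) as assumptions tied to the SDK's beta Assistants namespace at the time of writing, and the store name as a placeholder: each e-mail is flattened to plain text, everything goes up as one file, and OpenAI does the parsing, chunking, and embedding.

```python
def email_to_document(email: dict) -> str:
    """Flatten one e-mail record into a plain-text document for search."""
    return (
        f"From: {email['From']}\n"
        f"To: {email['To']}\n"
        f"Date: {email['Date']}\n"
        f"Subject: {email['Subject']}\n\n"
        f"{email['Body']}"
    )


def build_vector_store(api_key: str, emails: list[dict]) -> str:
    """Upload all e-mails as one text file attached to a new vector store.
    No manual embedding step: OpenAI parses, chunks, and embeds the file."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(api_key=api_key)
    store = client.beta.vector_stores.create(name="email-store")
    blob = "\n\n---\n\n".join(email_to_document(e) for e in emails).encode()
    client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=store.id,
        file=("emails.txt", blob),
    )
    return store.id
```

The returned store id can then be attached to an assistant through its `file_search` tool resources, so the assistant searches the store without any per-request upload.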


I am honestly a little confused whether I even need a vector store, or even an assistant, for that.

My goal is to create a model that can answer any question about the provided data (in this case, e-mails)…
Example question: “What is the Subject of my newest e-mail?”

I am unsure whether a vector store is the solution for that case, as I can’t really imagine that the keyword “newest” is interpreted in such a way that only the latest e-mail is actually retrieved.

Another factor is data protection… In practice, I would work with sensitive data. If I upload these files to OpenAI, I honestly don’t know how secure the data is there.

Correct. Let’s take an example: you pass in your information and you want a reply based on it, which in your case is an e-mail thread. If you only want to ask one question at a time, you don’t need an Assistant. If you want to have multiple Q/A sessions on that e-mail thread, you add the thread to a file and use the OpenAI Assistants API on it. For that case an Assistant is useful.

Now coming to your concern about data protection. In this digital world you can’t trust anyone with your sensitive information, but OpenAI has stated that any data you pass to the OpenAI APIs as conversations will not be used to train their models. They do store the data for 30 days, though that depends on which model you’re using. Plus, you can always ask OpenAI to reduce that data-retention window to zero.

Here’s an article where I’ve covered all OpenAI Data Policies

So is it correct that for each new “e-mail session” I need to re-upload the file? That seems a little redundant to me.

It would be great to just have a dynamic knowledge base in the background.
For example, I have the JSON file with the mails stored on a server, and every change I make to it (let’s say adding new mails) automatically becomes new knowledge for my assistant?

That can be done. You update the JSON file with new e-mail messages and send the updated file to the OpenAI vector store via an API call. Python or another language can do that for you.
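A sketch of that sync, again assuming the `openai` Python package and its beta vector-store endpoints: rather than re-sending the whole file on every change, it uploads each not-yet-seen mail as its own small file, using the `Id` field as a natural de-duplication key, so only the delta is transferred.

```python
def new_emails(emails: list[dict], uploaded_ids: set[str]) -> list[dict]:
    """Return only the e-mails whose Id has not been uploaded yet."""
    return [e for e in emails if e["Id"] not in uploaded_ids]


def sync_to_vector_store(api_key: str, vector_store_id: str,
                         emails: list[dict],
                         uploaded_ids: set[str]) -> set[str]:
    """Upload each unseen e-mail as its own file, so a sync after new
    mail arrives transfers only the new messages, not all 5,000+."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(api_key=api_key)
    for email in new_emails(emails, uploaded_ids):
        body = (f"From: {email['From']}\nTo: {email['To']}\n"
                f"Date: {email['Date']}\nSubject: {email['Subject']}\n\n"
                f"{email['Body']}").encode()
        client.beta.vector_stores.files.upload_and_poll(
            vector_store_id=vector_store_id,
            file=(f"{email['Id']}.txt", body),
        )
        uploaded_ids.add(email["Id"])
    return uploaded_ids
```

The `uploaded_ids` set would be persisted somewhere on your side (a file, a DB table) between runs; that bookkeeping is the price of the incremental upload.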

But isn’t that very time- and cost-intensive?
I can imagine that uploading a JSON file with 5,000+ entries each time is very time-consuming.

I am mainly interested in what would be “state of the art” in this case?

How would you solve that? :stuck_out_tongue:

In this case, I would still use ChatCompletion instead of the Assistants API. I would use a model with a 128k context window. How big is your e-mail thread?
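For reference, the ChatCompletion route could look like the sketch below: stuff every mail into the system prompt and ask one question per call. The model name is an illustrative pick for a 128k-context model, not a recommendation from this thread, and whether all mails fit is exactly the 20 MB question raised next.

```python
def build_messages(emails: list[dict], question: str) -> list[dict]:
    """Pack the whole mailbox into the prompt of a single chat call.
    Viable only while all mails fit the model's context window."""
    mailbox = "\n\n".join(
        f"Date: {e['Date']}\nFrom: {e['From']}\nTo: {e['To']}\n"
        f"Subject: {e['Subject']}\nBody: {e['Body']}"
        for e in emails
    )
    return [
        {"role": "system",
         "content": "Answer questions using only the e-mails below.\n\n" + mailbox},
        {"role": "user", "content": question},
    ]


def ask(api_key: str, emails: list[dict], question: str) -> str:
    from openai import OpenAI  # pip install openai

    client = OpenAI(api_key=api_key)
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative: any large-context chat model
        messages=build_messages(emails, question),
    )
    return resp.choices[0].message.content
```

Because the full mailbox rides along on every call, this trades per-request token cost for zero upload/index infrastructure; it stops scaling once the mailbox outgrows the context window.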

What do you mean? The size of the JSON?
In that case, I’m currently running on test data where the JSON only contains 100 mails. But in practice the file has 5,000+ mails, so roughly 20 MB.

I’d recommend using Astra DB or some such external vector store + the Assistants API.

As rightly pointed out earlier, OpenAI does take care of chunking, embedding, and vector search.

You can programmatically add new e-mail JSON to OpenAI Files, or to an external source for a decoupled, scalable architecture.

Write yourself a full app in your preferred framework (Django or Next.js) that uses the power of the Assistants API via the simple OpenAI SDKs, and leverage features such as:

  • threading
  • runs
  • polling
  • actions
  • etc …

The source of truth remains an external vector store, which can have CI capabilities with your e-mails. Processing and querying remain within your app via the Assistants API.

More or less, this topic’s thread and your plan are on the right lines. Tweak it, and do share how it goes.

I am currently trying to create a vector store using a PostgreSQL database with the pgvector extension…

I’ll let you know how it goes… If it turns out to be a shot in the dark, I’ll try my hand at Astra DB


Ok here are my results and questions so far:

I created a PostgreSQL database with the pgvector extension using Docker.

Here is the script:

#!/bin/bash

# Step 1: Run the Docker container
docker run -d --name postgres-pgvector -p 5432:5432 -e POSTGRES_PASSWORD=pass123 ankane/pgvector

# Wait until the PostgreSQL server accepts connections (a fixed sleep is flaky)
echo "Waiting for PostgreSQL to start..."
until docker exec -u postgres postgres-pgvector pg_isready -q; do
  sleep 1
done

# Step 2: Create the database
docker exec -u postgres postgres-pgvector psql -c "CREATE DATABASE db_openai_postgres_pgvector;"

# Step 3: Connect to the new database and create the extension
docker exec -u postgres postgres-pgvector psql -d db_openai_postgres_pgvector -c "CREATE EXTENSION vector;"
echo "PostgreSQL with pgvector extension is set up successfully."

As I am using C# .NET 8, I used Semantic Kernel to store my data.
Inspired by the docs, I loop over my mails and save each one to the database.

foreach (var email in emails)
{
    await memory.SaveInformationAsync(
        id: email.Id,
        collection: "emails_test_collection_info",
        text: $@"Date: {email.Date}
Sender: {email.From}
Receiver: {email.To}
Subject: {email.Subject}
Body: {email.Body}");
}

This creates one entry per e-mail in the table emails_test_collection_info, with the columns key, metadata, and embedding:

key: 123@test.de
metadata: {"id": "123@test.de", "text": "Date: 16.05.2024 - 08:41:02\r\nSender: TestUser1 testuser1@test.de\r\nReceiver: TestUser2 testuser2@test.de\r\nSubject: Test Subject 1\r\nBody: This is Test 1", "description": "", "is_reference": false, "additional_metadata": "", "external_source_name": ""}
embedding: [-0.014717411,0.0028034775,0.016732939,…]

key: 321@test.de
metadata: {"id": "321@test.de", "text": "Date: 16.05.2024 - 08:46:24\r\nSender: TestUser2 testuser2@test.de\r\nReceiver: TestUser1 testuser1@test.de\r\nSubject: Test Subject 2\r\nBody: This is Test 2", "description": "", "is_reference": false, "additional_metadata": "", "external_source_name": ""}
embedding: [-0.014715183,0.011425522,0.0034270026,…]

Of course, these are only examples. In my test data, I use a few old company e-mails that are about Easter or the sale of old tablets, for example.

Now I can use this data without problems for queries like
"Give me the mails that are about Easter/tablets."

However, you reach the limits here with queries like
"What was my latest e-mail about?"
or even something like
"How many e-mails do I have in total?"

The similarity search only seems to work on content, not on logical operations such as "filter by date" or general counting…

It must somehow be possible to give the model data so that it can then analyze this data in every conceivable way?

How do you solve something like that?
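One common answer is a hybrid approach: since pgvector lives inside ordinary PostgreSQL, you can keep metadata like the date in real columns next to the embedding, so "newest" and "how many" become plain SQL while topical questions stay vector search. A sketch with illustrative names (`sent_at` is not a column Semantic Kernel creates for you; you would add and populate it yourself, e.g. by parsing the Date field on insert):

```python
# Exact filter in SQL, fuzzy ranking via pgvector's <=> cosine-distance
# operator, combined in one statement over the same table.
HYBRID_QUERY = """
    SELECT key, metadata
    FROM emails_test_collection_info
    WHERE sent_at >= %(since)s
    ORDER BY embedding <=> %(query_vec)s::vector
    LIMIT %(k)s;
"""

# Analytical questions need no embeddings at all:
COUNT_QUERY = "SELECT count(*) FROM emails_test_collection_info;"

NEWEST_QUERY = """
    SELECT key, metadata
    FROM emails_test_collection_info
    ORDER BY sent_at DESC
    LIMIT 1;
"""


def run_hybrid(conn_str: str, query_vec: list[float], since, k: int = 5):
    """Run the hybrid query; query_vec is the embedding of the user's
    question, passed as pgvector's '[x,y,...]' text form."""
    import psycopg  # pip install "psycopg[binary]"

    with psycopg.connect(conn_str) as conn:
        return conn.execute(
            HYBRID_QUERY,
            {"since": since, "query_vec": str(query_vec), "k": k},
        ).fetchall()
```

To route between these modes automatically ("count" questions to SQL, topical questions to the hybrid query), the model's function/tool calling is the usual mechanism: you expose each query as a tool and let the model pick.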