Vector store: is an upsert operation possible?

Is it possible to upsert when adding files to a vector store (as in collection.upsert() for Chroma)?

Can one even name the files in the vector store, so as to determine whether a file already exists and should be updated rather than a new (redundant) record created? I cannot tell from the API docs whether this is possible; it appears that file names (file IDs) are automatically assigned. For example, from the quickstart docs:

# Create a vector store called "Financial Statements"
vector_store = client.beta.vector_stores.create(name="Financial Statements")
 
# Ready the files for upload to OpenAI 
file_paths = ["edgar/goog-10k.pdf", "edgar/brka-10k.txt"]
file_streams = [open(path, "rb") for path in file_paths]
 
# Use the upload and poll SDK helper to upload the files, add them to the vector store,
# and poll the status of the file batch for completion.
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
  vector_store_id=vector_store.id, files=file_streams
)

…so it appears that the user cannot assign names to the files, such that if identically named files are uploaded again to the vector store, they would be updated instead of a new, redundant record being added.

If you’re using Chroma as your vector store, it shouldn’t be saving each individual file in the persist directory; rather, it should chunk and embed each file you’ve given it, then add the result to the persist directory.

For a basic ingestion function (this one assumes a folder of .txt files), this works well:

# Assumes (legacy) LangChain with the OpenAI and Chroma integrations installed
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def persistent_ingest(content_list, persistent_knowledge):
    # Set up the text splitter and embedding model
    chunker = CharacterTextSplitter(chunk_size=512, chunk_overlap=128)
    embeddings = OpenAIEmbeddings()
    # List of all chunks from all documents in the folder
    all_chunks = []
    # Loop through the loaded documents, chunk them, and add them to all_chunks
    for item in content_list:
        contents = chunker.split_documents(item)
        all_chunks.extend(contents)
    # Embed all_chunks using Chroma, then save it to disk as a folder named persistent_knowledge
    vectordb = Chroma.from_documents(
        documents=all_chunks,
        embedding=embeddings,
        persist_directory=persistent_knowledge,
    )
    vectordb.persist()
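
For completeness, here’s roughly how you might call that function on a folder of .txt files; DirectoryLoader/TextLoader, the "docs" folder, and the "persistent_knowledge" directory name are just illustrative, not part of the original snippet:

from langchain.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file in ./docs; loader.load() returns a list of Documents,
# which is what split_documents() expects for each item of content_list
loader = DirectoryLoader("docs", glob="*.txt", loader_cls=TextLoader)
persistent_ingest([loader.load()], "persistent_knowledge")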

Then, in order to retrieve the contents of the persist directory, you can use something as simple as this:

embeddings = OpenAIEmbeddings()
# persistent_directory is the same path you passed in as persistent_knowledge above
vectordb = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

Note that the persist directory doesn’t contain .txt files, but .sqlite3 and .bin files. Here’s a picture of what a persist directory I’ve set up looks like:

[Screenshot: persist directory listing showing .sqlite3 and .bin files]

Because of this, I don’t think you’d want to save a file with a specific name in the vector store, since the vector store isn’t directly saving your files; it’s saving the embeddings that were created from your files.

Perhaps a better method would be to perform a similarity search on the existing persist directory to see if a close match already exists for the contents of your file. This lets you check whether the persist directory already contains the info you’re trying to supply it with (and, if it doesn’t, you can add it).
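
A rough sketch of that check, reusing the vectordb and all_chunks objects from the snippets above; the 0.2 distance threshold is an arbitrary placeholder you’d tune for your own data:

def already_ingested(vectordb, chunks, threshold=0.2):
    """Return True if every chunk already has a close match in the store."""
    for chunk in chunks:
        # similarity_search_with_score returns (Document, distance) pairs;
        # for Chroma's default metric, a lower distance means a closer match
        matches = vectordb.similarity_search_with_score(chunk.page_content, k=1)
        if not matches or matches[0][1] > threshold:
            return False
    return True

# Only add the new chunks if the store doesn't already hold something similar
if not already_ingested(vectordb, all_chunks):
    vectordb.add_documents(all_chunks)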

Thanks @ian.poe. I’m not using Chroma; I want to know if upsert is available (or at least possible to code) for the OpenAI API vector store.

Unavailable. You must code such a feature with your own tracking.

The files uploaded to assistant blob storage have a unique file ID. In the storage system, files also have a file name as part of the return object when you list files (which requires you to list all org files).

When uploading and attaching, there is no enforcement of unique file names or hashing of duplicate or dissimilar files for you. You can upload a file 100 times, each with a unique ID.
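
A quick illustration of that (the client setup and the file path, reused from the quickstart above, are just placeholders):

from openai import OpenAI

client = OpenAI()

# Uploading the same local file twice returns two distinct file IDs,
# even though the filename in both returned objects is identical
f1 = client.files.create(file=open("edgar/goog-10k.pdf", "rb"), purpose="assistants")
f2 = client.files.create(file=open("edgar/goog-10k.pdf", "rb"), purpose="assistants")
print(f1.id, f2.id)              # two different IDs
print(f1.filename, f2.filename)  # same filename both times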

Unlike the source assistant files uploaded to storage, which keep their original filename, the vector store endpoint refers to those files only by their ID. There is no way to work back to a vector store file’s original file name unless you (1) get the entire file list in the vector store, (2) get the entire file list for the organization, and then (3) match them up.

Then an “upsert” would mean removing the old file from the vector store, deleting the old file from storage, uploading the new file, attaching the new file, and polling for readiness if needed.

This means you really need your own database to maintain the purpose of these files and their association with a user, a vector store, or a task, and then to decide who has delete-and-replace rights on an existing file just because of a file name collision.
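
To make that concrete, here is a rough sketch of the delete-and-replace flow using the same beta SDK surface as the quickstart above; matching by filename is my own assumption, and in practice you’d keep your own mapping of filenames (or content hashes) to file IDs instead of listing every org file each time:

import os

def upsert_vector_store_file(client, vector_store_id, local_path):
    """Delete any attached org file with the same filename, then attach a fresh upload."""
    filename = os.path.basename(local_path)

    # 1. File IDs currently attached to the vector store (no filenames here, only IDs)
    vs_file_ids = {f.id for f in client.beta.vector_stores.files.list(vector_store_id=vector_store_id)}

    # 2. Walk the org-wide file list to match those IDs back to filenames
    for f in client.files.list():
        if f.filename == filename and f.id in vs_file_ids:
            # Detach from the vector store, then delete from storage
            client.beta.vector_stores.files.delete(vector_store_id=vector_store_id, file_id=f.id)
            client.files.delete(f.id)

    # 3. Upload the new version, attach it, and poll until processing finishes
    new_file = client.files.create(file=open(local_path, "rb"), purpose="assistants")
    return client.beta.vector_stores.files.create_and_poll(
        vector_store_id=vector_store_id, file_id=new_file.id
    )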


Upsert works great on my local pgvector table in Postgres 🙂
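
For anyone curious what that looks like, here is a minimal sketch with psycopg2 against a pgvector column; the documents table (id text primary key, content text, embedding vector(1536)), the connection string, and the example values are all assumptions:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string

def upsert_chunk(doc_id, content, embedding):
    # pgvector accepts a '[x1,x2,...]' text literal cast to the vector type
    vec = "[" + ",".join(str(x) for x in embedding) + "]"
    # ON CONFLICT on the primary key turns the insert into a true upsert
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO documents (id, content, embedding)
            VALUES (%s, %s, %s::vector)
            ON CONFLICT (id) DO UPDATE
                SET content = EXCLUDED.content,
                    embedding = EXCLUDED.embedding
            """,
            (doc_id, content, vec),
        )

# Running this twice with the same id updates the row instead of duplicating it
upsert_chunk("goog-10k-chunk-0", "example chunk text", [0.01] * 1536)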