Load embedding from disk - Langchain Chroma DB

I am trying to follow the simple example provided by deeplearning.ai in their short course tutorial.
As per the tutorial, the following steps are performed:

  1. load text
  2. split text
  3. Create embedding using OpenAI Embedding API
  4. Load the embedding into Chroma vector DB
  5. Save Chroma DB to disk

I am able to follow the sequence above.
Now I want to start by retrieving the saved embeddings from disk and go straight to the question-answering part, rather than repeat the first four steps every time I run the program.
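For context, the five steps can be sketched roughly as below. This is a hedged sketch only: it assumes langchain, chromadb and an OpenAI API key are available, and the loader choice, chunk sizes and `build_and_persist` name are illustrative, not from the tutorial.

```python
# Hedged sketch of steps 1-5 (loader and chunk sizes are illustrative).

def build_and_persist(text_path, api_key, persist_directory="embeddings"):
    """Load, split, embed and persist documents into a Chroma store."""
    # Imports are local so the sketch reads even without the packages installed.
    from langchain.document_loaders import TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import Chroma

    docs = TextLoader(text_path).load()                              # 1. load text
    splits = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100).split_documents(docs)    # 2. split text
    embedding = OpenAIEmbeddings(openai_api_key=api_key)             # 3. embeddings
    db = Chroma.from_documents(splits, embedding,
                               persist_directory=persist_directory)  # 4. load into Chroma
    db.persist()                                                     # 5. save to disk
    return db
```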

Here are the snippets of code I am using:

vectordb = Chroma(persist_directory="embeddings\\")
print(vectordb._collection.count())

The above code prints 188, which means the data is present, but how do I make use of it? Using the code below

docs = vectordb.similarity_search(question,k=3)

I get the following error:
You must provide embeddings or a function to compute them

Any help on how to define this function, or a pointer to the relevant LangChain embeddings API, would be appreciated.

I’ve been struggling with this same issue for the last week, and I’ve tried nearly everything but can’t get the vector store reconnected after the script shuts down and reconnection is attempted from a new script using the same embeddings and persist directory.
I haven’t found much on the web, but from what I can tell a few others are struggling with the same thing, and everybody says to just go dig into the LangChain source code to figure it out.
Wish someone would just give an answer others could leverage :frowning:

I just gave up on it, no time to solve this unfortunately.

The answer was in the tutorial itself. I had to go through it multiple times, line by line, before I noticed it.
Here is what worked for me:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings(openai_api_key=api_key)
db = Chroma(persist_directory="embeddings\\", embedding_function=embedding)

The embedding_function parameter accepts the OpenAI embeddings object, which serves the purpose.
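Putting the reload and the search together, a minimal sketch looks like this. It assumes langchain and an OpenAI API key; the function name `load_and_search` is mine, while the persist directory and k=3 come from the thread.

```python
# Hedged sketch: reload a persisted Chroma store and query it.

def load_and_search(persist_directory, api_key, question, k=3):
    """Reconnect to a persisted Chroma store and run a similarity search."""
    # Imports are local so the sketch reads even without the packages installed.
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import Chroma

    # The same embedding object must be supplied again so that queries
    # are embedded the same way the stored documents were.
    embedding = OpenAIEmbeddings(openai_api_key=api_key)
    db = Chroma(persist_directory=persist_directory,
                embedding_function=embedding)
    return db.similarity_search(question, k=k)
```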

Hope this helps somebody


If the embeddings are already saved in the persist directory, why do we need to pass the embedding again while loading them? Does it run the embedding function again?

db = Chroma(persist_directory="embeddings\\",embedding_function=embedding)

Hi sheena. You are right that the embedding function is used again. However, it is not used to re-embed the original documents (those can be loaded from disk, as you already found out).

Rather, when you use the vector store to retrieve data relevant to a specific query, the query must be embedded with the same embedding function that was used to embed the original documents. That is why the Chroma constructor expects the embedding_function parameter (I think it is called embedding in recent versions).
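To see why in miniature, here is a toy, langchain-free sketch. The two "embedding functions" are made up for illustration: documents stored under one function but queried through a different one return the wrong neighbour.

```python
# Toy illustration of why queries must use the document embedding function.
import math

def embed_a(text):
    # Hypothetical embedding: (vowel count, consonant count).
    vowels = sum(c in "aeiou" for c in text.lower())
    consonants = sum(c.isalpha() and c not in "aeiou" for c in text.lower())
    return (vowels, consonants)

def embed_b(text):
    # A different hypothetical embedding: the same features, swapped.
    v, c = embed_a(text)
    return (c, v)

def nearest(store, query_vec):
    # Return the stored text whose vector is closest (Euclidean) to the query.
    return min(store, key=lambda t: math.dist(store[t], query_vec))

docs = ["aaa", "zzz"]
store = {d: embed_a(d) for d in docs}    # documents embedded with embed_a

print(nearest(store, embed_a("aaae")))   # same function → "aaa"
print(nearest(store, embed_b("aaae")))   # mismatched function → "zzz"
```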
