Unclear information in Embedding Object when using "text-embedding-3-large" for Embedding Texts for RAG

Hello Community,

I was trying to switch to the new text embedding model “text-embedding-3-large” instead of the old “text-embedding-ada-002”.

Here is how the code looked before:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings(openai_api_key=api_key)
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=pers_dir)
vectordb.persist()

Here is how it looks now - the only thing changed is the model, since I specified “text-embedding-3-large”, resulting in a 3072-dimensional vector instead of a 1536-dimensional one:

embedding = OpenAIEmbeddings(openai_api_key=api_key, model="text-embedding-3-large")
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=pers_dir)
vectordb.persist()
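
As a quick sanity check independent of Chroma, the dimensionality can also be verified directly against the OpenAI API (a minimal sketch, assuming the openai>=1.0 Python client):

# Hedged sanity check: request one embedding directly and measure its length
from openai import OpenAI

client = OpenAI(api_key=api_key)
resp = client.embeddings.create(model="text-embedding-3-large", input="hello world")
print(len(resp.data[0].embedding))  # 3072 for text-embedding-3-large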

Then I inspected the vectordb object using:
vectordb.embeddings.json()
and got the following result:
{"model": "text-embedding-3-large", "deployment": "text-embedding-ada-002", "openai_api_version": "", "openai_api_base": ...

The value for the “model” key is correct, but what is the meaning of the “deployment” key?

After that, I inspected the length of the first element/record to see how many dimensions it has, which is the length of the embedding vector produced by the embedding model:
len(vectordb._collection.get(include=['embeddings'])["embeddings"][0])

The result is 3072, which is correct. But why does the value for the “deployment” key differ from the one for the “model” key when inspecting the vectordb object?

Thanks in advance for answers or inputs.

Hello, I migrated too, but I had to switch back to the ada-002 model because the vector store search was returning very high distance values compared to the ada-002 model for the same ingested documents and the same query. It looks like the similarity score is messed up with the new model. I am using LangChain and Chroma, but I also tried plain Chroma, and the distances returned from semantic search are too different: ada-002 = 0.16 versus text-embedding-3-large = 0.62 (a lot more distant). I tried embedding texts in English and then also Italian texts, but got the same result. Is Chroma the problem?
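
For reference, here is a minimal sketch of how such a comparison can be reproduced with plain chromadb (illustrative only: the document, query, and collection names are placeholders, and the distance space is pinned to cosine, since Chroma otherwise defaults to L2):

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()
for model in ["text-embedding-ada-002", "text-embedding-3-large"]:
    ef = embedding_functions.OpenAIEmbeddingFunction(api_key="sk-...", model_name=model)
    # One throwaway collection per model, using cosine distance explicitly
    col = client.create_collection(
        name="probe_" + model.replace("-", "_").replace(".", "_"),
        embedding_function=ef,
        metadata={"hnsw:space": "cosine"},
    )
    col.add(ids=["doc1"], documents=["The cat sat on the mat."])
    res = col.query(query_texts=["Where did the cat sit?"], n_results=1)
    print(model, res["distances"][0][0])  # cosine distance: lower = more similar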

  1. The “deployment” text value in the embeddings return object is undocumented.

To unravel the mystery of what it could mean, one can only speculate - for example, about what multi-modal alternatives to text embeddings OpenAI could deploy. In practice it looks like a separate field with its own default; see the sketch after this list.

  2. The cosine/dot-product distances will be different with the new models. The scale is more like one would expect, spanning the full 1.0 to 0.0 range, instead of similarities never dropping below roughly 0.6 as with ada-002. Subjects can also be more differentiated. You’ll need to adjust your relevance cutoff value.
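
A hedged look at where that “deployment” value likely comes from: in classic LangChain, OpenAIEmbeddings carries a separate deployment field (meant for Azure OpenAI custom deployment names) whose class-level default is "text-embedding-ada-002", so it keeps that default even when a different model is passed. A minimal sketch; exact field names may vary between LangChain releases:

# Sketch (classic langchain; the deployment field is Azure-only and keeps its default)
from langchain.embeddings import OpenAIEmbeddings

emb = OpenAIEmbeddings(openai_api_key="sk-...", model="text-embedding-3-large")
print(emb.model)       # "text-embedding-3-large" - what is actually sent to the API
print(emb.deployment)  # "text-embedding-ada-002" - untouched class default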

Thanks for the explanation. This should be a point of attention, because many people rely on thresholds in their RAG programs. I will adjust my threshold, expecting that with the new model even the best distances may not go below about 0.6 (i.e., a similarity threshold of about 0.4).
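
A sketch of what that re-tuning could look like with LangChain and Chroma (DISTANCE_CUTOFF is a placeholder, not a recommended value; calibrate it against your own corpus and queries):

# Hedged sketch: filter semantic-search hits by a recalibrated distance cutoff
DISTANCE_CUTOFF = 0.7  # placeholder - new models spread distances over a wider range

results = vectordb.similarity_search_with_score("my test query", k=10)  # (Document, distance) pairs
relevant = [(doc, dist) for doc, dist in results if dist <= DISTANCE_CUTOFF]
for doc, dist in relevant:
    print(f"{dist:.3f}  {doc.page_content[:80]}")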