Why are similarity scores lower with text-embedding-3-small?

We have a tutoring chatbot that relies on embedding-based relevance scoring for user queries. We are in the process of evaluating a migration from text-embedding-ada-002 to text-embedding-3-small. Although changes in cosine similarity values across embedding models are expected, our evaluation indicates that similarity scores produced by text-embedding-3-small are significantly lower and not consistently ordered relative to those from text-embedding-ada-002.

Issue Summary

For the same query–context pairs, cosine similarity scores differ substantially between the legacy model, text-embedding-ada-002, and the newer model, text-embedding-3-small.

In several cases, scores produced by text-embedding-3-small are much lower than those produced by text-embedding-ada-002, and the relative ordering of scores across queries is not preserved between the two models.

This behavior raises concerns that semantic relevance scoring may be altered when migrating from ada-002 to text-embedding-3-small.
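
For reference, relevance is scored as plain cosine similarity between the query embedding and the context embedding. A minimal sketch of that scoring path (OpenAI Python SDK assumed; the `embed` and `cosine_similarity` helpers are illustrative, not our exact pipeline):

```python
# Minimal sketch of how the similarity scores above are produced.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str, model: str) -> np.ndarray:
    """Return the embedding vector for `text` from the given model."""
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

context = (
    "<p>Find the prime factorization of the following number.</p> <p>(15)</p> "
    "<p>Factor (15) into two factors, (3) and (5).</p>"
)
query = "How does this topic connect to other areas of statistics or mathematics?"

for model_name in ("text-embedding-ada-002", "text-embedding-3-small"):
    score = cosine_similarity(embed(query, model_name), embed(context, model_name))
    print(f"{model_name}: {score:.4f}")
```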

Issue Details (With Example)

Context

Question shown to the student: 
<p>Find the prime factorization of the following number.</p> <p>(15)</p> 

Solution shown for the question:
<p>Factor (15) into two factors, (3) and (5).</p>

Queries Evaluated

  1. Query 1:
    “The best statistical software to tackle this problem would be…”
  2. Query 2:
    “How does this concept apply to everyday situations?”
  3. Query 3:
    “How does this topic connect to other areas of statistics or mathematics?”

Cosine Similarity Results

text-embedding-ada-002

| Query | Cosine Similarity |
| --- | --- |
| Query 1 | 0.774218917944234 |
| Query 2 | 0.781920253363479 |
| Query 3 | 0.789893634044595 |

Observation:
Cosine similarity values show a clear increasing trend across the three queries.

text-embedding-3-small

| Query | Cosine Similarity |
| --- | --- |
| Query 1 | 0.247923658700569 |
| Query 2 | 0.195844709264796 |
| Query 3 | 0.217488219437886 |

Observation:
Cosine similarity values are much lower overall and do NOT follow a consistent increasing or decreasing order across the same queries.

Key Observations

  • The absolute cosine similarity scores from text-embedding-3-small are significantly lower than those from text-embedding-ada-002 for the same query–context pairs.
  • The relative ranking of queries by similarity differs between the two models.
  • In ada-002, similarity scores increase monotonically across the example queries.
  • In text-embedding-3-small, similarity scores fluctuate (both rise and fall) across the same queries, even though we expected the same ordering to hold.
  • This inconsistency suggests that semantic relevance interpretation differs substantially between the old and new models; because the absolute score ranges clearly differ, the fairer comparison is by ordering rather than by raw value (see the sketch below).
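
A minimal sketch of that ordering check (scipy assumed; the scores are the three example values from the tables above, rounded):

```python
# Compare the two models by ranking rather than by raw score.
from scipy.stats import spearmanr

ada_002_scores = [0.7742, 0.7819, 0.7899]   # Query 1..3, text-embedding-ada-002
small_3_scores = [0.2479, 0.1958, 0.2175]   # Query 1..3, text-embedding-3-small

rho, p_value = spearmanr(ada_002_scores, small_3_scores)
print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.2f})")
# These three points give rho = -0.5: the orderings genuinely disagree,
# independent of the absolute score ranges. In practice you would run this
# over a much larger labeled evaluation set.
```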

Concern

  • Our existing relevance threshold with text-embedding-ada-002 is 0.7. As part of our migration assessment to text-embedding-3-small, we evaluated multiple queries and arrived at a threshold of 0.2 (a sketch of one way to re-derive such a cutoff per model follows below). We would like to confirm whether this level of variation between thresholds is considered acceptable or expected.
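
One way to re-derive a cutoff per model is to pick the threshold that maximizes F1 over a small labeled set of relevant/irrelevant pairs. A minimal sketch (the helper name and the toy scores are illustrative, not our production pipeline):

```python
# Pick the cutoff that maximizes F1 over a labeled evaluation set.
# `labeled_pairs` holds (similarity_score, is_relevant) tuples computed with
# the target embedding model on your own evaluation data.
import numpy as np

def best_threshold(labeled_pairs):
    scores = np.array([s for s, _ in labeled_pairs])
    labels = np.array([r for _, r in labeled_pairs], dtype=bool)
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):               # candidate cutoffs = observed scores
        predicted = scores >= t
        tp = np.sum(predicted & labels)
        precision = tp / max(predicted.sum(), 1)
        recall = tp / max(labels.sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Toy scores in the text-embedding-3-small range:
pairs = [(0.31, True), (0.27, True), (0.24, True), (0.19, False), (0.12, False)]
print(best_threshold(pairs))   # (0.24, 1.0) on this toy data
```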

Excerpt from a reply:

The dot products you’ll receive are a bit more centered around 0.5. This might have been an engineered decision, but it is similar to the behavior of other providers’ models that have appeared in the 26 months since the release of the embedding-3 models.

You will find that 0.4 is a pretty good threshold, depending on how specific the corpus and queries are. As with any switch to another embeddings provider (or even between the “large” and “small” variants, or when reducing their dimensions), if you want to reject non-correlated results in addition to taking a top-k maximum, you’ll have to do your own tweaking per model, and even per application.
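
To make that per-model distribution point concrete, here is a minimal sketch (not from the original thread) that estimates where unrelated pairs land for each model; the sample texts and helpers are arbitrary placeholders:

```python
# Estimate the "unrelated pair" similarity floor per model, so a rejection
# threshold can be set per model rather than carried over from ada-002.
import itertools
import numpy as np
from openai import OpenAI

client = OpenAI()

unrelated_texts = [
    "Find the prime factorization of 15.",
    "The best statistical software to tackle this problem would be...",
    "Check leaderboards like MTEB to see how models perform.",
    "A recipe for sourdough bread with a long cold fermentation.",
]

def embed_all(texts, model):
    response = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in response.data])

for model_name in ("text-embedding-ada-002", "text-embedding-3-small"):
    vectors = embed_all(unrelated_texts, model_name)
    sims = [
        float(np.dot(a, b))   # OpenAI embeddings are unit length, so dot == cosine
        for a, b in itertools.combinations(vectors, 2)
    ]
    print(f"{model_name}: mean unrelated-pair similarity {np.mean(sims):.3f}")
```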


Oops, did I leave this here?

"""Demonstration of sentence_transformers embeddings with Jina v5 models (1GB/2GB VRAM),
   structured as a basic vector search example as documentation, with procedural scripting.

### Getting Started Local

NVIDIA Kepler, Maxwell, or Pascal (+ Volta) GPU?
Example Pascal GPUs: GeForce GTX 1050 2GB, Quadro P2000 5GB, with video card drivers >=561.17, <580
Use CUDA 12.6; Torch e.g. '2.9.1+cu126' (pinned below) if not Blackwell+

`pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 sentence_transformers xformers --index-url https://download.pytorch.org/whl/cu126`

(held back to torch 2.9 for other ML projects you may encounter)
"""

from time import monotonic as now

s = now()  # script start; used for the elapsed-time prints below
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # ensure updates happen only when you want
os.environ["TRANSFORMERS_OFFLINE"] = "1"

import numpy as np
import torch
from sentence_transformers import SentenceTransformer, SimilarityFunction

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

print(f"[{now()-s:.4f}] imports done")

if not torch.cuda.is_available():
    raise RuntimeError("CUDA not available. Not attempting CPU!")

LARGE_MODEL = "jinaai/jina-embeddings-v5-text-small"  # model.max_seq_length 32768
TRANSFORMERS_MODEL = "jinaai/jina-embeddings-v5-text-nano"  #  model.max_seq_length 8192

model = SentenceTransformer(
    TRANSFORMERS_MODEL,
    device="cuda",
    config_kwargs={
        "normalize_embeddings": True,
    },
    model_kwargs={
        "dtype": torch.float16,  # use bfloat16 and flash attn2 on Blackwell+, try float32 performance also
        "default_task": "retrieval"
    },
    trust_remote_code=True,
    local_files_only=True,        # comment out when first run to download from HF
    similarity_fn_name=SimilarityFunction.DOT_PRODUCT,
    )
print(f"[{now()-s:.4f}] Model loaded")

def clean_model(model_obj):
    """ Unload the model and free memory"""
    del model_obj
    import gc
    gc.collect() # Explicitly run Python garbage collector
    torch.cuda.empty_cache() # Clear cached GPU memory

### --- DEMO DATA ----
queries = [
    "Best local embeddings models to compete with OpenAI?",
]
structured_docs = [
    {
        "metadata": {"filename": "doc1.txt", "chunk_number": 1},
        "text": "Considered a top-tier open-source model, BGE-M3 and its versions often rank highly on benchmarks and offer a cost-effective alternative to OpenAI's models for local execution.",
    },
    {
        "metadata": {"filename": "doc1.txt", "chunk_number": 2},
        "text": "Jina's models offer innovation in open-source text embeddings and can compete with proprietary models on various tasks, including multilingual ones",
    },
    {
        "metadata": {"filename": "doc2.txt", "chunk_number": 1},
        "text": "Check leaderboards like the Massive Text Embedding Benchmark (MTEB) to see how models perform on different tasks, such as general text or specific domains",
    },
    {
        "metadata": {"filename": "doc2.txt", "chunk_number": 2},
        "text": """Jasper and Stella: distillation of SOTA embedding models
We propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple larger teacher embedding models.""",
    },
    {
        "metadata": {"filename": "distracton.txt", "chunk_number": 1},
        "text": """Stella Artois to branch and embed deeply in the distilled spirits vertical.""",
    },
]

## -- GPU-powered search example -- ##

CUTOFF_THRESHOLD = 0.2
top_k = min(5, len(structured_docs))

print(f"[{(s:=now())-s:.4f}] Encoding documents")
doc_texts = [doc["text"] for doc in structured_docs]

doc_embeddings = model.encode(
    sentences=doc_texts,
    convert_to_tensor=True,
).to("cuda")
print(f"[{now()-s:.4f}] Encoded {len(doc_texts)} docs, shape: {doc_embeddings.shape}")

vector_database = [
    {"text": doc["text"], "metadata": doc["metadata"], "embedding": doc_embeddings[i]}
    for i, doc in enumerate(structured_docs)
]

query_text_for_search = queries[0]
print(f"[{now()-s:.4f}] Encoding query: '{query_text_for_search}'")
query_embedding = model.encode(
    sentences=[query_text_for_search],
    convert_to_tensor=True,
    task="retrieval",
    prompt_name="query",
).to("cuda")
print(f"[{now()-s:.4f}] Query encoded")

# ✅ Stack individual doc tensors → (N, dim) matrix, then compare against query
db_embeddings_tensor = torch.stack([doc["embedding"] for doc in vector_database])  # (N, dim)
similarities_tensor = model.similarity(query_embedding, db_embeddings_tensor)      # (1, N)
similarities_list = similarities_tensor[0].tolist()                                 # back to plain list

search_results = sorted(
    [
        {"score": similarities_list[i], "text": entry["text"], "metadata": entry["metadata"]}
        for i, entry in enumerate(vector_database)
    ],
    key=lambda x: x["score"],
    reverse=True,
)
print(f"[{now()-s:.4f}] Search completed")

print(f"\n--- Search Results (Threshold: {CUTOFF_THRESHOLD:.2f}) ---")
rank, found = 1, False
for result in search_results[:top_k]:
    if result["score"] >= CUTOFF_THRESHOLD:
        found = True
        truncated = result["text"][:80] + "..." if len(result["text"]) > 80 else result["text"]
        print(f"Rank {rank}: Score {result['score']:.4f}")
        print(f"  Metadata: {result['metadata']}")
        print(f"  Text: {truncated}")
        print("-" * 20)
        rank += 1

if not found:
    print("No results found above the specified threshold.")

This could also be useful:

def count_tokens(model, text):
    """Report how many tokens `text` occupies for the loaded SentenceTransformer."""
    tokenizer = model._first_module().tokenizer
    encoded_inputs = tokenizer(
        [text],
        padding=True,
        truncation=True,
        max_length=model.max_seq_length,  # truncate at the model's own max length
        return_tensors='pt',              # return PyTorch tensors
    )
    token_lengths = encoded_inputs['input_ids'].shape[1]
    print(f"Max Sequence Length: {model.max_seq_length}")
    print(f"Tokenized Sequence Length (with padding): {token_lengths}")
    return token_lengths
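
Hypothetical usage with the model and demo data loaded above:

```python
# Check how many tokens the first demo chunk consumes for this model.
n_tokens = count_tokens(model, structured_docs[0]["text"])
```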

FWIW, I was disappointed with text-embedding-3-small and found it less useful. I have stuck with text-embedding-ada-002.

I’m slightly surprised we’ve not seen more embedding models.


I feel the same way—I would have preferred to stick with text-embedding-ada-002. However, since no new models have been announced and there’s a concern that it might be deprecated soon, I’m considering upgrading.


Judging by current performance, and based on what I imagine the adoption levels are, I really doubt they will decommission ada-002 anytime soon without a really good replacement. It probably runs on some very reasonable hardware too - they are probably grateful for that - and it may even turn a profit!


I hope you’re right—and I agree that from an infrastructure and cost‑efficiency point of view, ada‑002 likely has a strong profile. I also anticipate that ada‑002 will stick around until a genuinely better replacement is available.
For now, we’re sticking with ada‑002 where performance matters most, though we may still need to plan a move to 3‑small, since unexpected model availability changes are risky for production systems that rely on stable embeddings.
