Inconsistent Embedding Results for my dataset

I am using OpenAI’s text-embedding-3-large model to generate embeddings for a dataset of 3000 questions and their corresponding answers. The process works well for the first 2000 questions, where the embeddings generated are accurate and map correctly to their respective answers. However, after around 2000 questions, retrieval quality degrades significantly. The model starts generating incorrect answers for the questions, even when questions from the dataset are repeated exactly.


You say “the model starts generating incorrect answers”, but you should look more closely at the results returned by the vector database and the ranker that provides that grounding. Inspect the similarity scores and the top-ranked results to see what your input text (whether from user input, an AI function, or rewriting) actually retrieves from the vector database.
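To make that inspection concrete, here is a minimal sketch of logging the top-ranked matches and their cosine scores. It assumes your stored embeddings sit in a NumPy matrix; the function name and toy vectors are my own illustration, not any specific vector-database API:

```python
import numpy as np

def top_k_matches(query_vec, corpus_vecs, k=3):
    """Return (index, cosine score) pairs for the k best corpus matches."""
    # Normalize rows so a plain dot product equals cosine similarity
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm
    # Sort descending and keep the k best
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy corpus standing in for your stored question embeddings
corpus = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([0.9, 0.1])
for idx, score in top_k_matches(query, corpus):
    print(idx, round(score, 3))
```

Printing the scores alongside the retrieved texts quickly shows whether the later questions are losing to near-duplicates with almost identical similarity.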

Naturally, the larger the corpus, the more potential returns there are that are not the target. Semantic similarity matches two strings using a language AI’s understanding of many aspects of similarity other than “is this the answer”, so you can also embed just the portion that will look like a user input, or create a preliminary answer for a user question so that it looks like the embedded search text.

If you want more keyword-like searching, you can incorporate traditional search to pare down the results before or after the embeddings step, or to re-rank them. Or you can integrate it into the weighting when you are doing your own similarity computations:


```python
import numpy as np
from rapidfuzz import fuzz  # or FuzzyWuzzy; both return 0-100 scores

def combined_similarity(a_vector, b_vector, a_string, b_string, weight=0.75):
    # 1. Calculate the cosine similarity score based on vector embeddings
    cosine_similarity = np.dot(a_vector, b_vector) / (
        np.linalg.norm(a_vector) * np.linalg.norm(b_vector)
    )  # range: 0 - 1

    # 2. Calculate the fuzzy similarity score between the two strings
    fuzzy_score = fuzz.ratio(a_string, b_string)  # range: 0 - 100 or 0 - 1 (depends on method)

    # If fuzzy_score range is 0 - 100 (e.g., FuzzyWuzzy), normalize it to a 0 - 1 range
    fuzzy_score /= 100

    # 3. Interpolate the combined similarity based on the weight
    # weight: 0 - 1, where 1 is only embeddings
    combined_similarity = (weight * cosine_similarity) + ((1 - weight) * fuzzy_score)

    # Output the final interpolated similarity score
    return combined_similarity
```

This uses one of several fuzzy algorithmic search libraries that return a score.
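As a sketch of the “pare down before embeddings” idea, a simple keyword-overlap prefilter (a toy stand-in for a real lexical search such as BM25) can shrink the candidate set before any vector comparison; the scoring here is my own illustration, not a production ranking:

```python
def keyword_prefilter(query, documents, min_overlap=1):
    """Keep only documents sharing at least min_overlap words with the query."""
    query_terms = set(query.lower().split())
    survivors = []
    for i, doc in enumerate(documents):
        overlap = len(query_terms & set(doc.lower().split()))
        if overlap >= min_overlap:
            survivors.append((i, overlap))
    # Rank the survivors by how many query terms they share
    return sorted(survivors, key=lambda pair: pair[1], reverse=True)

docs = [
    "How do I reset my password",
    "Shipping times for international orders",
    "Password reset link not arriving",
]
print(keyword_prefilter("reset password email", docs))
```

Only the surviving documents would then go through the embedding-similarity (or combined) scoring above, which keeps off-topic near-matches out of the ranked list entirely.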

So depending on what you mean by “after processing 2000 questions”, there may be other techniques to improve how well the input stimulus matches its desired grounding chunks.
