Inconsistent Embedding Results for my dataset

I am using OpenAI’s text-embedding-3-large model to generate embeddings for a dataset of 3000 questions and their corresponding answers. The process works well for the first 2000 questions, where the embeddings generated are accurate and map correctly to their respective answers. However, after around 2000 questions, retrieval quality degrades significantly. The model starts generating incorrect answers for the questions, even when questions from the dataset are repeated exactly.


You say “the model starts generating incorrect answers”, but you should look more closely at the results returned by the vector database and the ranker that provides that grounding. Inspect the similarity scores and the top-ranked results to see what your input text (whether from user input, an AI function, or rewriting) actually retrieves from the vector database.
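To make that inspection concrete, here is a minimal sketch of logging the top-ranked matches and their cosine scores. It assumes your stored embeddings sit in a NumPy matrix; the function name and toy vectors are my own illustration, not any specific vector-database API:

```python
import numpy as np

def top_k_matches(query_vec, corpus_vecs, k=3):
    """Return (index, cosine score) pairs for the k best corpus matches."""
    # Normalize rows so a plain dot product equals cosine similarity
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm
    # Sort descending and keep the k best
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy corpus standing in for your stored question embeddings
corpus = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([0.9, 0.1])
for idx, score in top_k_matches(query, corpus):
    print(idx, round(score, 3))
```

Printing the scores alongside the retrieved texts quickly shows whether the later questions are losing to near-duplicates with almost identical similarity.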

Naturally, the larger the corpus, the more potential returns there are that are not the target. Semantic similarity matches two strings using a language AI’s understanding of many aspects of similarity other than “is this the answer”, so you can also embed just the portion that will look like a user input, or create a preliminary answer for a user question so that it looks like the embedded search text.

If you want more keyword-like searching, you can incorporate traditional search to pare down the results before or after the embeddings step, or to re-rank them. Or you can integrate it into the weighting when you are doing your own similarity computations:


```python
import numpy as np
from rapidfuzz import fuzz  # or FuzzyWuzzy; both return 0-100 scores

def combined_similarity(a_vector, b_vector, a_string, b_string, weight=0.75):
    # 1. Calculate the cosine similarity score based on vector embeddings
    cosine_similarity = np.dot(a_vector, b_vector) / (
        np.linalg.norm(a_vector) * np.linalg.norm(b_vector)
    )  # range: 0 - 1

    # 2. Calculate the fuzzy similarity score between the two strings
    fuzzy_score = fuzz.ratio(a_string, b_string)  # range: 0 - 100 or 0 - 1 (depends on method)

    # If fuzzy_score range is 0 - 100 (e.g., FuzzyWuzzy), normalize it to a 0 - 1 range
    fuzzy_score /= 100

    # 3. Interpolate the combined similarity based on the weight
    # weight: 0 - 1, where 1 is only embeddings
    combined_similarity = (weight * cosine_similarity) + ((1 - weight) * fuzzy_score)

    # Output the final interpolated similarity score
    return combined_similarity
```

This uses one of several fuzzy algorithmic search libraries that return a score.
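As a sketch of the “pare down before embeddings” idea, a simple keyword-overlap prefilter (a toy stand-in for a real lexical search such as BM25) can shrink the candidate set before any vector comparison; the scoring here is my own illustration, not a production ranking:

```python
def keyword_prefilter(query, documents, min_overlap=1):
    """Keep only documents sharing at least min_overlap words with the query."""
    query_terms = set(query.lower().split())
    survivors = []
    for i, doc in enumerate(documents):
        overlap = len(query_terms & set(doc.lower().split()))
        if overlap >= min_overlap:
            survivors.append((i, overlap))
    # Rank the survivors by how many query terms they share
    return sorted(survivors, key=lambda pair: pair[1], reverse=True)

docs = [
    "How do I reset my password",
    "Shipping times for international orders",
    "Password reset link not arriving",
]
print(keyword_prefilter("reset password email", docs))
```

Only the surviving documents would then go through the embedding-similarity (or combined) scoring above, which keeps off-topic near-matches out of the ranked list entirely.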

So depending on what you mean by “after processing 2000 questions”, there may be other techniques to improve how well the input stimulus matches its desired grounding chunks.
