Embedding results using text-embedding-ada-002

I am trying to run Q/A using embeddings as recommended by OpenAI at Question answering using embeddings-based search | OpenAI Cookbook

I am using the text-embedding-ada-002 model for embeddings.

I couldn’t really get this working.
I have a context passage whose embedding doesn’t come out as similar to the question embedding, even though the answer is in it.

For example, Context Paragraph:

2 ABOUT THE AUTHOR The common thread running through Allen Carr’s work is the removal of fear. Indeed, his genius lies in eliminating the phobias and anxieties which prevent people from being able to enjoy life to the full, as his bestselling books Allen Carr’s Easy Way to Stop Smoking, The Only Way to Stop Smoking Permanently, Allen Carr’s Easyweigh to Lose Weight, How to Stop Your Child Smoking, and now The Easy Way to Enjoy Flying, vividly demonstrate. A successful accountant, Allen Carr’s hundred-cigarettes-a-day addiction was driving him to despair until, in 1983, after countless failed attempts to quit, he finally discovered what the world had been waiting for —the Easy Way to Stop Smoking. He has now built a network of clinics that span the globe and has a phenomenal reputation for success in helping smokers to quit. His books have been published in over twenty different languages and video, audio and CD-ROM versions of his method are also available. Tens of thousands of people have attended Allen Carr’s clinics where, with a success rate of over 95%. he guarantees that you will find it easy to quit smoking or your money back. A full list of clinics appears in the back of this book. Should you require any assistance do not hesitate to contact your nearest therapist. Weight-control sessions are now offered at a selection of these clinics. A full corporate service is also available enabling companies to implement no-smoking policies simply and effectively. All correspondence and enquiries about ALLEN CARR’S BOOKS, VIDEOS, AUDIO TAPES AND CD-ROMS should be addressed to the London Clinic.

And Question is: what do we know about the author? What is his background?

I attempted to calculate similarity using the Distance.Cosine, Distance.Manhattan and Distance.Euclidean approaches. Even after going over all the paragraphs provided, the top-ranked ones were irrelevant and the score for the most appropriate paragraph was quite low. For example, the relevant paragraph above came 155th out of 163 paragraphs in total.

Any idea?

1 Like

I’m using cosine similarity on my searches and it’s working well. Is this similar to your method?

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two vectors divided by the product of their norms.
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity
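
To rank your paragraphs you’d then score each one against the question and sort, something like this (question_embedding, paragraph_embeddings and paragraphs here are just placeholders for whatever you got back from the embeddings endpoint):

# Placeholder names: question_embedding is the embedding of the question,
# paragraph_embeddings / paragraphs are the embeddings and texts of your chunks.
scores = [cosine_similarity(question_embedding, emb) for emb in paragraph_embeddings]
ranking = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)
top_paragraphs = [paragraphs[i] for i in ranking[:5]]   # highest similarity first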
1 Like

The OpenAI lib provides its own cosine similarity helper as well:

from openai.embeddings_utils import get_embedding, cosine_similarity

The OpenAI docs have samples on how to use it. It might be that the code you have written is not in line with what they recommend, and that might be the cause of your issue.
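
Roughly along these lines (illustrative only; paragraphs is a placeholder, and depending on your version of the openai library the keyword argument is engine= or model=):

from openai.embeddings_utils import get_embedding, cosine_similarity

# Illustrative: embed the question and each paragraph with ada-002, then rank.
# Older 0.x versions of the library take engine= here; newer 0.x versions take model=.
question_embedding = get_embedding("what do we know about the author?", engine="text-embedding-ada-002")
paragraph_embeddings = [get_embedding(p, engine="text-embedding-ada-002") for p in paragraphs]
ranked = sorted(zip(paragraphs, paragraph_embeddings),
                key=lambda pair: cosine_similarity(question_embedding, pair[1]),
                reverse=True)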

2 Likes

No, I wasn’t using this. I was using the math library in C# (MathNet.Numerics’ Distance class) to compute Distance.Cosine.

I have now tried yours and it is quite close. The relevant paragraph has jumped to position 3 out of 163. :grinning:

It looks like the library I was using to compute was the issue. Thank you.

1 Like

Oh okay. I didn’t realise that :grinning:. Thanks.

The issue was with the method I was using to compute the similarity between the two vectors.
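
(For anyone hitting the same thing: if I understand it right, MathNet’s Distance.Cosine returns a cosine distance, i.e. 1 minus the similarity, where smaller means more similar, so ranking it as if it were a similarity roughly reverses the order. A quick check in Python:)

import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.0])
b = np.array([0.9, 0.1])   # almost the same direction as a
c = np.array([0.0, 1.0])   # orthogonal to a

# Similarity: higher is closer, so b correctly beats c (~0.99 vs 0.0).
print(cosine_similarity(a, b), cosine_similarity(a, c))
# Distance (1 - similarity): lower is closer; sorting it descending would pick c instead.
print(1 - cosine_similarity(a, b), 1 - cosine_similarity(a, c))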

1 Like

OpenAI’s dot product should work fine. But I see that your chunk size is HUGE - and that is going to affect your relevancy. (PS: If you use something like Pinecone, their query function takes care of the dot product for you)
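
For ada-002 the vectors come back (approximately) unit length, so ranking by the plain dot product matches ranking by cosine similarity. Roughly:

import numpy as np

# ada-002 embeddings are (to numerical precision) normalized to length 1,
# so ranking by dot product gives the same order as cosine similarity.
def rank_by_dot_product(question_embedding, paragraph_embeddings):
    scores = np.array(paragraph_embeddings) @ np.array(question_embedding)
    return np.argsort(-scores)   # paragraph indices, most similar first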

1 Like

That’s excellent, glad I could help!

I have no idea :(. I used to think that a fairly large chunk might carry more context. Now I am interested in trying a smaller chunk size. What should the ideal chunk size be? Is there any standard in terms of words, lines or paragraphs? Thanks.

Hey @Not_Wrogn. I shared some insights here: Embedding - text length vs accuracy? - #5 by AgusPG. The two-step semantic search approach works reasonably well to deal with the long/short chunk dilemma. Hope it helps! :slight_smile:
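
Very roughly, one way to set that up (purely illustrative names; the linked post has the actual reasoning): score small chunks for retrieval, then pass the model the larger parent passage the winning chunks came from.

# Illustrative sketch only: small chunks carry the embeddings, and each chunk
# remembers the longer parent passage that gets handed to the model.
# cosine_similarity is the same helper as earlier in the thread.
def two_step_search(question_embedding, chunks, top_k=5):
    # chunks: list of dicts like {"embedding": [...], "parent": "long passage text"}
    scored = sorted(chunks,
                    key=lambda c: cosine_similarity(question_embedding, c["embedding"]),
                    reverse=True)
    seen, context = set(), []
    for c in scored[:top_k]:
        if c["parent"] not in seen:       # keep each parent passage once
            seen.add(c["parent"])
            context.append(c["parent"])
    return context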

2 Likes

Hi @AgusPG, that’s very clever. I think it should work. I am going to copy that approach. Thanks for the guidance :slight_smile:

1 Like

In their reference implementation, OpenAI seems to suggest 200 tokens. See: https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py - but this is something you will need to play with (given your use case).
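
A minimal version of token-based chunking looks something like this (sketch only, assuming the tiktoken tokenizer; the plugin’s real implementation handles sentence boundaries and other edge cases):

import tiktoken

def chunk_text(text, chunk_size=200):
    # Split text into chunks of roughly chunk_size tokens (cl100k_base is the
    # tokenizer used by text-embedding-ada-002).
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]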

I think the two-step approach by @AgusPG is much better than a static value.

2 Likes