Some questions about text-embedding-ada-002’s embedding

We want to use the embedding generated by the text-embedding-ada-002 model for some search operations in our business, but we encountered a problem when using it. Here are two texts.

text1: I need to solve the problem with money
text2: Anything you would like to share?

Following is the code:

import numpy as np
import openai

model = "text-embedding-ada-002"
emb = openai.Embedding.create(input=[text1, text2], engine=model, request_timeout=3)
emb1 = np.asarray(emb.data[0]["embedding"])
emb2 = np.asarray(emb.data[1]["embedding"])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

score = cosine_similarity(emb1, emb2)
# score: 0.7486107694309302

Semantically, these two sentences are quite different, so my question is: shouldn't their similarity score be much lower?
We also tested these two sentences' similarity with the model 'all-MiniLM-L6-v2' on Hugging Face; the score is 0.02920079231262207.
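
For reference, a minimal Python sketch of how that comparison can be reproduced (assuming the sentence-transformers package is installed; the exact score can vary slightly by version):

from sentence_transformers import SentenceTransformer, util

# Reproduce the all-MiniLM-L6-v2 comparison mentioned above.
model = SentenceTransformer("all-MiniLM-L6-v2")
text1 = "I need to solve the problem with money"
text2 = "Anything you would like to share?"
emb1, emb2 = model.encode([text1, text2])
print(util.cos_sim(emb1, emb2).item())  # roughly 0.03 in the test above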

I really need some help with how to use the embeddings from the 'text-embedding-ada-002' model.

10 Likes

The engine doesn’t seem to have a wide angular distribution from my experience. And I have no idea why either. I posted another thread on this last week. I don’t think anyone knows why, so I was thinking of doing a deeper dive.

As background, I embedded 80k random texts and phrases. If I pick one at random and run a vector search for the top 10 most opposite texts, I get cosine similarities similar to the one you had, which is around 45 degrees. It totally had me wondering if the model has much of its vector space dedicated to other things.

Now, one thing in my case, and maybe it is the same for you, is that all the texts are relatively short. If the model covers more of the vector space as the length of the text grows, that might explain it.

But intuitively, you would think it embeds on semantic similarity, not length. However, when looking at the closest texts, it will return things that are related, but where the sentiment could be opposite. This isn't a big deal to me, but I found that one interesting too.
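
For context, the kind of check I am describing looks roughly like this (a minimal numpy sketch; corpus_vecs is assumed to be an (N, 1536) array of ada-002 embeddings, one row per text):

import numpy as np

def most_opposite(query_vec, corpus_vecs, k=10):
    # ada-002 vectors come back unit length, so the dot product is the cosine similarity.
    sims = corpus_vecs @ query_vec
    idx = np.argsort(sims)[:k]   # the k smallest similarities = the most "opposite" texts
    return idx, sims[idx]

# In my 80k-text experiment, even these "most opposite" hits still land around
# 0.7 cosine similarity, i.e. roughly 45 degrees apart.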

3 Likes

I think what might be confusing about OpenAI embeddings is that the embedding vector for a phrase like “Anything you would like to share?” is based on an OpenAI model derived from text on the global internet. The same is true for the embedding vector for “I need to solve the problem with money”; the vector is derived from the OpenAI ANN combined with a particular training model.

The embeddings (vectors) are not based on a direct analysis of text, but on the OpenAI model based on the huge dataset used in the ANN. This is, at least, my current understanding.

So, using some Ruby code I cobbled together (using my own cosine similarity function, not from a library), let’s look at this:

irb(main):013:0> Embeddings.test_strings("I need to solve the problem with money","Anything you would like to share?")
=> 0.7614775318811315

irb(main):014:0> Embeddings.test_strings("I need to solve the problem with money","What is your financial situation?")
=> 0.8475256263838489

irb(main):015:0> Embeddings.test_strings("I need to solve the problem with money","Fraud")
=> 0.7632965853455049

irb(main):016:0> Embeddings.test_strings("I need to solve the problem with money","CitiBank")
=> 0.7823379047316411

If we rank these, the most similar are, in descending order:

  1. “What is your financial situation?”
  2. “CitiBank”
  3. “Fraud”
  4. “Anything you would like to share?”

This ranking makes perfect sense to me, in terms of similarity to “I need to solve the problem with money”.

So, based on what we might expect to see on the global internet, the above cosine similarities of embedding vectors from text-embedding-ada-002 seem normal to me.

4 Likes

The embeddings make sense from a clustering standpoint, sure.

But what is weird about them, as your example shows, is that the cosine similarity only varies from about 0.7 to 1, so its overall range is 0.3. It should range from -1 to +1 – an overall range of 2. So it is using only 15% of its range!

This became obvious when I was looking for opposite matches (cosine similarity of -1) or orthogonal matches (cosine similarity of 0). The most dissimilar it got was 0.7, which intuitively is still correlated (still positive).

It’s probably not that big of a deal, since there should still be plenty of dynamic range in the floats used in the vectors, but don’t expect opposite or orthogonal results in the traditional mathematical sense from this embedding space.
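
If you need scores that use the whole 0-to-1 range for thresholds or display, one workaround is to rescale linearly. A sketch, under the assumption that ada-002 scores on your data mostly fall between roughly 0.7 and 1.0 (measure the band on your own corpus first):

def rescale(cos_sim, lo=0.70, hi=1.00):
    # Map an ada-002 cosine similarity from the observed [lo, hi] band onto [0, 1].
    # lo and hi are assumptions taken from the scores seen in this thread.
    return max(0.0, min(1.0, (cos_sim - lo) / (hi - lo)))

print(rescale(0.7486))  # ~0.16 for the two sentences in the original question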

2 Likes

Yeah,

I was also disappointed in the lack of dynamic range.

Even when I made wild comparisons the dynamic range was limited and not satisfying.

2 Likes

Look at these examples (third param is the method, in this case the dot product). Objectively speaking, these “embeddings” seem to have ranging issues:

irb(main):002:0> Embeddings.test_strings("dog", "god", "dot")
=> 0.8552966723422395
irb(main):003:0> Embeddings.test_strings("dog", "quantum", "dot")
=> 0.8310934626953179
irb(main):004:0> Embeddings.test_strings("cat", "quantum", "dot")
=> 0.8109278089334362
irb(main):005:0> Embeddings.test_strings("cat", "rock climbing", "dot")
=> 0.7889013858753511
irb(main):006:0> Embeddings.test_strings("nebula", "rock climbing", "dot")
=> 0.7940156679628004
irb(main):007:0> Embeddings.test_strings("open water diver", "dead monkey", "dot")
=> 0.812577906739496
irb(main):008:0> Embeddings.test_strings("dlkhalkjlk bhalkdkjfdlk blahdkfhdklsflk", "blah blah", "dot")
=> 0.8374675716884794
irb(main):010:0> Embeddings.test_strings("dlkhalkjlk bhalkdkjfdlk blahdkfhdklsflk", "z", "dot")
=> 0.675489759381021
2 Likes

Just testing, I found the Euclidean distance provides an interesting (larger) dynamic range for comparing these OpenAI embedding vectors, where the smaller the result, the more similar the vectors:

irb(main):007:0> Embeddings.test_strings("dog", "god", "euclidean")
=> 0.5379652623951475
irb(main):008:0>  Embeddings.test_strings("dog", "quantum", "euclidean")
=> 0.5812168842546647
irb(main):014:0>  Embeddings.test_strings("dog", "planetoid", "euclidean")
=> 0.6045222548292233
irb(main):015:0>  Embeddings.test_strings("dog", "puppy", "euclidean")
=> 0.3872980050569614
irb(main):016:0>  Embeddings.test_strings("dog", "cat", "euclidean")
=> 0.40881582463070715
irb(main):017:0>  Embeddings.test_strings("dog", "quantum singularity", "euclidean")
=> 0.6801458005499244
irb(main):019:0>  Embeddings.test_strings("dog", "bird", "euclidean")
=> 0.4752054716384041
irb(main):020:0> Embeddings.test_strings("dlkhalkjlk bhalkdkjfdlk blahdkfhdklsflk", "z", "euclidean")
=> 0.8056180961698471

At first glance, it seems the Euclidean distance may be a useful alternative to the dot product and the cosine similarity when comparing OpenAI embedding vectors.

Still researching and testing …


   def self.test_strings(string1,string2,method='dot')
      client = get_client
      output = 0.0
      return "Client Not Available" unless client.present? 
      return "Strings Not Available, String1=#{string1}, String2=#{string2}" unless (string1.present? && string2.present?)
      response1 = get_vector(client,string1)
      response2 = get_vector(client,string2)
      if method == 'dot'
         output = dot_product(response1, response2)
      elsif method ==  'euclidean'
         output = euclidean_distance(response1, response2)
      elsif method == 'angular'
         output = angular_distance(response1, response2)
      elsif method == 'manhattan'
         output = manhattan_distance(response1, response2)
      else # default to cosine
         output = cosine_similarity(response1, response2) 
      end
      output
   end

PS: Yes, the dot product and the cosine similarity give the same results (within rounding error) for unit vectors, and the angular distance is just a monotonic transform of them, so all three rank pairs identically. Personally, after testing these simple words and phrases, the Euclidean distance seems promising, but I need to test with a larger and more complex data set, of course.
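
For unit-length vectors (ada-002 embeddings come back normalized), these metrics are tied together algebraically, which is why they all produce the same ranking. A small numpy sketch of the relationship:

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(1536), rng.standard_normal(1536)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # unit vectors, like ada-002 output

cos = float(np.dot(a, b))                             # cosine similarity == dot product here
euclid = float(np.linalg.norm(a - b))                 # Euclidean distance
angle = float(np.arccos(np.clip(cos, -1.0, 1.0)))     # angular distance in radians

# For unit vectors, euclid == sqrt(2 - 2*cos), so the Euclidean distance is a
# monotonic function of the cosine similarity and preserves the ranking.
print(np.isclose(euclid, np.sqrt(2 - 2 * cos)))       # True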

See also, from ChattyGPT :slight_smile:

5 Likes

I love all the experiments that you are showing here. It seems that, even though there is more dynamic range using Euclidean distance, at most it is 3 dB (a factor of 2). The floats themselves give roughly 6 dB per bit, or 6 * 64 ≈ 384 dB, so call it at least 300 dB! Also, don't forget the dot product acts as a low-pass filter and supplies additional resolution (dynamic range). But we are constrained to 64 bits in most environments these days, so I'm gonna say 300 dB. Which is insane. Your phone might have 60 dB at best. It's almost as if the dynamic range is a non-issue. But what drives me nuts is the geometrical interpretation! I feel like OpenAI is reserving the rest of the embedding space for other domains and not telling us. But in my test data I had all kinds of languages, Russian, Chinese, emojis, etc., and didn't see much movement in the embeddings.

So there is one conclusion, it must be some alien language they are holding back! :slight_smile: Seriously, I can't figure this out. Others have reached out to me and can't figure it out either. Something serious to learn here, I feel. Either that or they messed up; either way it would be interesting. Especially since this is the embedding model that replaces all previous models!

4 Likes

Thanks @curt.kennedy, experimenting and lab work is fun, for sure.

Funny, it’s a community for OpenAI developers, so like you, I am here for the “developing” and the “fun” :slight_smile: but the community seems awash in user customer support queries and complaints these days, haha

Maybe I’ll create a free public website where devs can test and compare these methods when I get my antique teak wood floors re-sanded and sealed. Time to put on the 3M mask and get to sanding.

Note: If the OpenAI staff would kindly grant me some (leader) privs here, I would create categories for customer support and try to help organize this great forum so the developers are not inundated with customer service requests (and vice-versa), as we see these days. It’s funny (or annoying, depending on how we look at it) how a majority of people here confuse ChatGPT with the API and vice-versa. We need mods / leaders to not be OpenAI staff so community members can change categories and keep this great community a bit more tidy…

Talk later, Curt.

3 Likes

Further testing over breakfast: at least for this simple test comparing four methods with simple phrases, the Euclidean distance seems the best match for getting a suitable dynamic range:

MacStudio:openai tim$ rails c
Loading development environment (Rails 7.0.4)
irb(main):001:0> require_relative('./lib/openai_embeddings')
=> true                                                             
irb(main):012:0> methods=["dot","cosine","euclidean","manhattan"]
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):013:1* methods.each do |method|
irb(main):014:1*   Embeddings.test_strings({string1:"dog",string2:"intergalactic penal colony",method:method})
irb(main):015:0> end
String1=dog, String2=intergalactic penal colony, Method=dot, Output=0.7234181954241339
String1=dog, String2=intergalactic penal colony, Method=cosine, Output=0.7234182114732238                        
String1=dog, String2=intergalactic penal colony, Method=euclidean, Output=0.743749665399304                      
String1=dog, String2=intergalactic penal colony, Method=manhattan, Output=18.737472161003208                     
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):016:1* methods.each do |method|
irb(main):017:1*   Embeddings.test_strings({string1:"dog",string2:"cat",method:method})
irb(main):018:0> end
String1=dog, String2=cat, Method=dot, Output=0.9164348051073471
String1=dog, String2=cat, Method=cosine, Output=0.9164348102929122
String1=dog, String2=cat, Method=euclidean, Output=0.40881582463070715
String1=dog, String2=cat, Method=manhattan, Output=10.538499585644807
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):019:1* methods.each do |method|
irb(main):020:1*   Embeddings.test_strings({string1:"dog",string2:"dog",method:method})
irb(main):021:0> end
String1=dog, String2=dog, Method=dot, Output=0.9999999483598027
String1=dog, String2=dog, Method=cosine, Output=1.0000000000000002
String1=dog, String2=dog, Method=euclidean, Output=0.0
String1=dog, String2=dog, Method=manhattan, Output=0.0
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):022:1* methods.each do |method|
irb(main):023:1*   Embeddings.test_strings({string1:"dog",string2:"cnn news",method:method})
irb(main):024:0> end
String1=dog, String2=cnn news, Method=dot, Output=0.8004253665328107
String1=dog, String2=cnn news, Method=cosine, Output=0.8004253608912667           
String1=dog, String2=cnn news, Method=euclidean, Output=0.6317826216593369        
String1=dog, String2=cnn news, Method=manhattan, Output=16.038694799962787        
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):025:1* methods.each do |method|
irb(main):026:1*   Embeddings.test_strings({string1:"dog",string2:"puppy",method:method})
irb(main):027:0> end
String1=dog, String2=puppy, Method=dot, Output=0.9250001346002894
String1=dog, String2=puppy, Method=cosine, Output=0.9250001281615122              
String1=dog, String2=puppy, Method=euclidean, Output=0.3872980050569614           
String1=dog, String2=puppy, Method=manhattan, Output=9.7229723788028              
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):028:1* methods.each do |method|
irb(main):029:1*   Embeddings.test_strings({string1:"dog",string2:"warrior",method:method})
irb(main):030:0> end
String1=dog, String2=warrior, Method=dot, Output=0.8396468896184742
String1=dog, String2=warrior, Method=cosine, Output=0.839646907086438
String1=dog, String2=warrior, Method=euclidean, Output=0.566309261053687
String1=dog, String2=warrior, Method=manhattan, Output=14.335594067517825
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):031:1* methods.each do |method|
irb(main):032:1*   Embeddings.test_strings({string1:"Star Wars",string2:"space opera",method:method})
irb(main):033:0> end
String1=Star Wars, String2=space opera, Method=dot, Output=0.885637672420478
String1=Star Wars, String2=space opera, Method=cosine, Output=0.8856376051379655        
String1=Star Wars, String2=space opera, Method=euclidean, Output=0.47825182393844845
String1=Star Wars, String2=space opera, Method=manhattan, Output=12.30534823390999
=> ["dot", "cosine", "euclidean", "manhattan"]                         
irb(main):034:0> 

On the other hand, I would have expected “Star Wars” and “space opera” to have a much better similarity score. Like you said, “embedding seems broken”, TBH.

Anyway, I’m just playing around trying to understand these embeddings and how accurate they are in practice. So far, the test results defy common sense and intuition at times (and often).

2 Likes

One of the things I have seen is that the ada-002 embeddings tend to be pretty bad at the one- or two-word level (~5 tokens). So I am wondering if they get better as the number of words increases. The window size is 8k tokens, so lots of words. And so I'm guessing the optimal spot isn't 8k tokens, or even 4k tokens, but maybe 100 to 1000 tokens. Not sure. But there is an optimum somewhere, I'm guessing, greater than 5 tokens and probably less than 2k tokens. Just a guess.

It’s hard to measure empirically, I know.

But I think one of the better practices is to slice your data into sentences, paragraphs and pages. Then correlate at each level. Then look at the sentences and paragraphs (and pages) surrounding the top hits for more context. Then feed this context through the AI engine to get it to “think” and generate more questions and responses. This should create killer prompts with amazing well-founded super-human results if it’s put together well.
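
As a rough illustration of that slicing idea, here is a Python sketch (the chunking rules are deliberately crude, and the embed helper just wraps the same endpoint used in the original question; none of this is a prescribed pipeline):

import numpy as np
import openai

def embed(texts):
    # Same call as the code in the original question (old openai.Embedding API).
    resp = openai.Embedding.create(input=texts, engine="text-embedding-ada-002")
    return [d["embedding"] for d in resp["data"]]

def build_index(pages):
    # Slice each page into paragraphs and sentences (naive splits, just for illustration)
    # and embed every level, remembering which page each chunk came from.
    chunks = []
    for p_id, page in enumerate(pages):
        chunks.append(("page", p_id, page))
        for para in page.split("\n\n"):
            chunks.append(("paragraph", p_id, para))
            for sent in para.split(". "):
                chunks.append(("sentence", p_id, sent))
    vecs = np.array(embed([text for _, _, text in chunks]))
    return chunks, vecs

def top_hits(query, chunks, vecs, k=5):
    # Correlate the query against every level; dot product == cosine for unit vectors.
    q = np.array(embed([query])[0])
    sims = vecs @ q
    best = np.argsort(sims)[::-1][:k]
    return [(chunks[i], float(sims[i])) for i in best]

# The neighbouring sentences/paragraphs on the same page as each hit can then be
# gathered as extra context and fed to the completion model as the prompt.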

3 Likes

Last crude test for now: it loops through “methods” and a list of comparison strings (against “dog”); the output is the similarity “score” for each method and pair of strings. Added a little sleep so as not to be “rate-limited” out of the loop :slight_smile:

irb(main):075:0> methods
=> ["dot", "cosine", "euclidean", "manhattan"]
irb(main):063:0> compare=["cat", "asteroid", "rock fish", "submarine", "gemstone","dog food","chatgpt"]
=> ["cat", "asteroid", "rock fish", "submarine", "gemstone", "dog food", "chatgpt"]
irb(main):069:1* compare.each do |phrase|
irb(main):070:2*   methods.each do |method|
irb(main):071:2*     Embeddings.test_strings({string1:"dog",string2:phrase,method:method}); sleep 5
irb(main):072:1*   end
irb(main):073:0> end
String1=dog, String2=cat, Method=dot, Output=0.9164348051073471
String1=dog, String2=cat, Method=cosine, Output=0.9164348102929122                                                     
String1=dog, String2=cat, Method=euclidean, Output=0.40881582463070715                                                 
String1=dog, String2=cat, Method=manhattan, Output=10.538499585644807
                                                  
String1=dog, String2=asteroid, Method=dot, Output=0.8246138517753474                                                   
String1=dog, String2=asteroid, Method=cosine, Output=0.8246138832545461                                                
String1=dog, String2=asteroid, Method=euclidean, Output=0.592260263820192                                              
String1=dog, String2=asteroid, Method=manhattan, Output=14.8466230262702   
                                            
String1=dog, String2=rock fish, Method=dot, Output=0.8351781225977196                                                  
String1=dog, String2=rock fish, Method=cosine, Output=0.8351781121091175                                               
String1=dog, String2=rock fish, Method=euclidean, Output=0.5741461311561782                                            
String1=dog, String2=rock fish, Method=manhattan, Output=14.63262434113021 
                                            
String1=dog, String2=submarine, Method=dot, Output=0.8356327264031151                                                  
String1=dog, String2=submarine, Method=cosine, Output=0.8356327585001753
String1=dog, String2=submarine, Method=euclidean, Output=0.5733537044205766
String1=dog, String2=submarine, Method=manhattan, Output=14.554428136017805

String1=dog, String2=gemstone, Method=dot, Output=0.8032192820692583
String1=dog, String2=gemstone, Method=cosine, Output=0.803219292286573
String1=dog, String2=gemstone, Method=euclidean, Output=0.6273447301289572
String1=dog, String2=gemstone, Method=manhattan, Output=15.739283096362826

String1=dog, String2=dog food, Method=dot, Output=0.9305799323955518
String1=dog, String2=dog food, Method=cosine, Output=0.9305799640298726
String1=dog, String2=dog food, Method=euclidean, Output=0.3726124893512008
String1=dog, String2=dog food, Method=manhattan, Output=9.488634241251495

String1=dog, String2=chatgpt, Method=dot, Output=0.8203660566183042
String1=dog, String2=chatgpt, Method=cosine, Output=0.820366052379735
String1=dog, String2=chatgpt, Method=euclidean, Output=0.5993896037609885
String1=dog, String2=chatgpt, Method=manhattan, Output=15.415643926722199
=> ["cat", "asteroid", "rock fish", "submarine", "gemstone", "dog food", "chatgpt"]
irb(main):074:0> 

HTH

Same thing, but change the order of the loops, for fun…

irb(main):076:1* methods.each do |method|
irb(main):077:2*   compare.each do |phrase|
irb(main):078:2*     Embeddings.test_strings({string1:"dog",string2:phrase,method:method}); sleep 5
irb(main):079:1*   end
irb(main):080:0> end
String1=dog, String2=cat, Method=dot, Output=0.9164348051073471
String1=dog, String2=asteroid, Method=dot, Output=0.8246138517753474
String1=dog, String2=rock fish, Method=dot, Output=0.8351781225977196
String1=dog, String2=submarine, Method=dot, Output=0.8356327264031151
String1=dog, String2=gemstone, Method=dot, Output=0.8032192820692583
String1=dog, String2=dog food, Method=dot, Output=0.9305799323955518
String1=dog, String2=chatgpt, Method=dot, Output=0.8203660566183042 

String1=dog, String2=cat, Method=cosine, Output=0.9164348102929122  
String1=dog, String2=asteroid, Method=cosine, Output=0.8246138832545461
String1=dog, String2=rock fish, Method=cosine, Output=0.8351781121091175
String1=dog, String2=submarine, Method=cosine, Output=0.8356327585001753
String1=dog, String2=gemstone, Method=cosine, Output=0.803219292286573
String1=dog, String2=dog food, Method=cosine, Output=0.9305799640298726
String1=dog, String2=chatgpt, Method=cosine, Output=0.820366052379735

String1=dog, String2=cat, Method=euclidean, Output=0.4089459068501862
String1=dog, String2=asteroid, Method=euclidean, Output=0.592260263820192
String1=dog, String2=rock fish, Method=euclidean, Output=0.5741461311561782
String1=dog, String2=submarine, Method=euclidean, Output=0.5733537044205766
String1=dog, String2=gemstone, Method=euclidean, Output=0.6273447301289572
String1=dog, String2=dog food, Method=euclidean, Output=0.3726124893512008
String1=dog, String2=chatgpt, Method=euclidean, Output=0.5993896037609885

String1=dog, String2=cat, Method=manhattan, Output=10.538499585644807
String1=dog, String2=asteroid, Method=manhattan, Output=14.8466230262702
String1=dog, String2=rock fish, Method=manhattan, Output=14.63262434113021
String1=dog, String2=submarine, Method=manhattan, Output=14.554428136017805
String1=dog, String2=gemstone, Method=manhattan, Output=15.739283096362826
String1=dog, String2=dog food, Method=manhattan, Output=9.488634241251495
String1=dog, String2=chatgpt, Method=manhattan, Output=15.415643926722199
=> ["dot", "cosine", "euclidean", "Manhattan"]

One could argue that, in this data set, the Manhattan method is preferable; but it’s a matter of use case and preference, etc.

2 Likes

Same thoughts here… When I have more time, I’ll repeat these tests with larger phrases with a denser token count, for fun.

Feel free (anyone) to provide text for the tests :slight_smile:

1 Like

It’s hard to tell for sure. Can you sort these as most similar to most dissimilar according to each metric?

Maybe have:

‘dot’ (largest to smallest since it is a similarity)
1)
2)
3)

‘euclidean’ (smallest to largest since it is a distance)

‘Manhattan’ (smallest to largest since it is a distance)

Also note that your cosine and dot values are the same, which is expected with unit vectors.

1 Like

Here 'ya go, @curt.kennedy

  • Removed the cosine method since it is the same as dot for the unit vector
  • Sorted, as requested, I think :slight_smile:
String1=dog, String2=dog food, Method=dot, Output=0.9305799323955518
String1=dog, String2=cat, Method=dot, Output=0.9164348051073471                                                              
String1=dog, String2=submarine, Method=dot, Output=0.8356327264031151                                                        
String1=dog, String2=rock fish, Method=dot, Output=0.8351781225977196                                                        
String1=dog, String2=asteroid, Method=dot, Output=0.8246138517753474                                                         
String1=dog, String2=chatgpt, Method=dot, Output=0.8203660566183042                                                          
String1=dog, String2=gemstone, Method=dot, Output=0.8032192820692583      
                                                   
String1=dog, String2=dog food, Method=euclidean, Output=0.3726124893512008                                                   
String1=dog, String2=cat, Method=euclidean, Output=0.40881582463070715                                                       
String1=dog, String2=submarine, Method=euclidean, Output=0.5733537044205766                                                  
String1=dog, String2=rock fish, Method=euclidean, Output=0.5741461311561782                                                  
String1=dog, String2=asteroid, Method=euclidean, Output=0.592260263820192                                                    
String1=dog, String2=chatgpt, Method=euclidean, Output=0.5993896037609885                                                    
String1=dog, String2=gemstone, Method=euclidean, Output=0.6273447301289572  
                                                 
String1=dog, String2=dog food, Method=manhattan, Output=9.488634241251495
String1=dog, String2=cat, Method=manhattan, Output=10.538499585644807
String1=dog, String2=submarine, Method=manhattan, Output=14.554428136017805
String1=dog, String2=rock fish, Method=manhattan, Output=14.63262434113021
String1=dog, String2=asteroid, Method=manhattan, Output=14.8466230262702
String1=dog, String2=chatgpt, Method=manhattan, Output=15.415643926722199
String1=dog, String2=gemstone, Method=manhattan, Output=15.739283096362826

It seems to me the Euclidean method has the “nicest” (“prettiest”) dynamic range.

Agree or not?

It seems the embedding is always(?) spikey in the same way. It could be positional encoding or something, as it is similar in the less extreme positions as well. These shared spikes pull the similarities together; even just clamping the negative spike at dimension 196 and the positive one at 956 makes a difference.
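
If anyone wants to try this on their own vectors, a quick sketch of the kind of clamping I mean (the dimensions 196 and 956 are simply where I am seeing the spikes, and the 0.05 clamp limit is arbitrary; verify both on your own embeddings first):

import numpy as np

SPIKE_DIMS = [196, 956]   # positions of the big down/up spikes in my tests

def clamp_spikes(vec, dims=SPIKE_DIMS, limit=0.05):
    # Clamp the outlier coordinates, then re-normalize back to unit length.
    v = np.asarray(vec, dtype=np.float64).copy()
    v[dims] = np.clip(v[dims], -limit, limit)
    return v / np.linalg.norm(v)

# Comparing cosine similarities before and after clamping shows how much those
# two shared coordinates alone pull every pair of vectors toward each other.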

2 Likes

Here I am looking at whether the embedding ranking changes depending on the metric used. It looks like it's the same no matter how you measure the distance/similarity.

So it's just a matter of which one is more efficient to compute. I'm thinking the Manhattan one is the most efficient (just a vector delta and absolute value). Second least expensive is the dot product. And most expensive is the Euclidean distance.

I may have to try the Manhattan and see if that is truly faster in production. I can fly through 400k embeddings using dot product in about 1 second without using a vector database. Maybe I can hit 1M or more using Manhattan. Thanks for the testing/experimenting. It is valuable!
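
For what it's worth, both scans vectorize in a line or two of numpy. A rough sketch, with random unit vectors standing in for real embeddings (100k rows rather than 400k, just to keep the example's memory footprint small):

import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 1536), dtype=np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # unit vectors, like ada-002
query = corpus[123]

# Dot product scan: one matrix-vector product, take the largest scores.
top_dot = np.argsort(corpus @ query)[::-1][:10]

# Manhattan scan: vector delta + absolute value + sum, take the smallest distances.
top_l1 = np.argsort(np.abs(corpus - query).sum(axis=1))[:10]

# The rankings agreed in the tests above; which scan is actually faster depends on
# the hardware and BLAS, so it is worth timing both on the production box.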

1 Like

Is it always spikey? Or are there just a few that have the spikes?

I've only tried 50 or so sentences, but it seems to always be spikey, and always in the same way (like that big dip down always happens at the 196th float, and the smaller up one at 956). I tried single words and giant sentences as well, same results. If I zoom in, the ups and downs are still 'somewhat' similar. Not sure if that is the same for everyone; maybe it's even interleaving the API key or something.

1 Like

Well that is interesting and it starts to explain my original observation that embeddings from ada-002 only seem to span a cone of about 54 degrees wide.

It’s all of these correlated coordinates in the embedding vectors.

Why are they so correlated? No idea.
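
For anyone who wants to reproduce the cone measurement on their own data, a minimal numpy sketch (it assumes vecs is a matrix of ada-002 unit vectors, one embedding per row):

import numpy as np

def cone_width_degrees(vecs, sample=2000, seed=0):
    # Estimate the widest angle between any two embeddings in a random sample.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vecs), size=min(sample, len(vecs)), replace=False)
    sub = vecs[idx]                                  # unit vectors: dot product == cosine
    sims = sub @ sub.T
    min_sim = sims[~np.eye(len(sub), dtype=bool)].min()
    return float(np.degrees(np.arccos(np.clip(min_sim, -1.0, 1.0))))

# With ada-002 embeddings the minimum pairwise similarity tends to sit around
# 0.6 to 0.7 in the experiments above, i.e. a widest angle of roughly 45-55 degrees.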