Yes, it’s a shame they haven’t released much information on the internals, or access to the embedding model. Perhaps they will later, as they eventually did with GPT-2. It would be nice to know how the embeddings are created and to have access to the decoder part of the model to generate text from embeddings (though I suspect that poses a security risk?)
Like you, the lowest scores I found were ~0.6 (~52.5 degrees), and these were generally between texts from different domains (e.g. natural language vs code), which makes sense.
I did find lower scores using two strings found by @anon22939549 in this post: /t/fine-tuning-or-update-embedding-of-a-string/320955/9
He said he found them quickly but didn’t say how.
I don’t think it’s worth spending much time worrying about the dynamic range though. It’s really not an issue.
It’s also illusory IMO: it’s just not what people are used to from smaller models, and the familiar range is easily restored by renormalising the values (e.g. subtract 0.75 and rescale).
(As I mentioned before, that won’t restore the expected semantic ordering of “uncorrelated” and “anticorrelated”, if that’s needed; I don’t need it in my work for now.)
Simple example similarity function
import numpy as np

# 2023-10-20 Author: Gruff Davies
def similarity_transform(cosine_sim, min_deg=25, max_deg=37):
    """
    Dependencies: import numpy as np

    Transforms and renormalises cosine similarities between ada-002 embeddings
    to give more useful predictive values. Output ranges over (0, 1), not (-1, +1),
    since "anti-similar" is just not likely to ever occur in high-D representations.

    1. First converts the cosine similarity into an angular separation in degrees,
       which significantly helps "stretch" squashed results into a more useful range.
    2. Renormalises the range using the supplied min and max separations
       (experiment to find good values for your data).

    Returns:
        A similarity metric in the range (0, 1).
    """
    deg = np.degrees(np.arccos(cosine_sim))  # optional, but nice and intuitive
    similarity = (max_deg - np.clip(deg, min_deg, max_deg)) / (max_deg - min_deg)
    return similarity
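A quick usage example (the raw scores here are illustrative, not real ada-002 outputs):

raw_scores = np.array([0.61, 0.72, 0.85, 0.99])  # hypothetical ada-002 cosine similarities
print(similarity_transform(raw_scores))
# 0.61 and 0.72 are more than 37 degrees apart, so they clip to 0.0;
# 0.85 lands around 0.43; 0.99 saturates at 1.0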
More sophisticated approaches
One way to do that is to identify the feature dimensions of linguistic interest for one’s specific application and project onto a smaller space with only those bases (making sure to send the irrelevant dominant bases to null). Identifying the noise contributors seems fairly straightforward. Even a blunt “get rid of the top 1%” strategy seems to be effective according to the paper you shared, but I think a bit of analysis first would be prudent. A rough sketch of the idea follows below.
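Something along these lines (a minimal sketch, not the paper’s exact recipe; it assumes you already have a matrix of ada-002 embeddings and simply nulls out the top-k dominant directions found by SVD):

import numpy as np

def remove_dominant_directions(embeddings, k=3):
    # embeddings: (n_texts, 1536) array of ada-002 vectors
    # k: number of dominant shared directions to null out (tune for your data;
    #    "top 1%" of 1536 dims would be ~15)
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    top = vt[:k]                                     # (k, 1536) dominant bases
    cleaned = embeddings - embeddings @ top.T @ top  # project onto their null space
    # Re-normalise so cosine similarity is well-defined afterwards
    return cleaned / np.linalg.norm(cleaned, axis=1, keepdims=True)

Whether to centre on the mean first is the same mean-subtraction question I get to below.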
I would caution, though, that expecting -1, 0 and +1 for cosine similarities is really a heuristic learned from the early days of word2vec, and it has aged badly now that we have high-D embeddings. Even if you get a clean, large set of semantically meaningful feature bases, you would still need every single feature reversed to get a strongly negative comparison. With several hundred features, that just isn’t going to happen. Almost all texts have so much in common that they’re bound to produce high similarity scores.
Where specific features are of interest (e.g. sentiment), extract those features (create a transform that maps onto a small feature set) and then compute cosine similarity in that smaller space, as in the sketch below.
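For instance (a sketch only; the feature_directions matrix is hypothetical and you would have to construct it yourself, e.g. from embeddings of contrasting exemplar texts):

import numpy as np

def feature_similarity(emb_a, emb_b, feature_directions):
    # feature_directions: (n_features, 1536) matrix, one (unit) vector per
    # linguistic feature of interest -- hypothetical, supplied by you
    a = feature_directions @ emb_a   # project down to a small feature vector
    b = feature_directions @ emb_b
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))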
Personally, I wouldn’t bother calculating average vectors and subtracting them, as that depends too heavily on computing the mean from a good, representative set. It is a reasonable idea and seems to work, but it might introduce unexpected bias and fail on certain outliers.
Gruff
Oh, one final thought: I get why you might be perplexed about the embeddings “living in a tiny cone” but I would actually question that.
The embeddings seem to be in a hypercone with an apex angle of ~60 degrees in 1536 dimensions. That’s unimaginably vast. I’ve only tested small texts; I don’t know how large texts of ~8k tokens embed. Have you, or anyone reading this thread, done tests on big texts? Maybe they extend further.
As a fraction of the entire embedding space it seems small, but that space is so unimaginably vast in the first place that this is almost inevitable. The unit hypersphere itself is tiny compared to the volume of the embedding space (hypersphere volumes actually tend to 0% of the unit hypercube’s volume in the limit of high dimensions).
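To make that concrete, here’s a quick sanity check using the standard n-ball volume formula (scipy’s log-gamma keeps the numbers from overflowing):

import numpy as np
from scipy.special import gammaln

def ball_to_cube_volume_ratio(n):
    # Volume of the unit n-ball divided by its enclosing hypercube [-1, 1]^n:
    # (pi^(n/2) / Gamma(n/2 + 1)) / 2^n, computed in log space for stability
    log_ratio = (n / 2) * np.log(np.pi) - gammaln(n / 2 + 1) - n * np.log(2)
    return np.exp(log_ratio)

print(ball_to_cube_volume_ratio(3))     # ~0.52: the familiar sphere-in-a-cube
print(ball_to_cube_volume_ratio(20))    # ~2.5e-8: already negligible
print(ball_to_cube_volume_ratio(1536))  # underflows to 0.0: effectively nothing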
It’s only an issue if the anisotropy impacts performance, which I don’t think it does. It feels weird from a 3D perspective, but it’s hard to develop good intuitions about very high D spaces. Things get quite weird.