import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

prompt = "I like chatting with embeddings"

# Deliberately send the embeddings model to the completions endpoint,
# just to see what tokens it produces.
r = openai.Completion.create(
    model="text-embedding-ada-002",
    prompt=prompt,
    max_tokens=2,
    temperature=0,
    logprobs=5,
)

s = r['choices'][0]['text']                # the "continuation" text
t = str(r['usage']).replace("\n", "")      # token usage, flattened to one line

print(f"<{prompt}>", end="")
print(r['choices'][0]['logprobs']['top_logprobs'])
Weird, so how would you use this? Token prediction?
It has a massive 8k context. So maybe a corner case of low-cost, simple predictions on long documents?
The embeddings model doesn’t write more than its one-token inner-state “thought” (a few tokens if you’re lucky), so no, you don’t get the cheapest chat ever.
It does give you another way to ponder its inner workings, though, if it doesn’t just spit out spaces or numbers.
<A cute kitten>
Cat
Cat
Cat
You’d never find it, but somehow I stumbled upon the foundational paper of OpenAI’s GPT-3 embeddings from January 2022, by searching for an unreleased model name that was in another source.
It also gives us more figures for the ada/babbage/curie/davinci models that are going away and their underlying GPT-3:
It also describes a technique that may be different in the chat/3.5/mystery 1536-dimension model now used:
We insert two special token delimiters, [SOS] and [EOS], to the start and end of the input sequence respectively. The hidden state from the last layer corresponding to the special token [EOS] is considered as the embedding of the input sequence.
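To make the quoted idea concrete (OpenAI's actual code isn't public, so this is only a sketch using GPT-2 as a stand-in for the GPT-3 base), here is how you would pull a last-layer, last-token hidden state out of a causal LM with Hugging Face transformers and treat it as the sequence embedding:

```python
# Sketch only: GPT-2 stands in for the unreleased GPT-3 embedding bases.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

text = "I like chatting with embeddings"
# Append the EOS token; its final-layer hidden state will serve as the embedding.
inputs = tok(text + tok.eos_token, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

emb = out.hidden_states[-1][0, -1, :]   # last layer, last position (the EOS token)
emb = emb / emb.norm()                  # unit-normalize, like ada-002 returns
print(emb.shape)                        # 768 dimensions for GPT-2 small
```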
Additional: Google doc of how these performed rather poorly compared to SOTA at the time
November 6 edit:
A new embed model from Cohere came out just a few days ago.
It basically scorches the earth in English benchmarks, especially on query->document retrieval, where text-embedding-ada-002 sits at #18.
It is free for a rate-limited trial, and the same price per (their) token as ada, if my math is right. 1024 dimensions, sequence length 512 (vs. OpenAI’s 8k).
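If you want to try it, a minimal sketch using the Cohere Python SDK (the model name and input_type values below are the ones Cohere documented for the v3 release; treat them as assumptions and check the current docs):

```python
import os
import cohere

co = cohere.Client(os.getenv("COHERE_API_KEY"))

docs = ["Refund policy for damaged items", "How to reset your password"]

# The v3 embed models require an input_type: "search_document" for the corpus,
# "search_query" for the query side.
doc_resp = co.embed(texts=docs, model="embed-english-v3.0",
                    input_type="search_document")
query_resp = co.embed(texts=["password reset"], model="embed-english-v3.0",
                      input_type="search_query")

print(len(doc_resp.embeddings[0]))   # 1024 dimensions
```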
There’s another, GTE from Alibaba, that I actually went looking for after reading a paper, but so far it has only been evaluated on a few of the benchmarks, and it places low.
But it’s sad the Cohere models only support 512 input tokens max. Like 300 words or so.
Also, what is the performance difference if the MTEB score is 0.02 different? Is this even noticeable? The units aren’t defined. Are they just percentages?
Ada-002 has a value of 60.99, and this latest leader (from Cohere) is at 64.47.
Am I going to notice the 3.48 (percent?) difference?
What I am going to notice is the much smaller context size. The smaller vector dimension size is noticeable, but not a driver for me.
I want to embed tons of documents! Not little paragraphs!
Thank you for this snippet, curt.
But as far as I understand, you are taking the top D dimensions. On the contrary (and counterintuitively), the arXiv paper you listed, as its name (“All-but-the-Top”) suggests, proposes excluding these dimensions and keeping all the others.
Am I right? Have you also tried their approach?
Thanks for pointing this massive error out!
I updated the code above with this line:
U = U[D:,:] # take All But The Top!
It is working much better now! (More spread)
But now I’m wondering how many dimensions I should really drop. It’s set to 15 right now, but whoever uses this needs to examine this situation in more detail.
PS: I am not using this in production now, so I have very little insight. But the variance is higher and the cosine similarities still make sense when the top components are dropped.
I’m glad to discover the results were even better.
I would really appreciate reading more about how this impacts the results, if you would like to share any other details of your experiments.
Moreover, I am still afraid that using my personal average vector, built from a single language and a limited set of topics, might not be as good as your mean vector.
I am wondering if we should (and could) share a public common “universal embedding vector” so that everyone interested could add their personal mean vector to update it and obtain a realistic vector.
What do you think about it?
512 tokens is neatly about the limit of what OpenAI models call a “summary” when they write.
Summarizing could level out the predominance of one document with many chunks, and bring main points to the fore with less “repetition anti-penalty”. Summarize with a 100k-context model at about $1 a call if you’ve got something to do with the whole document instead of AI-context-loadable bites.
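A minimal sketch of that summarize-then-embed idea, using the same old-style openai SDK as the snippet at the top (the chat model name here is just an example):

```python
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

def summarize_then_embed(document: str) -> list:
    # 1. Compress the long document into a ~500-token "summary".
    summary = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",   # example; any long-context chat model works
        messages=[
            {"role": "system", "content": "Summarize the document in under 500 tokens."},
            {"role": "user", "content": document},
        ],
        temperature=0,
    )["choices"][0]["message"]["content"]

    # 2. Embed the summary instead of the raw document.
    return openai.Embedding.create(
        model="text-embedding-ada-002",
        input=summary,
    )["data"][0]["embedding"]
```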
It’s funny that most of the discussions on this thread (except for the leading spaces thing) were also going on at the company I was working for. Glad we were not hallucinating lol.
BTW, not sure if you all ever tried to check how deterministic the Ada-002 model is. As an embedding model, it should be deterministic (because it’s a frozen layer of a Transformer), but we found that this is not the case if you are using the standard OpenAI API. However, if you use Azure OpenAI, it is deterministic. Speaking with an OpenAI staff member during Dev Day, he said this could be a bug in the standard Ada-002 API.
We have a write up about it. Hope it adds to the discussion:
I’ve got a lot of embedding projects, but the one I am talking about when I say “large documents” is basically a personal assistant or “personal digital brain” that has access to hundreds of thousands (maybe millions) of documents from books, lines of code, notes, etc. Since there would be so much data, and since model context windows will only go up and prices will only come down, it’s best to stick with the larger chunks and avoid the inevitable compression loss that occurs with summarization.
Also, as a technicality, I am running hybrid search, so Keywords and Embeddings, and the Keyword leg benefits massively with no compression.
But yeah, for smaller projects, embedding the summary might be good. It’s just an extra step, added complexity, and goes against the trend of models getting more powerful, larger windows, and cheaper to run.
So as time goes on, you just need to re-embed your data with whatever latest larger context engine, and this is really cheap, if not free if using open source. So this is a longer term vision as well. So when embedding model Z comes out with better metrics and half the vector dimension, you migrate to that one!
So with fewer embeddings (because of larger chunks) your search is also much faster, because each vector takes time to assess during search.
But there is a point in “dilution” of larger documents. So if you want more attention for those detail laden documents, you simply choose to chunk it into smaller pieces.
So the information density per embedding is up to you as you embed, and not restricted by a limited embedding model.
So you can go big for giant pages and pages of stuff with lower information density per document (or chunk). Or embed smaller chunks to lower the overall information density per chunk for information rich documents.
So, like I said, it’s just an easy knob to declare a document a certain information density and chunk and embed appropriately. (You could probably do this programmatically based on keyword density)
Also, don’t forget embedding large chunks and smaller subset chunks at the same time in the same database. Here the information density per embedding vector varies considerably; you are now exposing “big picture” and “little picture” to the engine at the same time. It’s up to you!
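A tiny sketch of that big-picture/little-picture idea, embedding the same document at two granularities into one collection (the chunk sizes, helper names, and the plain-list “DB” are placeholders, not anything from this thread):

```python
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

def embed(text: str) -> list:
    return openai.Embedding.create(
        model="text-embedding-ada-002", input=text
    )["data"][0]["embedding"]

def chunk(text: str, size: int) -> list:
    # Naive character-based chunking; swap in a token-aware splitter in practice.
    return [text[i:i + size] for i in range(0, len(text), size)]

def index_document(doc_id: str, text: str, db: list) -> None:
    # Coarse "big picture" chunks and fine "little picture" chunks,
    # stored side by side in the same vector collection.
    for level, size in [("coarse", 8000), ("fine", 1000)]:
        for n, piece in enumerate(chunk(text, size)):
            db.append({
                "doc_id": doc_id,
                "level": level,
                "chunk": n,
                "text": piece,
                "vector": embed(piece),
            })
```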
So bottom line, having a large instantaneous dynamic range using a large context embedding model gives you so much leeway and freedom to create an optimal RAG solution. Optimal in terms of search speed (fewer vectors, faster) and optimal in terms of quality.
The only analogy I can think of that is similar here is Variable Bit Rate (VBR) encoded audio files. With VBR you are able to encode information with varying bit density depending on the complexity of the audio frame being encoded. So overall the user can fit more files on their device without any apparent loss in quality, since the bits are distributed based on information need rather than uniformly allocated and usually over-provisioned, which leads to larger files without any real benefit.
Also don’t forget about embedding model redundancy. For example, if ada-002 goes down (another DDoS attack?) and you are in production, my system has a default Keyword based model that would suffice as a backup, but let’s say you just wanted to use embeddings. Well, you need another model. Either local or from another API.
The cool thing is you can run multiple embeddings off the same data, say partitioned differently for different models, and hybridize the returns across all the available models. So if you run 10 search models, you blend them with RRF (Reciprocal Rank Fusion) across all available endpoints to harden your production system against outages.
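For anyone unfamiliar with RRF, a minimal sketch of fusing ranked lists from several retrievers (k=60 is the commonly cited default constant, not something from this thread):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Blend several ranked result lists into one; items that rank well
    in any list float to the top. Each inner list is ordered best-first."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a keyword leg, an ada-002 leg, and a backup embedding model's leg:
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # keyword search
    ["doc1", "doc3", "doc9"],   # ada-002 embeddings
    ["doc1", "doc9", "doc3"],   # backup embedding model
])
print(fused)
```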
Anyway, my TED Talk is over.
I would keep your own personal embedding vector specific to each collection you are processing. Otherwise I’d worry your PCA representations would not be faithful to your corpus.
My thought on ABTT and other vector transformations is: first, try to measure, with a single number, the amount of increased isotropy in your new collection.
So imagine averaging all your embedding vectors before transforming them. I’m thinking all the vectors are unit vectors on some high dimensional unit sphere (like in ada-002, and most other embedding engines).
So the magnitude of the average represents your initial spread, or diversity, from the initial raw embeddings from the engine. This will be a number less than one, say for example, it is 0.923423 (down from 1.0).
Then you run ABTT, or some other procedure, say push them back out to the unit sphere, and then average that new collection. You would expect a lower number if it’s more spread. Say the new number is 0.3324.
So a Figure of Merit (FOM) for this transformation, at least as far as isotropy is concerned, if you like dB’s, is:
FOM = 10 * log10(PreviousAverageLength / NewAverageLength) dB
In this hypothetical example, I have 4.4 dB of improvement after transforming the embedding vectors. Not bad! This has no upper bound since it is theoretically possible to get an average vector very close to zero magnitude. But to get this, your original corpus has to be very diverse and densely diverse at the same time. So don’t expect 100 dB or anything!
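A quick sketch of computing that FOM on a pile of unit-norm embeddings (the array names are mine, not from the post above):

```python
import numpy as np

def isotropy_fom_db(before: np.ndarray, after: np.ndarray) -> float:
    """before/after: shape (n_vectors, dim), each row a unit vector.
    Returns the dB improvement in spread (smaller mean-vector length = more isotropic)."""
    prev_len = np.linalg.norm(before.mean(axis=0))
    new_len = np.linalg.norm(after.mean(axis=0))
    return 10.0 * np.log10(prev_len / new_len)

# Reproducing the numbers above: 0.923423 -> 0.3324 is about 4.4 dB.
print(10.0 * np.log10(0.923423 / 0.3324))   # ~4.44
```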
Another metric, which is harder to measure, is the semantic one, and if the transformed vectors actually express more meaning or not. This is one I haven’t done personally, so that’s why I think it’s hard. So not sure how to quantify this easily or accurately without just “looking at the data”.
But intuitively, since the transformations are largely linear on your collection, you aren’t getting strange clustering, where there was none before. If anything you are breaking apart some of those clusters, but the ABTT procedure is saying there is more information in the long tails, rather than the biases in the last hidden layer of the model. This is a plausible hypothesis IMO.
I did a big pull to characterize ada’s non-determinism.
It likely has something to do with the 3.5-turbo and other bases and how they are put together. Logits of 3.5-turbo-instruct also show it’s not deterministic, to the degree that even top logits sometimes flip.
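If you want to run that kind of check yourself, a small sketch (old-style openai SDK, same as above) that embeds one string repeatedly and looks at the spread:

```python
import os
import numpy as np
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

text = "I like chatting with embeddings"
vecs = []
for _ in range(10):
    v = openai.Embedding.create(
        model="text-embedding-ada-002", input=text
    )["data"][0]["embedding"]
    vecs.append(np.array(v))

vecs = np.stack(vecs)
ref = vecs[0]
cos = vecs @ ref / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(ref))
print("min cosine similarity vs first call:", cos.min())
print("max abs component difference:", np.abs(vecs - ref).max())
```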
Embeddings seem like something I can even do on my little 2 GB NVIDIA Pascal card with the right model.
From what I’ve seen, the ada-002 non-determinism is in the noise in terms of semantics. So minor variations in cosine similarity.
But if these minor variations are a big deal to you, you could try a couple things.
- Make all embeddings deterministic by creating a history of all past embeddings: if a string comes in, you just use the previous embedding vector, or create a new embedding vector if there is no previous one, and write this new embedding vector to the DB.
But this does sweep the problem under the rug if you are looking at microscopic differences in the embedding vectors.
So to remedy this, go with an MLE (maximum likelihood estimate):
- Keep track of all embeddings, ever done, and for all embeddings done on a particular string, use the “most likely version”. So you could go as exotic as crafting your own PDF, and maybe picking the embedding at the mode (or peak) of this PDF. Or, if lazy, just use the average, or maybe even some time weighted average if you expect model drift.
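A rough sketch of both options, with a plain dict standing in for whatever DB you actually use (and the lazy running-average version of the “most likely” idea, re-normalized back onto the unit sphere):

```python
import os
import numpy as np
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

history = {}   # string -> list of every embedding ever returned for it

def embed_raw(text: str) -> np.ndarray:
    v = openai.Embedding.create(
        model="text-embedding-ada-002", input=text
    )["data"][0]["embedding"]
    return np.array(v)

def embed_cached(text: str) -> np.ndarray:
    # Option 1: first-call-wins cache -- always reuse the first vector we ever got.
    if text not in history:
        history[text] = [embed_raw(text)]
    return history[text][0]

def embed_mle(text: str) -> np.ndarray:
    # Option 2 (lazy version): keep every call, return the running average,
    # pushed back out to unit length.
    history.setdefault(text, []).append(embed_raw(text))
    avg = np.mean(history[text], axis=0)
    return avg / np.linalg.norm(avg)
```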
PS I don’t know why this non-determinism is happening. Guessing that even at temp=0 they are still letting in a little bit of noise. Some floor in the log-probs, like -100 dB, still is non-zero.
Hey curt, I’m really sorry to tell you that sadly I gave you a wrong correction! You were totally right, your first version was correct. I realized it re-reading the paper and better understanding the shape of pca.components_.
I now understand that the “all but the top” part refers to v_prime = v_tilde - v_projection, where v_projection is the sum of the projections of v_tilde onto the top D principal components.
I am very sorry!
OK, I commented out the change and put the other back in.
But you could try either one. I don’t know which version is truly right or wrong here.
They remove the “top” at first by removing the bias. But keep the remaining “top” using PCA. So they are keeping most of the variance of the original model by doing this.
But what if they actually tried to use “all but the top”, so remove the DC term and top PCA vectors. I can see that working as well.
So try either, the original and new changes are there.
It’s just a matter of where you think the embedding information really is.
Is it in the original hidden states of the model? Just minus the bias component?
Or should you nullify the entire DC bias and top components out, and look at the long tails instead?
Or should you ditch either and do your own analysis?
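For anyone landing here later, a sketch of the two readings being debated (this is my reconstruction, not the exact code from the other thread): demean the collection, run PCA, then either subtract the projections onto the top D components (the paper's formula) or keep only those top D directions. D=15 is just the value mentioned above, not a recommendation:

```python
import numpy as np
from sklearn.decomposition import PCA

def abtt_remove_top(X: np.ndarray, D: int = 15) -> np.ndarray:
    """All-but-the-Top as written in the paper: demean, then subtract the
    projections onto the top D principal components."""
    Xc = X - X.mean(axis=0)
    U = PCA(n_components=D).fit(Xc).components_       # shape (D, dim)
    X_out = Xc - (Xc @ U.T) @ U                       # remove the top-D directions
    return X_out / np.linalg.norm(X_out, axis=1, keepdims=True)

def keep_top_only(X: np.ndarray, D: int = 15) -> np.ndarray:
    """The other variant discussed: demean, then keep only the top D directions."""
    Xc = X - X.mean(axis=0)
    U = PCA(n_components=D).fit(Xc).components_
    X_out = (Xc @ U.T) @ U                            # project onto the top-D subspace
    return X_out / np.linalg.norm(X_out, axis=1, keepdims=True)
```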
I calculated this but forgot to post it. I’m not sure this has practical value except as a reminder that high-dimensional spaces aren’t intuitive: what seems like a small cone (31° around its central axis) subtends way more surface area in 1536D than in 3D, where it’s a fraction of one octant (in 2D a quadrant, in 3D an octant; the generic term is “orthant”).
The surface area relative to the whole n-sphere tends to zero very rapidly, but expressed as a ratio of the area of the sphere swept through one orthant, this explodes again after around 1400D, becoming around 10^16 times bigger than the sphere area in a single orthant. Wider cones explode much earlier, but 30 degrees is the threshold below which the whole cone cap is within a single orthant, and above which it starts to enter other orthants (hence the explosion as dimensions go up).
As I said, not sure this has practical value except as a reminder that objects in higher dimensions may be larger (and smaller) than they appear…
On a more practical note, in my tests, anything closer than 25° (cosine sim: 0.9) is close enough to be considered the same (for natural language expression, possibly not for, e.g. python code where indentation may matter), and things further apart than 37° are highly dissimilar. This seems like a terribly small range coming from smaller dimensions, but as the above charts show, in 1536D, that cone is actually massive - it’s just not at all intuitive that it would be…
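If you want to explore those cap-vs-orthant ratios yourself, here's a sketch using the standard hyperspherical-cap area formula: the fraction of the unit sphere's surface in R^n within angle theta of an axis is (1/2) * I_{sin^2(theta)}((n-1)/2, 1/2), where I is the regularized incomplete beta function. mpmath is used because the values underflow ordinary floats at these dimensions:

```python
from mpmath import mp, betainc, sin, radians, mpf, log10

mp.dps = 50   # high precision; these fractions are far below float underflow

def cap_fraction(n: int, theta_deg: float):
    """Fraction of the surface of the unit sphere in R^n lying within
    angle theta_deg of a chosen axis (for theta_deg <= 90)."""
    x = sin(radians(theta_deg)) ** 2
    return betainc((n - 1) / 2, mpf(1) / 2, 0, x, regularized=True) / 2

for n in (3, 100, 1536):
    frac = cap_fraction(n, 31)          # the 31-degree cone from the post above
    orthant = mpf(2) ** (-n)            # each orthant holds 1/2^n of the sphere's area
    print(n, "log10(cap fraction):", log10(frac),
          "log10(cap / orthant):", log10(frac / orthant))
```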
You will be issued a fine of one internet point for confusing the Curt. Please do not do this again.
I’ve been combing the docs and forums looking for information on ada-002: its relationship to the GPTs, what it was trained on and how, etc… and it all seems, well, mysterious. Is there something out there and I just haven’t found it?
Hello everyone, sorry if my question is a bit off topic, but I also came to the same conclusion about the limited dynamic range of the ada-002 embeddings. I was wondering what I could do in my case when I stumbled on this rich discussion.
I have to analyze short sentences (to be more specific, bank operation names) and compare them. I used ada-002 embeddings and cosine similarity, but given what was said earlier in this thread, I think that is a bit overkill.
As for what I did to counter this dynamic range problem, I just did a normalization.
Do you have any insight for me?
Any answer would be really helpful.