Anyway, I still think it’s a good idea to use a leading space before embedding. Without it, you are telling the model that the string is the start of a new line, which is not the general or expected case for most text.
Examples?
I am getting good correlations if I put in both a leading and a trailing space.
Here is my Good/Better/Best with a simple example:
Good:
Msg0 = " monkey ate a banana"
Msg1 = " banana"
Dot Product: 0.8202837430736987
Better:
Msg0 = "monkey ate a banana"
Msg1 = "banana"
Dot Product: 0.8666851902396663
Best:
Msg0 = " monkey ate a banana "
Msg1 = " banana "
Dot Product: 0.8848283562929536
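For anyone who wants to try this themselves, here is a minimal sketch of how such a comparison can be run. It assumes the current openai Python client and that ada-002 vectors come back (nearly) unit-length, so the dot product stands in for cosine similarity; exact scores will vary with the model version.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Return the ada-002 embedding for `text` as a numpy vector."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

# The three spacing variants from the post
pairs = [
    (" monkey ate a banana", " banana"),    # leading space only
    ("monkey ate a banana", "banana"),      # no padding
    (" monkey ate a banana ", " banana "),  # leading and trailing space
]
for msg0, msg1 in pairs:
    # ada-002 vectors are close to unit length, so dot product ~ cosine similarity
    print(repr(msg0), repr(msg1), float(np.dot(embed(msg0), embed(msg1))))
```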
I don’t think a trailing space makes sense. When does a space ever stand on its own? In a typical document, each space is joined with the word that follows it.
I am very curious about onBindViewHolder.
It’s a very common function in Android programming. The interesting part is that when it is compared to itself, but without the leading space, the distance is surprisingly large.
My initial thought was “typically it’s adapter.onBindViewHolder”, and that any occurrences of " onBindViewHolder" would be overrides.
BUT, the next strange thing is that Bitcoin, array_reset, break, FFT2 is a mix of programming and non-programming terms as well.
I don’t know if this means anything, but I tried pasting these values into the Playground with Davinci and Curie to see how they complete them, and in both cases the completions were related to programming.
This is because onBindViewHolder is a function call at the beginning of the line, with no indentation.
What this says is that the embedding engine is very sensitive to the expected positioning of things. Do they lead a line, end a line, or free-float within a line?
To be safe, in general, free-float might be the safest choice.
But in programming, beginning a line is probably the best choice.
Right, yeah that is insane. Maybe the model was undertrained on Android code? Similar to the mysterious ' petertodd' glitch token that it looks like OpenAI patched (ref).
Here is one from C, which I assume had tons of training data. The spacing around the item does seem to make sense: you expect int main() to appear only at the beginning of a line.
Let’s remember that each embedding is a vector. So here I keep the same vector for Msg0 and vary the other Msg1 vector.
So the wandering is with respect to the embedding of this second vector, and the model is trying to communicate the expected line positioning.
This is interesting!
But it tells me that, for straight-up sentences, you might be OK with no spacing on either side.
For word or sub-phrase searches, you might be better off with a leading and trailing space before embedding.
If anything, maybe add leading and trailing spaces to the query before embedding, and don’t worry about spacing on your data; just chop it exactly, with no added spacing. Does that make sense?
@Foxalabs You may not have to re-embed your 80 Gigs of data after all! Maybe just add some padding in the query? Maybe?
One theory I haven’t seen in skimming some of the petertoddology out there:
There is a fairly prominent GitHub user named petertodd associated with crypto, and the presence of this as a token in the tokenizer is almost certainly a result of him;
Crypto people tend to have their usernames sitting alongside varied cryptographic hashes all over the internet;
Cryptographic hashes are extremely weird things for a transformer, because unlike a person a transformer can’t just skim past the block of text; instead they sit there furiously trying to predict the next token over and over again, filling up their context window one 4e and 6f at a time.
So some of the weird sinkhole features of this token could result from a machine that tries to reduce entropy on token sequences, encountering a token that tends to live in strings of extremely high entropy.
I think a big reason why " onBindViewHolder" is such an outlier is that it is its own token ([58594]), vs the stripped version, which splits into three tokens ([263, 10154, 20867]).
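You can check the split yourself with tiktoken; cl100k_base is the encoding ada-002 uses. The token IDs in the comments are the ones reported above, not something re-verified here.

```python
import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode(" onBindViewHolder"))  # reported above as a single token: [58594]
print(enc.encode("onBindViewHolder"))   # reported above as three tokens: [263, 10154, 20867]
```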
I also want to agree with this theory:
Stuff like (woman - man) + king = queen works in embedding vector space.
However, the vector (woman-man) itself does not correspond to a word, it’s more something like “the contextless essence of femininity”. Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion how the results sometimes highlight implicit sexism in the language corpus).
Note that such vectors are closer to the average of all words, i.e. (woman - man) has roughly zero projection onto directions like “what language is this” or “is this a noun”, and onto most other directions in which normal words have large projections.
Based on this post, intuitively it seems the petertodd embedding could be something like “antagonist - protagonist” + 0.2 * “technology - person” + 0.2 * “essence of words starting with the letter n”…
so onBindViewHolder is not a combination of anything, because it’s its own …lonesome … thing
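As a rough sketch of the analogy arithmetic being described: whether it carries over cleanly from word2vec-style vectors to ada-002 sentence embeddings is an open question, and the `embed` helper plus the leading spaces are my own choices here.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    """Unit-normalized ada-002 embedding, so dot product == cosine similarity."""
    v = np.array(client.embeddings.create(
        model="text-embedding-ada-002", input=text).data[0].embedding)
    return v / np.linalg.norm(v)

# Classic word2vec-style analogy: king - man + woman ~ queen.
# The difference (woman - man) is not a word itself; it just shifts other
# concepts in a "feminine" direction, as described above. How well this
# holds for ada-002 is an empirical question.
analogy = embed(" king") - embed(" man") + embed(" woman")
analogy /= np.linalg.norm(analogy)

print("analogy . queen:", float(analogy @ embed(" queen")))
print("king    . queen:", float(embed(" king") @ embed(" queen")))
```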
Another loner (found in the comments) is " ForCanBeConvertedToForeach", which tokenizes to [80371].
This one is apparently a “glitch token”. Which " onBindViewHolder" isn’t.
This is super cool. So these tokens are commonly called “glitch tokens” and come in variations. Some are “unspeakables”, which cause ChatGPT to freeze up, and some are polysemantic (as seen above). The one common feature they all share (including " onBindViewHolder") is that each is a single token. Almost all of them relate to programming as well.
Would it make sense that all of these were tokenized, and then during the training process they were discarded and never seen by the model?
By giving it some delicious context, it jumps dramatically back: [' onBindViewHolder()', 'onBindViewHolder'] 0.8991457379833502
I think so. Without a leading space, it may be that the embedding model is already biased towards the start of each document. I still feel like a trailing space needs some testing, since[ typically][,][ tokens][ are][ split][ like][ this].
I agree that the trailing space doesn’t add as much value as the leading space.
Also let’s not forget about casing!
Maybe the best strategy is to lowercase the query and add a single leading space to the query string before embedding. But this is in the general language context, not code search, where you probably want to keep the casing (and with code, probably don’t put in the leading space, unless you are sure the code query strictly never starts a line; sounds like you need a classifier to predict this).
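In code, that query-side strategy might look like the small helper below. The `is_code` flag and its behavior are just my reading of the suggestion, not a tested recipe.

```python
def normalize_query(query: str, is_code: bool = False) -> str:
    """Prepare a query string for embedding, per the suggestion above.

    General language: lowercase and add a single leading space, so the query
    reads like mid-document text rather than the start of a line.
    Code search: keep the original casing and skip the leading space, since
    code identifiers often do start a line.
    """
    query = query.strip()
    if is_code:
        return query
    return " " + query.lower()

print(repr(normalize_query("Cosmic Phenomena")))        # ' cosmic phenomena'
print(repr(normalize_query("onBindViewHolder", True)))  # 'onBindViewHolder'
```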
The same <|endoftext|> just artificially makes them closer. You could use any token really. The goal is not to try and artificially make things closer!
It’s not really an outlier, except for the particular case that it was one of those that excretes nonsense from an AI. There’s plenty of code-specific tokens that ultimately can be traced back to one developer.
| 088335 | 018 | ‘.destroyAllWindows’ |
A fun phrase, but it won’t mean “destroy”. It is from OpenCV, a computer vision software package. In a normal word2vec-style embedding, with a million-plus tokens covering every single case, this would carry the semantics of the surrounding code and its inferred purpose.
I meant in regard to comparing it against itself with just whitespace in front, and how “far out” it is from everything else. Not that it’s a single token; quite literally that it’s an outlier.
Msg0 = " onBindViewHolder"
Msg1 = " Cosmic Phenomena - Stuff that exist outside of our Earth, or the Moon. This includes stars, aliens, and other extraterrestrial occurrences."
0.46412970174830337
That’s about 62° apart… which, if we interpret it as the apex angle of a hypercone (half-angle θ = 31° swept around a central axis), has mind-blowing geometric consequences…
I’ll need to do a bit of work to be able to calculate the exact result for the 1536-ball / 1535-sphere (going above d=1023 isn’t possible even with 64-bit double precision), but I’ve explored this geometry up to that and it’s bonkers.
It seems like a modest-sized cone in 3D, but in this many dimensions it is simultaneously absolutely vast and absolutely tiny. Took me a while to wrap my head around the result. I thought it was numerical precision errors accumulating (the calculations need things like the gamma function, which is just a factorial in even dimensions), but it turns out it was correct and provable analytically.
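Since the gamma-function route overflows 64-bit doubles well before d = 1536, here is one way to get at the same geometry with arbitrary precision. This is just the textbook spherical-cap identity evaluated numerically, not the exact result being teased below; mpmath and the 50-digit setting are my own choices.

```python
import math
from mpmath import mp, mpf, betainc, sin, radians

mp.dps = 50  # arbitrary precision sidesteps float64 overflow/underflow

dot = 0.46412970174830337
print(f"angle between the two embeddings: {math.degrees(math.acos(dot)):.2f} deg")  # ~62 deg

def cap_fraction(half_angle_deg: float, d: int = 1536):
    """Fraction of the unit (d-1)-sphere within `half_angle_deg` of a fixed
    direction (valid for half-angles up to 90 degrees), via the standard
    spherical-cap identity A(theta)/A_total = 0.5 * I_{sin^2 theta}((d-1)/2, 1/2)."""
    x = sin(radians(half_angle_deg)) ** 2
    return mpf("0.5") * betainc((d - 1) / 2, mpf("0.5"), 0, x, regularized=True)

# A 31-degree half-angle cone looks modest in 3D, but in 1536 dimensions it
# covers an almost unimaginably small fraction of the sphere.
print(cap_fraction(31.0))
```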
I won’t give spoilers until I’ve calculated the value, but it’s fun to try to guess the curves for volume and area of n-balls based on:
D \mid D \in \left [ 0..3 \right ]
If you like a challenge, try to intuit the curve shapes for area and volume using only the identities for area and volume up to 3D without looking up how it scales with D.
The discussion started when the user vanehe08 raised a question about the similarity score obtained between two semantically different sentences using OpenAI’s text-embedding-ada-002 model being higher than expected. This raised concerns about how the model generates embeddings.
Curt.kennedy noted a similar experience where the engine did not have a wide angular distribution and wondered whether the vector space was dedicated to other factors besides semantic similarity. He also mentioned that the embeddings seemed to focus more on the length of the text than on semantic similarity.
Ruby_coder elucidated that the embedding vectors might not directly interpret textual semantics, but could represent a model derived from a vast dataset. He highlighted the model’s performance by comparing embeddings of different texts and argued that the cosine similarities were within expectations considering the broad range of the internet’s global textual data.
However, others like curt.kennedy and ruby_coder still expressed concerns over the limited range of cosine similarity scores, which seemed only to use around 15% of the hypothetical range. They noticed that the most dissimilar texts still had positive correlation.
ruby_coder demonstrated alternative methods to the dot product or cosine similarity for comparing OpenAI embedding vectors, suggesting Euclidean distance as an alternative that might have a larger dynamic range.
debreuil also examined the embedding vectors and suggested looking into Euclidean distance as a potential measure with better dynamic range. But several users, including curt.kennedy, continued to note that the range of cosine similarities remained limited and questioned how the embedding space should be interpreted.
curt.kennedy listed a paper related to the isotropic nature of embeddings and suggested that application of Principal Component Analysis (PCA) can enhance the embeddings. They also provided a Python code script to implement this process.
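The original script isn’t reproduced here, but the usual PCA-based fix from that line of work (often called “all-but-the-top” post-processing) looks roughly like the sketch below; the number of components to strip is a tunable assumption.

```python
import numpy as np

def postprocess_embeddings(X: np.ndarray, n_components: int = 15) -> np.ndarray:
    """Center the embedding matrix and remove its projections onto the top
    principal components, which tend to encode corpus-wide, non-semantic
    directions; this can widen the angular spread between embeddings.
    X has shape (n_texts, dim); n_components is an assumption to tune."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Top principal directions via SVD of the centered matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_components]            # (n_components, dim)
    Xc = Xc - (Xc @ top.T) @ top       # strip the dominant components
    # Re-normalize so dot products behave like cosine similarities again
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
```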
raymonddavey weighed in on the conversation and offered insights about how embeddings behave across languages. They mentioned that if the language of the embedded text is not the same as the language of the question, the vectors might not match well. They recommended translating the question into the language of the documents to increase the accuracy of the results, and provided corresponding examples.
curt.kennedy came back arguing for the importance of a leading space when embedding text with ada-002. Including the leading space frames the text as appearing within the body of a document, rather than the rarer case where the text begins the document. He believed ensuring the leading space could improve dynamic range and potentially yield more contextually accurate embeddings.
I had forgotten for a while that you can also do completions on the embeddings models, and it seems like -ada-002 gives more than the absolute nonsense of GPT-3 embeddings sometimes.
Try this on your embeddings input: end it with two linefeeds.
On embedding just a few tokens, it improves the next-token completion output, but degrades it on longer sentences.
<what if god was one of us>{
"\n\n": -1.6626658,
",": -2.22747,
"...": -2.3320994,
"God": -2.4043431,
" God": -2.7341342
}
<what if god was one of us
>{
"God": -0.30231446,
" God": -2.3034446,
"One": -3.8759334,
"god": -4.5286603,
"\n\n": -4.627147
}
>>>
<with drops of Jupiter in her hair>{
",": -1.4668792,
"\n\n": -1.7572559,
"J": -3.0486264,
"D": -3.4795637,
" (": -3.5395741
}
<with drops of Jupiter in her hair
>{
"J": -0.83964837,
"a": -3.1410818,
"in": -3.1977024,
"j": -3.3355627,
"\n\n": -3.5863523
}
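For reference, top-logprob listings like the ones above can be pulled from the legacy completions endpoint along these lines. That an embeddings model name is accepted there at all is exactly the quirk being described, so treat this as a sketch of the experiment rather than a supported API usage.

```python
from openai import OpenAI

client = OpenAI()

def top_logprobs(prompt: str, model: str = "text-embedding-ada-002") -> dict:
    """Ask the legacy completions endpoint for the top-5 next-token logprobs."""
    resp = client.completions.create(
        model=model, prompt=prompt, max_tokens=1, logprobs=5, temperature=0
    )
    return resp.choices[0].logprobs.top_logprobs[0]

print(top_logprobs("what if god was one of us"))    # no trailing linefeed
print(top_logprobs("what if god was one of us\n"))  # with a trailing linefeed
```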