Anyway, I still think it’s a good idea to use a leading space before embedding. Without it, you are telling the model that the string is the start of a new line, which is not the general or expected case for most text.
Examples?
I am getting good correlations if I put in both a leading and a trailing space.
Here is my Good/Better/Best with a simple example:
Good:
Msg0 = " monkey ate a banana"
Msg1 = " banana"
Dot Product: 0.8202837430736987
Better:
Msg0 = "monkey ate a banana"
Msg1 = "banana"
Dot Product: 0.8666851902396663
Best:
Msg0 = " monkey ate a banana "
Msg1 = " banana "
Dot Product: 0.8848283562929536
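For anyone who wants to try this themselves, here is a minimal sketch of how such a comparison can be run. It assumes the current openai Python client and that ada-002 vectors come back (nearly) unit-length, so the dot product stands in for cosine similarity; exact scores will vary with the model version.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Return the ada-002 embedding for `text` as a numpy vector."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

# The three spacing variants from the post
pairs = [
    (" monkey ate a banana", " banana"),    # leading space only
    ("monkey ate a banana", "banana"),      # no padding
    (" monkey ate a banana ", " banana "),  # leading and trailing space
]
for msg0, msg1 in pairs:
    # ada-002 vectors are close to unit length, so dot product ~ cosine similarity
    print(repr(msg0), repr(msg1), float(np.dot(embed(msg0), embed(msg1))))
```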
I don’t think a trailing space makes sense. When does a space ever stand on its own? In a typical document, each space is joined with the word that follows it.
I am very curious about onBindViewHolder.
It’s a very common function in Android programming. The interesting part is that when it is compared to itself, but without the leading space, the distance is surprisingly large.
My initial thought was “typically it’s adapter.onBindViewHolder”, and that any occurrences of " onBindViewHolder" would be overrides.
BUT, the next strange thing is that Bitcoin, array_reset, break, FFT2 is a mix of programming and non-programming terms as well.
I don’t know if this means anything, but I tried pasting these values into the Playground with Davinci and Curie to see how they complete them, and in both cases the completions were related to programming.
This is because onBindViewHolder is a function call at the beginning of the line, with no indentation.
What this says is that the embedding engine is very sensitive to the expected positioning of things. Do they lead a line, end a line, or free-float within a line?
To be safe, in general, free-float might be the safest choice.
But in programming, beginning a line is probably the best choice.
Right, yeah that is insane. Maybe the model was undertrained on Android code? Similar to the mysterious ' petertodd' glitch token that it looks like OpenAI patched (ref).
Here is one from C, which I assume had tons of training data. The spacing around the item does seem to make sense: you expect int main() to appear only at the beginning of a line.
Let’s remember that each embedding is a vector. So here I keep the same vector for Msg0 and vary the other Msg1 vector.
So the wandering is with respect to the embedding of this second vector, and the model is trying to communicate the expected line positioning.
This is interesting!
But it tells me that, for straight-up sentences, you might be OK with no spacing on either side.
For word or sub-phrase searches, you might be better off with a leading and trailing space before embedding.
If anything, maybe add leading and trailing spaces to the query before embedding, and don’t worry about spacing on your data; just chop it exactly, with no added spacing. Does that make sense?
@Foxalabs You may not have to re-embed your 80 Gigs of data after all! Maybe just add some padding in the query? Maybe?
One theory I haven’t seen in skimming some of the petertoddology out there:
There is a fairly prominent GitHub user named petertodd associated with crypto, and the presence of this as a token in the tokenizer is almost certainly a result of him;
Crypto people tend to have their usernames sitting alongside varied cryptographic hashes all over the internet;
Cryptographic hashes are extremely weird things for a transformer, because unlike a person a transformer can’t just skim past the block of text; instead they sit there furiously trying to predict the next token over and over again, filling up their context window one 4e and 6f at a time.
So some of the weird sinkhole features of this token could result from a machine that tries to reduce entropy on token sequences, encountering a token that tends to live in strings of extremely high entropy.
I think a big reason why " onBindViewHolder" is such an outlier is that it is its own token ([58594]), vs the stripped version, which splits into three tokens ([263, 10154, 20867]).
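You can check the split yourself with tiktoken; cl100k_base is the encoding ada-002 uses. The token IDs in the comments are the ones reported above, not something re-verified here.

```python
import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode(" onBindViewHolder"))  # reported above as a single token: [58594]
print(enc.encode("onBindViewHolder"))   # reported above as three tokens: [263, 10154, 20867]
```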
I also want to agree with this theory:
Stuff like (woman - man) + king = queen works in embedding vector space.
However, the vector (woman-man) itself does not correspond to a word, it’s more something like “the contextless essence of femininity”. Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion how the results sometimes highlight implicit sexism in the language corpus).
Note that such vectors are closer to the average of all words, i.e. (woman - man) has roughly zero projection onto directions like “what language is this” or “is this a noun”, and onto most other directions in which normal words have large projections.
Based on this post, intuitively it seems the petertodd embedding could be something like “antagonist - protagonist” + 0.2 * “technology - person” + 0.2 * “essence of words starting with the letter n”…
so onBindViewHolder is not a combination of anything, because it’s its own …lonesome … thing
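As a rough sketch of the analogy arithmetic being described: whether it carries over cleanly from word2vec-style vectors to ada-002 sentence embeddings is an open question, and the `embed` helper plus the leading spaces are my own choices here.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    """Unit-normalized ada-002 embedding, so dot product == cosine similarity."""
    v = np.array(client.embeddings.create(
        model="text-embedding-ada-002", input=text).data[0].embedding)
    return v / np.linalg.norm(v)

# Classic word2vec-style analogy: king - man + woman ~ queen.
# The difference (woman - man) is not a word itself; it just shifts other
# concepts in a "feminine" direction, as described above. How well this
# holds for ada-002 is an empirical question.
analogy = embed(" king") - embed(" man") + embed(" woman")
analogy /= np.linalg.norm(analogy)

print("analogy . queen:", float(analogy @ embed(" queen")))
print("king    . queen:", float(embed(" king") @ embed(" queen")))
```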
Another loner (found in the comments) is " ForCanBeConvertedToForeach", which tokenizes to [80371].
This one is apparently a “glitch token”. Which " onBindViewHolder" isn’t.
This is super cool. So these tokens are commonly called “glitch tokens” and come in variations. Some are “unspeakables”, which cause ChatGPT to freeze up, and some are polysemantic (as seen above). The one common feature they all share (including " onBindViewHolder") is that each is a single token. Almost all of them relate to programming as well.
Would it make sense that all of these were tokenized, and then during the training process they were discarded and never seen by the model?
By giving it some delicious context, it jumps dramatically back: [' onBindViewHolder()', 'onBindViewHolder'] 0.8991457379833502
I think so. Without a leading space, it may be that the embedding model is already biased towards the start of each document. I still feel like a trailing space needs some testing, since[ typically][,][ tokens][ are][ split][ like][ this].
I agree that the trailing space doesn’t add as much value as the leading space.
Also let’s not forget about casing!
Maybe the best strategy is to lowercase the query and add a single leading space to the query string before embedding. But this is in the general language context, not code search, where you probably want to keep the casing (and with code, probably don’t put in the leading space, unless you are sure the code query strictly never starts a line; sounds like you need a classifier to predict this).
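In code, that query-side strategy might look like the small helper below. The `is_code` flag and its behavior are just my reading of the suggestion, not a tested recipe.

```python
def normalize_query(query: str, is_code: bool = False) -> str:
    """Prepare a query string for embedding, per the suggestion above.

    General language: lowercase and add a single leading space, so the query
    reads like mid-document text rather than the start of a line.
    Code search: keep the original casing and skip the leading space, since
    code identifiers often do start a line.
    """
    query = query.strip()
    if is_code:
        return query
    return " " + query.lower()

print(repr(normalize_query("Cosmic Phenomena")))        # ' cosmic phenomena'
print(repr(normalize_query("onBindViewHolder", True)))  # 'onBindViewHolder'
```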
The same <|endoftext|> just artificially makes them closer. You could use any token really. The goal is not to try and artificially make things closer!
It’s not really an outlier, except for the particular case that it was one of those that excretes nonsense from an AI. There’s plenty of code-specific tokens that ultimately can be traced back to one developer.
| 088335 | 018 | ‘.destroyAllWindows’ |
A fun phrase, but it won’t mean “destroy”. It is from OpenCV, a computer vision software package. In a normal word2vec-style embedding, with a million-plus tokens covering every single case, this would carry the semantics of the surrounding code and its inferred purpose.
I meant in regard to comparing it against itself with just whitespace in front, and how “far out” it is from everything else. Not that it’s a single token; quite literally that it’s an outlier.
Msg0 = " onBindViewHolder"
Msg1 = " Cosmic Phenomena - Stuff that exist outside of our Earth, or the Moon. This includes stars, aliens, and other extraterrestrial occurrences."
0.46412970174830337
That’s about 62° apart… which, if we interpret it as the apex angle of a hypercone (half-angle θ = 31° swept around a central axis), has mind-blowing geometric consequences…
I’ll need to do a bit of work to be able to calculate the exact result for the 1536-ball / 1535-sphere (going above d=1023 isn’t possible even with 64-bit double precision), but I’ve explored this geometry up to that and it’s bonkers.
It seems like a modest-sized cone in 3D, but in this many dimensions it is simultaneously absolutely vast and absolutely tiny. Took me a while to wrap my head around the result. I thought it was numerical precision errors accumulating (the calculations need things like the gamma function, which is just a factorial in even dimensions), but it turns out it was correct and provable analytically.
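Since the gamma-function route overflows 64-bit doubles well before d = 1536, here is one way to get at the same geometry with arbitrary precision. This is just the textbook spherical-cap identity evaluated numerically, not the exact result being teased below; mpmath and the 50-digit setting are my own choices.

```python
import math
from mpmath import mp, mpf, betainc, sin, radians

mp.dps = 50  # arbitrary precision sidesteps float64 overflow/underflow

dot = 0.46412970174830337
print(f"angle between the two embeddings: {math.degrees(math.acos(dot)):.2f} deg")  # ~62 deg

def cap_fraction(half_angle_deg: float, d: int = 1536):
    """Fraction of the unit (d-1)-sphere within `half_angle_deg` of a fixed
    direction (valid for half-angles up to 90 degrees), via the standard
    spherical-cap identity A(theta)/A_total = 0.5 * I_{sin^2 theta}((d-1)/2, 1/2)."""
    x = sin(radians(half_angle_deg)) ** 2
    return mpf("0.5") * betainc((d - 1) / 2, mpf("0.5"), 0, x, regularized=True)

# A 31-degree half-angle cone looks modest in 3D, but in 1536 dimensions it
# covers an almost unimaginably small fraction of the sphere.
print(cap_fraction(31.0))
```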
I won’t give spoilers until I’ve calculated the value, but it’s fun to try to guess the curves for volume and area of n-balls based on:
D \mid D \in \left [ 0..3 \right ]
If you like a challenge, try to intuit the curve shapes for area and volume using only the identities for area and volume up to 3D without looking up how it scales with D.
The discussion started when the user vanehe08 raised a question about the similarity score obtained between two semantically different sentences using OpenAI’s text-embedding-ada-002 model being higher than expected. This raised concerns about how the model generates embeddings.
Curt.kennedy noted a similar experience where the engine did not have a wide angular distribution and wondered whether the vector space was dedicated to other factors besides semantic similarity. He also mentioned that the embeddings seemed to focus more on the length of the text than on semantic similarity.
Ruby_coder elucidated that the embedding vectors might not directly interpret textual semantics, but could represent a model derived from a vast dataset. He highlighted the model’s performance by comparing embeddings of different texts and argued that the cosine similarities were within expectations considering the broad range of the internet’s global textual data.
However, others like curt.kennedy and ruby_coder still expressed concerns over the limited range of cosine similarity scores, which seemed only to use around 15% of the hypothetical range. They noticed that the most dissimilar texts still had positive correlation.
ruby_coder demonstrated alternative methods to the dot product or cosine similarity for comparing OpenAI embedding vectors, suggesting Euclidean distance as an alternative that might have a larger dynamic range.
debreuil also examined the embedding vectors and suggested looking into Euclidean distance as a potential measure with better dynamic range. But several users, including curt.kennedy, continued to note that the range of cosine similarities remained limited and questioned how the embedding space should be interpreted.
curt.kennedy listed a paper related to the isotropic nature of embeddings and suggested that application of Principal Component Analysis (PCA) can enhance the embeddings. They also provided a Python code script to implement this process.
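The original script isn’t reproduced here, but the usual PCA-based fix from that line of work (often called “all-but-the-top” post-processing) looks roughly like the sketch below; the number of components to strip is a tunable assumption.

```python
import numpy as np

def postprocess_embeddings(X: np.ndarray, n_components: int = 15) -> np.ndarray:
    """Center the embedding matrix and remove its projections onto the top
    principal components, which tend to encode corpus-wide, non-semantic
    directions; this can widen the angular spread between embeddings.
    X has shape (n_texts, dim); n_components is an assumption to tune."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Top principal directions via SVD of the centered matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_components]            # (n_components, dim)
    Xc = Xc - (Xc @ top.T) @ top       # strip the dominant components
    # Re-normalize so dot products behave like cosine similarities again
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
```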
raymonddavey weighed in on the conversation and offered insights about how embeddings behave across languages. They mentioned that if the language of the embedded text is not the same as the language of the question, the vectors might not match well. They recommended translating the question into the language of the documents to increase the accuracy of the results, and provided corresponding examples.
curt.kennedy came back arguing for the importance of a leading space when embedding text with ada-002. Including the leading space frames the text as appearing within the body of a document, rather than the rarer case where the text begins the document. He believed ensuring the leading space could improve dynamic range and potentially yield more contextually accurate embeddings.
I had forgotten for a while that you can also do completions on the embeddings models, and it seems like -ada-002 gives more than the absolute nonsense of GPT-3 embeddings sometimes.
Try this on your embeddings input: end it with two linefeeds.
On embedding just a few tokens, it improves the next-token completion output, but degrades it on longer sentences.
<what if god was one of us>{
"\n\n": -1.6626658,
",": -2.22747,
"...": -2.3320994,
"God": -2.4043431,
" God": -2.7341342
}
<what if god was one of us
>{
"God": -0.30231446,
" God": -2.3034446,
"One": -3.8759334,
"god": -4.5286603,
"\n\n": -4.627147
}
>>>
<with drops of Jupiter in her hair>{
",": -1.4668792,
"\n\n": -1.7572559,
"J": -3.0486264,
"D": -3.4795637,
" (": -3.5395741
}
<with drops of Jupiter in her hair
>{
"J": -0.83964837,
"a": -3.1410818,
"in": -3.1977024,
"j": -3.3355627,
"\n\n": -3.5863523
}
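For reference, top-logprob listings like the ones above can be pulled from the legacy completions endpoint along these lines. That an embeddings model name is accepted there at all is exactly the quirk being described, so treat this as a sketch of the experiment rather than a supported API usage.

```python
from openai import OpenAI

client = OpenAI()

def top_logprobs(prompt: str, model: str = "text-embedding-ada-002") -> dict:
    """Ask the legacy completions endpoint for the top-5 next-token logprobs."""
    resp = client.completions.create(
        model=model, prompt=prompt, max_tokens=1, logprobs=5, temperature=0
    )
    return resp.choices[0].logprobs.top_logprobs[0]

print(top_logprobs("what if god was one of us"))    # no trailing linefeed
print(top_logprobs("what if god was one of us\n"))  # with a trailing linefeed
```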