Some questions about text-embedding-ada-002’s embeddings

Ah, right, I realise I hadn’t quite understood what you’d done. So you’re generating embeddings for pairs of strings made of random tokens, up to length K, and comparing their cosine similarities.

That’s an interesting experiment.

I’m still thinking about this.

I included a bunch of garbage strings in my experiments, assuming they would end up far away from each other, motivated by the observation that the junk strings @anon22939549 found (" onBindViewHolder" and " segments_remain doseima meaning") were not only the furthest away from all of my real sentences, but also from each other. But if you look at the heatmap of pairwise separations I posted upthread, all my junk strings actually ended up quite close to each other, just not to the sensible strings. That seems more like your results?
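
For anyone who wants to poke at this themselves, here is a minimal sketch of that kind of comparison (assuming tiktoken and the pre-1.0 openai Python package with the API key set in the environment; K, the number of junk strings and the sample sentences are just placeholders):

import numpy as np
import openai
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer behind ada-002
rng = np.random.default_rng(0)
K = 8  # maximum number of random tokens per junk string

# A few strings of random tokens, plus a couple of sensible sentences for contrast.
junk = [enc.decode(rng.integers(0, 100000, size=K).tolist()) for _ in range(4)]
texts = junk + ["The cat sat on the mat.", "Stock markets fell sharply today."]

resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
vecs = np.array([d["embedding"] for d in resp["data"]])

# ada-002 vectors come back unit length, so the dot product is the cosine similarity.
print(np.round(vecs @ vecs.T, 3))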

Maybe it isn’t “pure noise” in the sense you originally meant (i.e. random, uncorrelated) but more like “brown noise” or “pink noise”, i.e. your token strings have a specific recognisable form to the model? It could be that constructing strings from random tokens produces, on average, strings that share a recognisable sense of garbledness. If so, it would make sense for the model to interpret and embed them in a similar way.

I would still expect longer texts to end up closer if they were broadly similar in theme (e.g. two paragraphs from the same article), but I would expect two long texts from very different domains to end up further apart (say, some JavaScript versus a Shakespeare sonnet of similar length).

One thing I just noticed is that Jake’s examples both start with a leading space and I’m wondering if that might be a factor. (The GPT Playground always used to warn against starting a prompt with a space - I don’t recall the exact warning, but it made the model behave badly.) I think I read somewhere that starting with certain other characters can cause issues too (e.g. “.” I think - I’ll have to double-check).

I think testing how the embeddings behave with token length needs to be motivated by more concrete and explicit hypotheses about semantic representation if it’s going to shed any light. Token length on its own seems too abstract to infer much from.

E.g. I want to know (thinking aloud):

  1. Do some bases represent “exotic features” that aren’t useful for my (presumably most developers’) needs (such as word frequency etc.), and
  2. If so, which have human-interpretable meanings, and what are they? (“This is not a real word” could well be such a feature; so could “this is probably a typo”, which might be useful.)
  3. Does the encoder use an attention mechanism to do the embedding in order that it can encode related-but-distal features (this would be necessary for some syntactic patterns in NLP, audio, and vital for 2D and 3D inputs like image and video). It seems unlikely (bootstrapping problem) but earlier models could provide the attention weights, and this is a significant iteration on earlier embeddings, so it’s not inconceivable…
  4. How do specific linguistic features (semantic and morphosyntactic) get represented, and how sophisticated are they (word level semantics, N-word level, clause-level, sentence level etc.)?
  5. How does this representation depend on/vary with things like position, text length, surrounding context and other factors?
  6. How do embeddings change as texts diverge? (E.g. two texts that start the same but past a certain point become completely different things - I can’t think of a good example offhand, but the sort of thing where you think you’re reading one thing and realise it’s something completely different.)
  7. Is the embedder smart enough to distinguish semantically critical punctuation, or is it mostly “soft” on punctuation? (“A man eating tiger” versus “A man-eating tiger” etc., though more ambiguous examples would be better - a quick probe is sketched after this list.)
  8. How much does the first token/word/character influence the embedding and are there pitfalls and gotchas around this?
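
E.g. for question 7, the kind of quick probe I have in mind (a sketch, assuming the pre-1.0 openai package; the third sentence is just my attempt at an unambiguous paraphrase):

import numpy as np
import openai

texts = ["A man eating tiger", "A man-eating tiger", "A tiger that eats humans"]
resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
vecs = np.array([d["embedding"] for d in resp["data"]])

# Unit-length vectors, so dot products are cosine similarities.
sims = vecs @ vecs.T
print("'man eating' vs 'man-eating':", round(sims[0, 1], 3))
print("'man eating' vs paraphrase:  ", round(sims[0, 2], 3))
print("'man-eating' vs paraphrase:  ", round(sims[1, 2], 3))

If punctuation is handled “softly”, both phrasings should sit about equally close to the paraphrase; a gap would suggest the hyphen is doing semantic work.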

I don’t know if any of these questions are relevant/interesting to your or other people’s projects? Feel free to expand the list if these aren’t directly relevant to you. It’s useful to share ideas.

Gruff

1 Like

I just used a “Jake” in a forum example elsewhere but that’s not me :upside_down_face:

The tokens that the AI emits within sentences are mostly words that " begin" " with" " a" " space" - they’re very common. Maybe this was seen as a better encoding than ending with a space, since just as many endings are periods, commas or newlines. The space-first tokens are very common; I didn’t end with a space token. The AI could have self-trained on newlines followed by space-prefixed tokens within a single line.

I was going to “re-demonstrate” but I remember that I cranked up the penalties. davinci-002 repeats. More tokens on newlines starting with a space:
[screenshot: davinci-002 completion output]

The notable thing is how willing the AI is to emit a newline, period or comma after being fed no meaning. Nonsense = newline semantics?

“Repeating back nonsense” is the headspace of the untrained model. However, I use a list of common token numbers in the first 900 (of 100k), and with four of them we get a completion:

[screenshot: completion output]

For the language model the first n+1 is just an “L”, so there’s likely no semantic idea behind that 3% probable top token. It would be interesting if embeddings had a generator also.


One thing I was pondering about the unseen optimizations in those specific GPT-3 embeddings models is the attention-layer tuning of the transformer. That could improve quality, for example in a model meant for pulling similar vectors from long and short texts. Method: find not what should be completed, but what is in the middle.

And finally, one unrealized experiment: is an unfinished-sentence input better than one that evokes a meaning of “this passage is done being written”?

It puts the embedding in a different space when strange technical terms are used. Also the leading space seems to be key for whatever reason.

For example, I can get a cosine similarity less than 0.5 just using this pair:

Msg0 = " onBindViewHolder"
Msg1 = " Bitcoin array_reset break FFT2"
# Dot Product: 0.4928356289506114
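
(For anyone reproducing this: the number is just the dot product of the two unit-length vectors returned by the API. A sketch, assuming the pre-1.0 openai package:)

import numpy as np
import openai

Msg0 = " onBindViewHolder"
Msg1 = " Bitcoin array_reset break FFT2"

resp = openai.Embedding.create(model="text-embedding-ada-002", input=[Msg0, Msg1])
v0, v1 = (np.array(d["embedding"]) for d in resp["data"])

# ada-002 embeddings are unit-normalized, so dot product == cosine similarity.
print("Dot Product:", float(np.dot(v0, v1)))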

So, much less than Jake’s 0.63. Sorry @anon22939549 :crazy_face:

I’m thinking the “garbled stuff is all the same” theory is correct. So maybe ada-002 has so much dynamic range that our simple brains cannot see the whole thing. :rofl:

There is so much uncharted backscatter in the model, it appears.

But still for everyday mortals, the language fed into the model is very correlated.

@gruff As for your questions, I don’t know. What is your application, and why do you need to know all these details?

Anyway, maybe a good takeaway is if you want more dynamic range out of your embeddings, add a leading space to your string.

Seems arbitrary, but OpenAI has a few blurbs in the docs saying that starting strings with spaces is good. In fact, you had to do this when fine-tuning the original base models. Now it doesn’t appear to be the case, but maybe they insert them anyway on the backend without you knowing.

Yo Broseph, gimme some space! :rofl:

1 Like

I think that was Wolfram’s meaning when he quoted the 10^600 number. I don’t think he was specifically talking about ada at the time, but he was certainly speaking with OpenAI.

He was essentially saying that the search space of ideas able to be expressed/explored there was so large that the entirety of human interest would be a pinpoint, and that AIs, when sufficiently intelligent, may spend time exploring other places in that space for things we don’t see.

1 Like

OK, here is a quick follow-up on why I now think it’s important and crucial to make sure there is a leading space in everything you embed with ada-002.

Doing some quick research on the tokenizer reveals that when something starts without a space, it is treated as the beginning of a document, which puts it in a totally different frame in the embedding space compared to the normal (expected) meaning of the thing being embedded.

So since I am after the expected meaning, not the specific situation where this text blob starts a document, I have to put in the leading space.

Whether or not something starts a document even has separate tokens! And because the training data has likely seen fewer instances of your text STARTING a document, that puts the entire embedding in question IMO, since this is much rarer than the same text appearing inside the document after the start.

Let me demonstrate with the 100k tokenizer used by all the newer models:

Here is the no-space version (start of document):

[tokenizer screenshot: the word with no leading space]

Here is the same word appearing with a space in front, as you would expect elsewhere in the document, which is the much more common case:

[tokenizer screenshot: the same word with a leading space]

So it is represented with completely different tokens.
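
You can check this programmatically too (a quick sketch with tiktoken; "Bitcoin" is just an example word):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the 100k tokenizer

word = "Bitcoin"
print(enc.encode(word))        # no leading space
print(enc.encode(" " + word))  # with a leading space: different token id(s)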

So adding the extra space should improve the dynamic range of your embeddings, and I think it essentially boils down to this: when you add the space, you are exposing the string to its overall context and meaning, and not the specific (much rarer) case of it occurring at the very beginning of a document.

This new family of embeddings should have more dynamic range (more diversity in cosine similarity), and I would argue that is what people really want out of their embeddings, because most people are not only looking at the situation where the string starts a document.

So in any future embedding projects, I am going to make sure there is always a leading space in whatever I embed. This will produce higher diversity in the embedding vectors, and should theoretically impart more meaning and semantic relevance.

PS This also means that when you use tools like LangChain or services like Weaviate, if they apply .strip() to strings before they are embedded, you are LOSING OUT!

It should really be:

EmbedThis = f" {Original.strip()}"

2 Likes

Jeeesh… that might be absolutely huge… now I have to go re-embed about 80 gigs of data!

1 Like

So… would this also make sense for the query term? i.e. insert the space in front of the first word of a search term?

Yes, I would put the space in front of the query too. Basically make it part of your pre-processing code anytime you embed.

So first run .strip() on the string. Then resolve all whitespace to a single space:
Preprocessed = ' '.join(mystring.strip().split())
Then, on this output, produce:

EmbedThis = f" {Preprocessed}"
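
Wrapped up as a helper it might look like this (a sketch, assuming the pre-1.0 openai package; the function name is mine):

import openai

def embed_with_leading_space(text, model="text-embedding-ada-002"):
    # Collapse runs of whitespace, drop leading/trailing, then add one leading space.
    prepped = " " + " ".join(text.split())
    resp = openai.Embedding.create(model=model, input=prepped)
    return resp["data"][0]["embedding"]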

You could probably do this in one big ugly line of Python too. :sweat_smile: But I won’t go there.

But yes, put in that leading space on anything you plan on embedding.

Based on just playing around last night and this morning, it makes a huge difference in expressing the embedding correctly.

1 Like

I think it might mean the beginning of a paragraph, not a document (depending on how one defines a document)?

This makes sense, so it can distinguish words that begin sentences within a paragraph from words that open the paragraph.

That said, it may have different encodings for all sorts of starting sequences, given that it needs to represent code too.

Worth checking (using quasi-regex notation; a quick tiktoken check is sketched after the list):

  • “\n\s(word)”
  • “\n\n\s(word)”
  • “\t(word)”
  • “\t\t(word)”
    etc.
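
E.g. something like this with tiktoken (a sketch; “word” is a placeholder and I’m reading “\s” as a literal space):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for prefix in ["", " ", "\n ", "\n\n ", "\t", "\t\t"]:
    s = prefix + "word"
    print(repr(s), "->", enc.encode(s))
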
1 Like

If you want to have random fun and profit with AI, send <|endoftext|> as your first token when doing embeddings. That’s the secret sauce in some other models, but you’d think OpenAI would already optimize for such.
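
Whether that string reaches the model as the actual special token or just as literal characters is worth checking first; tiktoken will show you both readings (a sketch):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
s = "<|endoftext|> Bitcoin"

# Interpreted as the special token:
print(enc.encode(s, allowed_special={"<|endoftext|>"}))
# Interpreted as plain text (the default raises on special text, so allow nothing special):
print(enc.encode(s, disallowed_special=()))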

It depends on what your application does whether this will help.

The reason I think it might be start-of-document is that you will always have preceding whitespace in every paragraph except the first.

And I don’t think the model was exclusively trained on single paragraphs; it was likely trained on multiple paragraphs to get it to the 8k/16k/32k context windows.

So when I say “document” I really mean the start of the 8k (etc.) context window, not literally the start of a huge document.

UPDATE: It’s really the start of a new line, not start of window (see post below).

1 Like

[UPDATE: I misconstrued the meaning of the colours. In fact, adding a leading space seems to mean “beginning of a new sentence” and no leading space means “beginning of a new document” AFAICT]

Eeeyew.

I just did a couple of tests with the online tokenizer. It seems to be more nuanced than we thought. I’d hazard a guess that it depends on the word frequency and that if it’s a very frequent word, it encodes positional information (seems like a good idea) but obviously this leads to a combinatorial explosion, so this tactic is only used for the top N frequent words.

Are you basing this off the color?

You may want to use this instead:

https://tiktokenizer.vercel.app/

1 Like

What you are seeing in the first screenshot is that “A” followed by a spaced token at the start of a new line is always its own token. The newline characters aren’t part of the start of the BPE token; they are the end of what comes before, and can be joined with the previous ending.

Frequency in the corpus is directly related to the token number, though some of the 100k were manually placed.

number  len(chars)  text
000318  003         ');\n'
000341  002         ')\n'
000342  003         ' {\n'
000346  002         ',\n'
000383  003         '.\n\n'
000398  002         '>\n'
000402  003         ';\n\n'
000513  002         ':\n'
000518  002         '{\n'
000546  004         '();\n'
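
To reproduce that listing (a sketch with tiktoken; the ids are the ones from the table above):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for tid in [318, 341, 342, 346, 383, 398, 402, 513, 518, 546]:
    text = enc.decode([tid])
    print(f"{tid:06d} {len(text):03d} {text!r}")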

One can ponder that code was not pre-stripped of newlines, while book text might have been preprocessed to just words, so you have lots of code newlines.

Kinda, I just inferred the “\n” counts from the colour of the token for the word. I’m just trying to get a sense of how and when positional information is used in the encoding.

It seems it’s not just word frequency but possibly Part of Speech and Polysemy (how many meanings a word can take on). The ranks, POS and polysemy below are for English as a natural language and don’t take into account e.g. their use in code (especially “and” and “not” for example):

The color is simply based on the index.

[screenshots of the tokenizer output]

It would have been cool if they correlated the token id with the color though.

Side note: Adding a “.” in front of the word often combines them as well, possibly from programming?
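
Easy to check whether the “.” merges into the following token (a sketch with tiktoken; “append” is just a code-flavoured example):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["append", " append", ".append", ". append"]:
    print(repr(s), "->", enc.encode(s))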

1 Like

Thanks for pointing that out! I assumed the colour indicated the token.

I’ll repeat the tests :yum:

Thanks for that link too - very useful. It’s a shame they don’t use a consistent colour per distinct token either… but they do have a nice hover feature to check, so it’s an improvement on the OpenAI online tokenizer.

I was referring to Jake Elmstedt (@anon22939549) :slight_smile:

I don’t think you need to re-encode all your documents. With no leading space in the query, it will tend to look for documents starting with your query. Adding a leading space, it will tend to look for documents containing the query. So, depending on your app’s needs, you can either expose this as a search option, or create two query embeddings (one with the space) and combine the top-ranking results.

[UPDATE: I think you’ll need to run some experiments. See below.]
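
A minimal sketch of that two-query idea (assuming the pre-1.0 openai package and documents already embedded into a unit-length numpy array; names are placeholders):

import numpy as np
import openai

def search(query, doc_vecs, top_k=5):
    # Embed the query twice: as-is and with a leading space, then keep
    # the best score per document before ranking.
    variants = [query.strip(), " " + query.strip()]
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=variants)
    q = np.array([d["embedding"] for d in resp["data"]])
    scores = (q @ doc_vecs.T).max(axis=0)   # doc_vecs shape: (n_docs, 1536)
    return np.argsort(scores)[::-1][:top_k]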

I compared cosine similarities for queries and documents with and without leading spaces; adding a space to either the query or the document resulted in worse results (i.e. lower CS values, higher angle of separation).

Bit perplexed.

1 Like

What was this supposed to do? It just made the embedding worse. It took my 0.49 separation and blew it up to 0.72

Msg0 = "<|endoftext|> Bitcoin array_reset break FFT2"
Msg1 = "<|endoftext|> onBindViewHolder"
Dot Product: 0.7236468704384156