Ah, right, I realise I hadn’t quite understood what you’d done. So, you’re generating pairs of strings of random tokens up to length K, embedding them, and comparing the cosine similarities.
That’s an interesting experiment.
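If I’ve got it right, I imagine the setup looks something like this (just a sketch of my understanding - I’m assuming tiktoken’s cl100k_base tokeniser and the OpenAI embeddings endpoint, and the model name and K are placeholders, not necessarily what you used):

```python
# Sketch of my understanding of the experiment - the tokeniser, model name
# and K are assumptions/placeholders, not necessarily what was actually used.
import random

import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")
K = 50  # max number of random tokens per string (placeholder)

def random_token_string(max_len: int) -> str:
    """Build a string by decoding uniformly random token ids."""
    n = random.randint(1, max_len)
    # stay within the ordinary token range to avoid special/unused ids
    ids = [random.randrange(100000) for _ in range(n)]
    return enc.decode(ids)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1, s2 = random_token_string(K), random_token_string(K)
print(f"{s1!r} vs {s2!r}: {cosine(embed(s1), embed(s2)):.4f}")
```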
I’m still thinking about this.
I included a bunch of garbage strings in my experiments, assuming they would end up far away from each other. (This was motivated by the observation that the junk strings @anon22939549 found - " onBindViewHolder" and " segments_remain doseima meaning" - were not only the furthest away from all of my real sentences, but also from each other.) But if you look at the heatmap of pairwise separations I posted upthread, all my junk strings actually ended up quite close to each other, just not to the sensible strings. That seems more like your results?
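For anyone who wants to poke at the same thing, the heatmap was produced roughly like this (a sketch - the “sensible” strings below are placeholders, and the model name is an assumption; the two junk strings are the ones quoted above):

```python
# Rough reconstruction of a pairwise-separation heatmap over a mix of
# sensible and junk strings. The sensible strings are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

sensible = ["The cat sat on the mat.", "Stock markets fell sharply today."]
junk = [" onBindViewHolder", " segments_remain doseima meaning"]
texts = sensible + junk

E = embed_batch(texts)
E /= np.linalg.norm(E, axis=1, keepdims=True)  # normalise rows
sep = 1.0 - E @ E.T                            # pairwise cosine distance

plt.imshow(sep)
plt.colorbar(label="cosine distance")
plt.xticks(range(len(texts)), texts, rotation=90)
plt.yticks(range(len(texts)), texts)
plt.tight_layout()
plt.show()
```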
Maybe it isn’t “pure noise” in the sense you originally meant (i.e. random, uncorrelated) but more like “brown noise” or “pink noise”, i.e. your token strings have a specific, recognisable form to the model? It could be that constructing strings from random tokens produces - on average - strings that share a common sense of garbledness. If so, it would make sense for the model to interpret and embed them all in a similar way.
I would still expect longer texts to end up closer if they were broadly similar in theme (e.g. two paragraphs from the same article), but I would expect two long texts from very different domains to end up further apart (say, some JavaScript versus a Shakespeare sonnet of similar length).
One thing I just noticed is that Jake’s examples both start with a leading space, and I’m wondering if that might be a factor. (The GPT playground always used to warn against starting a prompt with a space - I don’t recall the exact warning, but it made the model behave badly.) I think I read somewhere that starting with certain other characters can cause issues too (e.g. “.”, I think - I’ll have to double-check).
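That should be easy enough to test - something like this, maybe (again just a sketch, with a placeholder model name, using one of the junk strings quoted above as the test case):

```python
# Quick probe: does a leading space (or a leading ".") move the embedding much?
# Model name is a placeholder; the test string is one of the junk examples above.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

base = embed("onBindViewHolder")
print("leading space:", cosine(base, embed(" onBindViewHolder")))
print('leading "." :', cosine(base, embed(".onBindViewHolder")))
```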
I think testing how the embeddings behave with token length needs to be motivated by more concrete and explicit hypotheses about semantic representation if it’s going to shed any light. Token length on its own seems too abstract to infer much from.
E.g. I want to know (thinking aloud):
- Do some bases represent “exotic features” that aren’t useful for my (and presumably most developers’) needs (such as word frequency etc.), and
- If so, which have human-interpretable meanings, and what are they? (“This is not a real word” could well be such a feature; so could “this is probably a typo”, which might be useful.)
- Does the encoder use an attention mechanism to do the embedding, so that it can encode related-but-distal features? (This would be necessary for some syntactic patterns in NLP and audio, and vital for 2D and 3D inputs like image and video.) It seems unlikely (bootstrapping problem), but earlier models could provide the attention weights, and this is a significant iteration on earlier embeddings, so it’s not inconceivable…
- How do specific linguistic features (semantic and morphosyntactic) get represented, and how sophisticated are they (word-level semantics, N-word level, clause level, sentence level, etc.)?
- How does this representation depend on/vary with things like position, text length, surrounding context and other factors?
- How do embeddings change as texts diverge? E.g. two texts that start the same but past a certain point become completely different things - I can’t think of a good example offhand, but the sort of thing where you think you’re reading one thing and realise it’s something completely different. (See the divergence sketch after this list.)
- Is the embedder smart enough to distinguish semantically critical punctuation, or is it mostly “soft on” punctuation? (“A man eating tiger” versus “A man-eating tiger” etc., though more ambiguous examples would be better - see the punctuation sketch after this list.)
- How much does the first token/word/character influence the embedding and are there pitfalls and gotchas around this?
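On the divergence question, one way to probe it would be to embed a shared opening with two different continuations and watch the similarity drop as the continuations grow. The texts here are made-up placeholders and the model name is an assumption:

```python
# Probe for the "texts that diverge" question: a shared prefix with two
# different continuations, measuring similarity as the continuations grow.
# All texts and the model name are placeholders/assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

prefix = "The meeting began as usual, with the minutes from last week. "
branch_a = "Then the finance report was presented and the budget was approved. " * 5
branch_b = "Then the lights went out and something began scratching at the door. " * 5

# Compare the two texts after adding progressively more of each continuation.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    a = prefix + branch_a[: int(len(branch_a) * frac)]
    b = prefix + branch_b[: int(len(branch_b) * frac)]
    print(f"continuation fraction {frac:.2f}: {cosine(embed(a), embed(b)):.4f}")
```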
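And on the punctuation question, the same helpers could be used to compare minimal pairs where the punctuation changes the meaning (again, the model name is a placeholder):

```python
# Probe for the punctuation question: do semantically critical punctuation
# changes move the embedding much? Model name is a placeholder.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = [
    ("A man eating tiger", "A man-eating tiger"),
    ("Let's eat, grandma", "Let's eat grandma"),
]
for s1, s2 in pairs:
    print(f"{s1!r} vs {s2!r}: {cosine(embed(s1), embed(s2)):.4f}")
```

If the similarity barely drops for pairs like those, that would suggest the embedder is treating the punctuation mostly as noise.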
I don’t know if any of these questions are relevant/interesting to your or other people’s projects? Feel free to expand the list if these aren’t directly relevant to you - it’s useful to share ideas.
Gruff