Yes. In natural languages, it’s also how multilingual speakers think when they are asked what one word means in another language. They basically subdivide both languages into Voronoi tessellations, and then determine where there is a correspondence between common regions. What I am interested in is much more fundamental, and I am not sure it would be of interest since it is so abstract. I believe that “Language” is a priori, like the laws of physics (and part of the laws of physics). I am interested in the nature of the objects that Shannon et al. error-correct on or for. NL is an instantiation of this. If I am being too crazy, I fully understand backing out of this discussion. Otherwise, it’s enormous fun.
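To make the Voronoi-matching picture a bit more concrete, here’s a rough toy sketch (random vectors standing in for real multilingual embeddings, and a nearest-centroid partition, which is just one way to induce the Voronoi cells; the region count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(x):
    # project rows onto the unit sphere
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for normalized word embeddings in two languages
# (in practice these would come from a multilingual embedding model).
lang_a = unit(rng.normal(size=(500, 64)))
lang_b = unit(rng.normal(size=(500, 64)))

def voronoi_regions(points, k, iters=20):
    # Spherical k-means: each centroid induces a Voronoi cell
    # (the set of points for which it is the nearest centroid).
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        cells = np.argmax(points @ centroids.T, axis=1)  # cell membership
        centroids = unit(np.stack([
            points[cells == i].mean(axis=0) if np.any(cells == i) else centroids[i]
            for i in range(k)
        ]))
    return centroids

cent_a = voronoi_regions(lang_a, k=10)
cent_b = voronoi_regions(lang_b, k=10)

# Correspondence between regions: for each cell in language A,
# the most similar cell in language B by centroid cosine similarity.
print(np.argmax(cent_a @ cent_b.T, axis=1))
```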
As a technical note, since the embeddings are all normalized, wouldn’t convexity of the pseudo-space they inhabit be implicit?
Anyway, here are some resources you might appreciate that are related to this discussion. You’ve probably already read some of them, but there might be a few new ones in this list.
- [1810.04882] Towards Understanding Linear Word Analogies
- [1901.09813] Analogies Explained: Towards Understanding Word Embeddings
- [2011.05864] On the Sentence Embeddings from Pre-trained Language Models
- [2306.08221] Contrastive Loss is All You Need to Recover Analogies as Parallel Lines
- [2310.17611] Uncovering Meanings of Embeddings via Partial Orthogonality
- [2403.03867] On the Origins of Linear Representations in Large Language Models
- [2406.01506] The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Depends on how you define language, I guess. I can definitely see that some sort of universality might be at play here, and I’d definitely be curious to hear what you find.
well, no. I don’t think so.
I’ll define convexity as requiring continuity. Assuming a “true”/“natural” cosine similarity exists, I’ll assume it would have some sort of practically infinite precision. As such, to construct a polytope of N items (your “embedding reference map”), you’d need an (N-1)-dimensional space to fit it into. (Yes, there’s a limit to the domain, but I think it’s in the 10e6+ dims.)
Now, some of this precision is noise. So you can reduce the dimensionality of your representation space until you hit your noise floor. Up to that point, the space should be universally convex for the domain. But if you go further than that, you start clipping and introducing discontinuities.
My guess is that somewhere at the edge, near underrepresented concepts, the arithmetic might start breaking down due to this clipping, indicating that you’re no longer in a convex space. You’ll see this with these Matryoshka embeddings if you go way too low. I could of course be wrong, but I suspect that we’re not out of the woods with these 10e3/10e4 dims.
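As a rough illustration of the clipping intuition (random unit vectors as stand-ins for real embeddings, and plain prefix truncation as a crude stand-in for Matryoshka-style reduction), the pairwise cosines drift further and further from the full-dimensional ones as you drop dims:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy "full precision" embeddings: N items in d dimensions.
N, d = 200, 1024
emb = unit(rng.normal(size=(N, d)))
full_sims = emb @ emb.T  # reference pairwise cosine similarities

# Keep only the first k dimensions, re-normalize, and measure how far
# the pairwise cosines drift from the full-dimensional reference.
for k in (1024, 256, 64, 16, 4):
    trunc = unit(emb[:, :k])
    drift = np.abs(trunc @ trunc.T - full_sims).max()
    print(f"dims={k:4d}  max cosine drift={drift:.3f}")
```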
Thanks for the papers; I haven’t seen the last one yet.
Thank you, Diet, elmstedt. Just as a side note, regarding tools: could we put together a group and build tools so that this is not just a bulletin board, but actually uses AI in the process?
For example, the suggestion by elmstedt regarding papers, especially the last one, is very interesting. Maybe such a tool already exists, but if it doesn’t, we could build it and make it more powerful by integrating LLMs, for example, to make what we are all discussing in these threads more productive.