OpenAI Embeddings - Multi-language

I am working on a project that requires embedding text in different languages. Can I rely on OpenAI embeddings across different languages?

My main question is about the similarity of the same sentence when it is embedded in different languages.

Is there any source I can refer to about this?

1 Like

From my own experience using embeddings, you can embed the data in whatever language and query it in a different language, and you will still get good results as long as you pass the retrieved text to the chat completions API for summarization. For example, my embedded data is in English and Japanese, and I can use a query in, say, Spanish and still get my desired output.
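Here's a minimal sketch of that retrieval step, assuming the current openai Python SDK and the text-embedding-3-small model (the example sentences and the Spanish query are made up for illustration):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts, model="text-embedding-3-small"):
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

# Corpus in mixed languages, query in a third language
docs = [
    "The warranty covers two years of normal use.",   # English
    "保証は通常使用で2年間有効です。",                    # Japanese
]
query = "¿Cuánto dura la garantía?"                    # Spanish

doc_vecs = embed(docs)
q_vec = embed([query])[0]

# OpenAI embeddings come back unit-length, so a dot product is cosine similarity
scores = doc_vecs @ q_vec
print(sorted(zip(scores, docs), reverse=True))
```

The top-scoring chunk is what you would then hand to the chat completions API for the actual answer.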

It really depends on whether the goal is to measure similarity regardless of language (bitext matching), or whether your goal is an ancillary task such as database retrieval.

LaBSE, for example, tops the charts on measuring similarity between translated language passages, but ranks far below text-embedding-ada-002 for retrieval (matching short texts to long ones, for example).
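For the bitext case, here is a rough sketch of how you might check translated-pair similarity with LaBSE via the sentence-transformers library (the sentence pairs are only illustrative; ranking claims like the one above need a proper benchmark rather than a snippet like this):

```python
from sentence_transformers import SentenceTransformer, util

# LaBSE is trained specifically to map translations close together
model = SentenceTransformer("sentence-transformers/LaBSE")

pairs = [
    ("The cat sat on the mat.", "Die Katze saß auf der Matte."),       # EN / DE translation
    ("The cat sat on the mat.", "Der Aktienmarkt fiel heute stark."),  # EN / unrelated DE sentence
]

for a, b in pairs:
    emb = model.encode([a, b], normalize_embeddings=True)
    print(f"{util.cos_sim(emb[0], emb[1]).item():.3f}  {a!r} <-> {b!r}")
```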

1 Like

My use case is basically a multi-language RAG system. There can be a mismatch between the language of the query and the language of the retrieved document.

From experience, most people would probably say that it wouldn’t be a problem.

Here are some numbers from another thread:

[Image: table of cosine similarity scores between the embedded passages described below]

These compare Bible passages (Mark chapters 1 and 2 from different English translations) to each other. For your particular question, mark1_lu17 is interesting because it's German; all the other ones are English.

Here are the conclusions from this tiny experiment:

  1. If you have a multi-language corpus, you may get a higher score out of a slightly worse match if it's in the same language as the query

  2. If your corpus is all in the same language, you might generally get slightly worse scores, but the rank order likely won't be affected

Here are the takeaways:

  1. It’s probably best to have your corpus in a single language, even if it’s the wrong language, rather than having a patchwork of languages in your corpus (i.e. translate them all to English before inserting)

  2. If you're using a similarity cutoff, you may need different cutoffs for different languages (see the sketch below).
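As a sketch of point 2, the cutoff could be keyed on the document's (or query's) language. The thresholds below are placeholders you would have to tune on your own data, not measured values:

```python
# Hypothetical per-language similarity cutoffs (placeholder numbers, tune on your data)
CUTOFFS = {"en": 0.40, "de": 0.35, "ja": 0.33}
DEFAULT_CUTOFF = 0.35

def keep_hit(score: float, doc_lang: str) -> bool:
    """Keep a retrieval hit only if it clears the cutoff for its language."""
    return score >= CUTOFFS.get(doc_lang, DEFAULT_CUTOFF)

# Example: (cosine score, language tag, text) tuples coming back from the vector DB
hits = [(0.42, "en", "chunk A"), (0.37, "de", "chunk B"), (0.31, "ja", "chunk C")]
filtered = [h for h in hits if keep_hit(h[0], h[1])]
```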

As usual, since this is AI, your mileage may vary.

Hope this helps!

1 Like

This is a seriously interesting thread… I guess it comes down to one question. Does semantic similarity actually revolve around the underlying “meaning” as humans would describe it, or around something else? Is “Balle” the same in an embedding as “ball”?

2 Likes

This is all very context-sensitive.

edit: forgot baseball runeballe

1 Like

So… that kinda hints at it NOT being semantic similarity, but some other domain completely, where similarity in meaning is tied to that particular language… That has some interesting implications.

OK, so what about common typos of “baseball”? Are they in a similar “place” in latent space?

Well, it depends on how it's misspelled.

Edit: interpretation: as you can see, you get these blocks of alignment. “basedball” is closer to a ballroom than to a game. Of course, this is highly artificial data.
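If you want to reproduce this kind of comparison yourself, a quick sketch along these lines works (the exact numbers depend on the embedding model, and single words carry very little context, so don't read too much into them):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()
words = ["ball", "Balle", "baseball", "basedball", "ballroom"]
resp = client.embeddings.create(model="text-embedding-3-small", input=words)
vecs = np.array([d.embedding for d in resp.data])

# Pairwise cosine similarities (the vectors are already unit length)
sim = vecs @ vecs.T
for word, row in zip(words, sim):
    print(f"{word:10s}", np.round(row, 3))
```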

1 Like

Define “semantic” XD. Sometimes it takes a whole book to define a single concept.

In another thread we hypothesized about semantic focal planes: more complex models (like the davinci embeddings) might be more capable of distinguishing complex semantic concepts, but might perform worse at simple word or sentence embeddings.

Jeeesh, yup, I see a whole new area of many many rabbit holes coming into range…

1 Like

I mean… humans can do this… at least those who are multilingual… so it's not an impossible task for a neural network… it's finding out HOW… damn neuromorphics again…

Easy! You just place your (text, image, smell) into a 500- to 100,000-dimensional pinboard. If you can get multiple modalities to share the same space, you have multimodality! Easy peasy lemon squeezy, no neuromorphics required :laughing:

1 Like

But yes, multimodality and multilingualism could in theory be considered similar (or maybe even the same) mechanisms.

I think it's not semantic meaning at all, but it's closer to semantics than to keywords.

Here’s what I see going on, and hopefully this describes the situation …

So these LLMs are a series of hidden layers, matrices, etc.; think of each hidden layer's output as a vector.

The embedding is generated from the last hidden layer, as floats, but before the final bias and activation functions are applied. So they take this layer (a vector), normalize it back out to unit length, and give you the vector: your embedding.

What does this layer mean?

It's the vector state representing the next-token prediction, given all the previous tokens fed into it.

So the vector is “responding” to your text, understanding it, but isn’t saying anything since it’s an embedding model. It’s not allowed to say anything back.

So the embedding is the final hidden layer, the last state that would then produce a result (the next token), but stopped just short of that. It's a frozen internal “understanding” of what you fed the model …
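OpenAI doesn't publish the internals of its embedding models, so as a rough illustration of that "pool the last hidden state and normalize it" pattern, here is what it looks like with an open model (mean pooling is just one common choice, and this is not OpenAI's actual pipeline):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any open encoder works for illustration
name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

text = "Where did the ball go?"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

hidden = out.last_hidden_state                 # shape: (1, seq_len, hidden_dim)
pooled = hidden.mean(dim=1)                    # pool the final layer into one vector
embedding = torch.nn.functional.normalize(pooled, dim=-1)  # unit length, like an API embedding
print(embedding.shape)
```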

So is this the same as semantics and meaning?

No, not really. So what is it?

I think it's like this: if someone said something to you, how would you feel about it? What would you say back based on your “training”, a.k.a. your life experiences? What if you could freeze your neural state at that moment and send it out …

It's this internal thought, or state, that is extracted. This isn't semantics; it's more of a snapshot of internal thoughts about what was said, what's in the buffer. The LLM's “thoughts” are the integrated excitation of the states of all the tokens being run through it.

These are then forced into chokepoints, and the most significant point of understanding is the final layer, which is essentially extracted, normalized and spit out.

Anyway, this is what I see going on.

2 Likes

Whether it’s semantic space, meaning space, or thought space is just a semantic argument imo. We’ve used the term semantic space since the days of word embeddings, so I’d stick with that.

3 Likes

To me the embedding vectors represent some sort of internal model state that is highly correlated with the meaning of the text.

Looking at the definition of semantics, yeah, maybe this is semantics :rofl:

2 Likes

Hmm, great description of an embedding, btw, and I get your point. However… if I think of a ball or a Balle, or I read either word, my mind creates an image of a ball; to me they mean the same thing.

So is that just experience? Is that the cross-linking of dissimilar embeddings in my brain to one common visual model?

My issue is that I don't speak another language fluently enough for it to be “muscle memory”, so I have no reference point.

Is this why new languages are so hard to learn? All the semantic cross-linking that needs to happen before it makes any sense :thinking:

I would expect the network to become … (lmao, I'm laughing at my word choice here) ball-adjacent when looking at the word ball or Balle… so I'd kind of expect any frozen state vector to point to the ball space, or at least to bally things.

1 Like

The knowledge of a second language kind of feels like it lives in a different part of your brain. It takes more work to think about the English version of the word or how it would be written in English than just to understand the concept.


This topic still needs a concrete application to test against or answer questions about.

Do you want to fill the vector database with Italian coffee-machine manuals and have the AI answer questions in English from the retrieved Italian text?
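For example, the answering step of that pipeline might look roughly like this (the model name, the Italian chunk, and the question are placeholders; the retrieval step is the cosine-similarity lookup sketched earlier in the thread):

```python
from openai import OpenAI

client = OpenAI()

# Assume retrieval returned this Italian chunk for an English question
retrieved = "La garanzia copre due anni dalla data di acquisto."  # "The warranty covers two years from the date of purchase."
question = "How long is the warranty on this coffee machine?"

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; use whichever chat model you have access to
    messages=[
        {"role": "system", "content": "Answer in English, using only the provided context."},
        {"role": "user", "content": f"Context:\n{retrieved}\n\nQuestion: {question}"},
    ],
)
print(resp.choices[0].message.content)
```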

1 Like