OpenAI Embeddings - Multi-language

I am working on a project that requires embedding text in different languages. Can I rely on OpenAI embeddings across different languages?

My main question is about the similarity of the same sentence when it is embedded in different languages.

Is there any source I can refer to about this?

1 Like

From my own experience using embeddings, you can embed the data in whatever language and query it in a different language, and you will still get good results as long as you pass the retrieved text to the chat completions API for summarization. For example, my embedded data is in English and Japanese, and I can use a query in, say, Spanish and still get my desired output.
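Here's a minimal sketch of that retrieval step, assuming the current openai Python SDK and the text-embedding-3-small model (the example sentences and the Spanish query are made up for illustration):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts, model="text-embedding-3-small"):
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

# Corpus in mixed languages, query in a third language
docs = [
    "The warranty covers two years of normal use.",   # English
    "保証は通常使用で2年間有効です。",                    # Japanese
]
query = "¿Cuánto dura la garantía?"                    # Spanish

doc_vecs = embed(docs)
q_vec = embed([query])[0]

# OpenAI embeddings come back unit-length, so a dot product is cosine similarity
scores = doc_vecs @ q_vec
print(sorted(zip(scores, docs), reverse=True))
```

The top-scoring chunk is what you would then hand to the chat completions API for the actual answer.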

It really depends on whether the goal is to measure similarity regardless of language (bitext matching), or whether your goal is an ancillary task such as database retrieval.

LaBSE, for example, tops the charts on measuring similarity between translated language passages, but ranks far below text-embedding-ada-002 for retrieval (matching short texts to long ones, for example).
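For the bitext case, here is a rough sketch of how you might check translated-pair similarity with LaBSE via the sentence-transformers library (the sentence pairs are only illustrative; ranking claims like the one above need a proper benchmark rather than a snippet like this):

```python
from sentence_transformers import SentenceTransformer, util

# LaBSE is trained specifically to map translations close together
model = SentenceTransformer("sentence-transformers/LaBSE")

pairs = [
    ("The cat sat on the mat.", "Die Katze saß auf der Matte."),       # EN / DE translation
    ("The cat sat on the mat.", "Der Aktienmarkt fiel heute stark."),  # EN / unrelated DE sentence
]

for a, b in pairs:
    emb = model.encode([a, b], normalize_embeddings=True)
    print(f"{util.cos_sim(emb[0], emb[1]).item():.3f}  {a!r} <-> {b!r}")
```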

1 Like

My use case is basically a multi-language RAG system. There can be a mismatch between the language of the query and the language of the retrieved document.

From experience, most people would probably say that it wouldn’t be a problem.

Here are some numbers from another thread:

[Image: table of cosine similarity scores between the embedded passages described below]

These compare Bible passages (Mark chapters 1 and 2 from different English translations) to each other. For your particular question, mark1_lu17 is interesting because it's German; all the other ones are English.

Here are the conclusions from this tiny experiment:

  1. If you have a multi-language corpus, you may get a higher score out of a slightly worse match if it's in the same language as the query

  2. If your corpus is all in the same language, you might generally get slightly worse scores, but the rank order likely won't be affected

Here are the takeaways:

  1. It’s probably best to have your corpus in a single language, even if it’s the wrong language, rather than having a patchwork of languages in your corpus (i.e. translate them all to English before inserting)

  2. If you're using a similarity cutoff, you may need different cutoffs for different languages (see the sketch below).
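As a sketch of point 2, the cutoff could be keyed on the document's (or query's) language. The thresholds below are placeholders you would have to tune on your own data, not measured values:

```python
# Hypothetical per-language similarity cutoffs (placeholder numbers, tune on your data)
CUTOFFS = {"en": 0.40, "de": 0.35, "ja": 0.33}
DEFAULT_CUTOFF = 0.35

def keep_hit(score: float, doc_lang: str) -> bool:
    """Keep a retrieval hit only if it clears the cutoff for its language."""
    return score >= CUTOFFS.get(doc_lang, DEFAULT_CUTOFF)

# Example: (cosine score, language tag, text) tuples coming back from the vector DB
hits = [(0.42, "en", "chunk A"), (0.37, "de", "chunk B"), (0.31, "ja", "chunk C")]
filtered = [h for h in hits if keep_hit(h[0], h[1])]
```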

As usual, since this is AI, your mileage may vary.

Hope this helps!

1 Like

This is a seriously interesting thread… I guess it comes down to one question. Does semantic similarity actually revolve around the underlying “meaning” as humans would describe it, or around something else? Is “Balle” the same in an embedding as “ball”?

2 Likes

This is all very context-sensitive.

edit: forgot baseball runeballe

1 Like

So… that kinda hints at it NOT being semantic similarity, but some other domain completely, where similarity in meaning is tied to that particular language… That has some interesting implications.

OK, so what about common typos of “baseball”? Are they in a similar “place” in latent space?

Well, it depends on how it's misspelled.

Edit: interpretation: as you can see, you get these blocks of alignment. “basedball” is closer to a ballroom than to a game. Of course, this is highly artificial data.
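If you want to reproduce this kind of comparison yourself, a quick sketch along these lines works (the exact numbers depend on the embedding model, and single words carry very little context, so don't read too much into them):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()
words = ["ball", "Balle", "baseball", "basedball", "ballroom"]
resp = client.embeddings.create(model="text-embedding-3-small", input=words)
vecs = np.array([d.embedding for d in resp.data])

# Pairwise cosine similarities (the vectors are already unit length)
sim = vecs @ vecs.T
for word, row in zip(words, sim):
    print(f"{word:10s}", np.round(row, 3))
```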

1 Like

Define “semantic” XD. Sometimes it takes a whole book to define a single concept.

In another thread we hypothesized about semantic focal planes: more complex models (like the davinci embeddings) might be more capable of distinguishing complex semantic concepts, but might perform worse at simple word or sentence embeddings.

Jeeesh, yup, I see a whole new area of many many rabbit holes coming into range…

1 Like

I mean… humans can do this… at least those who are multilingual… so it's not an impossible task for a neural network… it's finding out HOW… damn neuromorphics again…

Easy! You just place your (text, image, smell) into a 500- to 100,000-dimensional pinboard. If you can get multiple modalities to share the same space, you have multimodality! Easy peasy lemon squeezy, no neuromorphics required :laughing:

1 Like

But yes, multimodality and multilingualism could in theory be considered similar (or maybe even the same) mechanisms.

I think it's not semantic meaning at all, but it's closer to semantics than to keywords.

Here’s what I see going on, and hopefully this describes the situation …

So these LLMs are a series of hidden layers, matrices, etc.; think of each hidden layer's output as a vector.

The embedding is generated from the last hidden layer, as floats, but before the final bias and activation functions are applied. So they take this layer (a vector), normalize it back out to unit length, and give you the vector: your embedding.

What does this layer mean?

It's the vector state representing the next-token prediction, given all the previous tokens fed into it.

So the vector is “responding” to your text, understanding it, but isn’t saying anything since it’s an embedding model. It’s not allowed to say anything back.

So the embedding is the final hidden layer, the last state that would then produce a result (the next token), but stopped just short of that. It's a frozen internal “understanding” of what you fed the model …
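OpenAI doesn't publish the internals of its embedding models, so as a rough illustration of that "pool the last hidden state and normalize it" pattern, here is what it looks like with an open model (mean pooling is just one common choice, and this is not OpenAI's actual pipeline):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any open encoder works for illustration
name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

text = "Where did the ball go?"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

hidden = out.last_hidden_state                 # shape: (1, seq_len, hidden_dim)
pooled = hidden.mean(dim=1)                    # pool the final layer into one vector
embedding = torch.nn.functional.normalize(pooled, dim=-1)  # unit length, like an API embedding
print(embedding.shape)
```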

So is this the same as semantics and meaning?

No, not really. So what is it?

I think it's like this: if someone said something to you, how would you feel about it? What would you say back based on your “training”, a.k.a. your life experiences? What if you could freeze your neural state at that moment and send it out …

It's this internal thought, or state, that is extracted. This isn't semantics; it's more of a snapshot of internal thoughts about what was said, what's in the buffer. The LLM's “thoughts” are the integrated excitation of the states of all the tokens being run through it.

These are then forced into chokepoints, and the most significant point of understanding is the final layer, which is essentially extracted, normalized and spit out.

Anyway, this is what I see going on.

2 Likes

Whether it’s semantic space, meaning space, or thought space is just a semantic argument imo. We’ve used the term semantic space since the days of word embeddings, so I’d stick with that.

3 Likes

To me the embedding vectors represent some sort of internal model state that is highly correlated with the meaning of the text.

Looking at the definition of semantics, yeah, maybe this is semantics :rofl:

2 Likes

Hmm, great description of an embedding, btw, and I get your point. However… if I think of a ball or a Balle, or I read either word, my mind creates an image of a ball; to me they mean the same thing.

So is that just experience? Is that the cross-linking of dissimilar embeddings in my brain to one common visual model?

My issue is that I don't speak another language fluently enough for it to be “muscle memory”, so I have no reference point.

Is this why new languages are so hard to learn? All the semantic cross-linking that needs to happen before it makes any sense :thinking:

I would expect the network to become … (lmao, I'm laughing at my word choice here) ball-adjacent when looking at the word ball or Balle… so I'd kind of expect any frozen state vector to point to the ball space, or at least to bally things.

1 Like

The knowledge of a second language kind of feels like it lives in a different part of your brain. It takes more work to think about the English version of the word or how it would be written in English than just to understand the concept.


This topic still needs a concrete application to test against or answer questions about.

Do you want to fill the vector database with Italian coffee-machine manuals and have the AI answer questions in English from the retrieved Italian text?
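For example, the answering step of that pipeline might look roughly like this (the model name, the Italian chunk, and the question are placeholders; the retrieval step is the cosine-similarity lookup sketched earlier in the thread):

```python
from openai import OpenAI

client = OpenAI()

# Assume retrieval returned this Italian chunk for an English question
retrieved = "La garanzia copre due anni dalla data di acquisto."  # "The warranty covers two years from the date of purchase."
question = "How long is the warranty on this coffee machine?"

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; use whichever chat model you have access to
    messages=[
        {"role": "system", "content": "Answer in English, using only the provided context."},
        {"role": "user", "content": f"Context:\n{retrieved}\n\nQuestion: {question}"},
    ],
)
print(resp.choices[0].message.content)
```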

1 Like