How can I find out what languages the new text-embedding-3-large embedding model supports? In particular, I am trying to find out if it supports Hebrew.
AFAIK the embedding models are relatively language-agnostic, because they work with tokens, which are just chunks of UTF-8 bytes.
Have you tried encoding the Hebrew text to UTF-8 before passing it through the embedding model? I wonder if that would help.
The source text is created by my Apache Solr engine, which uses UTF-8 by default, but I'll double-check that. Thanks.
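For what it's worth, a quick round-trip check in Python (the Hebrew string is just a sample, not from your index) will confirm nothing is lost in UTF-8 encoding before the text reaches the API:

```python
# Sanity check: confirm Hebrew text survives a UTF-8 round trip.
text = "אל תלטף את הדורבן."  # "Don't pet the porcupine."
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
assert decoded == text  # lossless round trip
print(len(text), len(encoded))  # Hebrew letters take 2 bytes each in UTF-8
```

Note that the OpenAI Python client accepts ordinary `str` objects directly; the UTF-8 encoding of the request body is handled for you.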
It's going to support it, but the question is: how well.
Some example code: Hebrew inputs, their English translations, and a deliberate near miss.
import numpy as np
from openai import OpenAI

cl = OpenAI()

# Hebrew inputs, their English translations, and a near-miss distractor
text = ["אל תלטף את הדורבן.", "אני אוהב/ת אייפון!"]
text += ["Don't pet the porcupine.", "I love iPhone!", "Avoid the platypus"]

for model in ["text-embedding-3-small", "text-embedding-3-large"]:
    try:
        out = cl.embeddings.create(input=text, model=model)
        print("\n---", model)
    except Exception as e:
        print(f"ERROR {e}")
        continue  # skip scoring if the API call failed
    array = np.array([data.embedding for data in out.data])
    for compi, comp in enumerate(text[:2]):
        print("====", compi, comp, "====")
        for i, j in zip(text, array):
            print(f"{i}: {np.dot(array[compi], j):.5f}")
Gives us some data points:
--- text-embedding-3-small
==== 0 אל תלטף את הדורבן. ====
אל תלטף את הדורבן.: 1.00000
אני אוהב/ת אייפון!: 0.26534
Don't pet the porcupine.: 0.25681
I love iPhone!: 0.08965
Avoid the platypus: 0.25611
==== 1 אני אוהב/ת אייפון! ====
אל תלטף את הדורבן.: 0.26534
אני אוהב/ת אייפון!: 1.00000
Don't pet the porcupine.: 0.07188
I love iPhone!: 0.62643
Avoid the platypus: 0.02311
--- text-embedding-3-large
==== 0 אל תלטף את הדורבן. ====
אל תלטף את הדורבן.: 1.00000
אני אוהב/ת אייפון!: 0.30880
Don't pet the porcupine.: 0.28773
I love iPhone!: 0.01331
Avoid the platypus: 0.20583
==== 1 אני אוהב/ת אייפון! ====
אל תלטף את הדורבן.: 0.30880
אני אוהב/ת אייפון!: 1.00000
Don't pet the porcupine.: 0.01012
I love iPhone!: 0.58528
Avoid the platypus: 0.04751
Analysis:
3-small can't tell the Hebrew porcupine sentence apart from the platypus distractor when comparing against English (0.25681 vs 0.25611).
3-large does that much better (0.28773 vs 0.20583).
Both models tend to score a same-language sentence on a different topic higher than the direct translation in the other language; this effect doesn't show up when comparing between Latin-script languages.
I didn't run an all-Hebrew comparison, since neither I nor most readers would understand the results. Out of curiosity, you can drop your own native-language texts into the quick script above, and then embed the texts from your actual application.
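One caveat on the script: np.dot only equals cosine similarity because the API returns unit-length embeddings. If you adapt the script to vectors from other sources, a defensive cosine helper (a minimal pure-NumPy sketch) avoids surprises:

```python
import numpy as np

def cosine(a, b):
    # Explicit cosine similarity; equals np.dot(a, b) when both are unit-length.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical direction -> 1.0, orthogonal -> 0.0, regardless of magnitude.
print(cosine([3.0, 0.0], [7.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 5.0]))  # 0.0
```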
Thanks for this. What I am hoping will work is embedding the translated English text alongside the Hebrew text, so that the more accurate similarity search ends up matching English to English.
It's just an experiment right now, but if it's successful I would be able to use the models to let users query classic works of Jewish theology, many of which have yet to be translated into English.
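For illustration, here's a tiny sketch of how that dual-language scoring could work; the names are hypothetical and the toy vectors stand in for real embeddings. Each document keeps two embeddings, one for the Hebrew source and one for its English translation, and a query is scored against both, taking the better match:

```python
import numpy as np

def best_score(query_vec, doc_vecs):
    # doc_vecs: unit-length embeddings for one document (Hebrew + English)
    return max(float(np.dot(query_vec, v)) for v in doc_vecs)

# Toy unit vectors standing in for real embeddings:
he_vec = np.array([1.0, 0.0])   # Hebrew-source embedding
en_vec = np.array([0.6, 0.8])   # English-translation embedding
query  = np.array([0.0, 1.0])   # English query embedding

print(best_score(query, [he_vec, en_vec]))  # 0.8 -- the English side matches better
```

Taking the max means an English query can still surface a document whose Hebrew embedding is a poor cross-language match, as long as the translation side lines up.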