Languages supported by text-embedding-3-large

How can I find out what languages the new text-embedding-3-large embedding model supports? In particular, I am trying to find out if it supports Hebrew.

AFAIK, the embedding models are relatively language-agnostic, because they work with tokens, which are just chunks of UTF-8 bytes.

Have you tried encoding the Hebrew text to UTF-8 before passing it through the embedding model? I wonder if that would help.
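For what it's worth, Python strings are Unicode already, and the client library serializes them as UTF-8 JSON on the wire, so there is usually nothing to do by hand. A minimal sanity check that Hebrew text round-trips through UTF-8 losslessly (the sample string here is just an illustration):

```python
# Sanity check: Hebrew text encodes to and decodes from UTF-8 without loss.
s = "אל תלטף את הדורבן."  # "Don't pet the porcupine."
encoded = s.encode("utf-8")        # bytes, ready for any UTF-8 transport
decoded = encoded.decode("utf-8")  # back to the original string
print(decoded == s)  # True
```

If this prints True for text coming out of Solr, the encoding step is not the problem.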


The source text is created by my Apache Solr engine, which uses UTF-8 by default, but I'll double-check that. Thanks.


It's going to support it, but the question is: how well.

Some example code: Hebrew input, English translations, and a near miss.

from openai import OpenAI
import numpy as np

cl = OpenAI()

text = ["אל תלטף את הדורבן.", "אני אוהב/ת אייפון!"]
text += ["Don't pet the porcupine.", "I love iPhone!", "Avoid the platypus"]

for model in ["text-embedding-3-small", "text-embedding-3-large"]:
    try:
        out = cl.embeddings.create(input=text, model=model)
        print("\n---", model)
    except Exception as e:
        print(f"ERROR {e}")
        continue
    # Embeddings come back unit-normalized, so a dot product is the cosine similarity
    array = np.array([data.embedding for data in out.data])
    for compi, comp in enumerate(text[:2]):
        print("====", compi, comp, "====")
        for i, j in zip(text, array):
            print(f"{i}: {np.dot(array[compi], j):.5f}")

Gives us some data points:

--- text-embedding-3-small
==== 0 אל תלטף את הדורבן. ====
אל תלטף את הדורבן.: 1.00000
אני אוהב/ת אייפון!: 0.26534
Don't pet the porcupine.: 0.25681
I love iPhone!: 0.08965
Avoid the platypus: 0.25611
==== 1 אני אוהב/ת אייפון! ====
אל תלטף את הדורבן.: 0.26534
אני אוהב/ת אייפון!: 1.00000
Don't pet the porcupine.: 0.07188
I love iPhone!: 0.62643
Avoid the platypus: 0.02311
--- text-embedding-3-large
==== 0 אל תלטף את הדורבן. ====
אל תלטף את הדורבן.: 1.00000
אני אוהב/ת אייפון!: 0.30880
Don't pet the porcupine.: 0.28773
I love iPhone!: 0.01331
Avoid the platypus: 0.20583
==== 1 אני אוהב/ת אייפון! ====
אל תלטף את הדורבן.: 0.30880
אני אוהב/ת אייפון!: 1.00000
Don't pet the porcupine.: 0.01012
I love iPhone!: 0.58528
Avoid the platypus: 0.04751


3-small can't distinguish porcupine from platypus when comparing against the English sentence; 3-large does this much better.

Both models tend to score a same-language sentence on a different subject higher than the direct cross-language translation. This effect is not seen when comparing between Latin-script languages.

I haven't done an all-Hebrew analysis, as neither I nor most readers would understand the results. Out of curiosity, you can plug your own native-language texts into the quick script above, then embed for your application.


Thanks for this. What I'm hoping will work is embedding the translated English text alongside the Hebrew text, so that the more accurate similarity search would be matching English to English.

It's just an experiment right now, but if it succeeds I'll be able to use the models to let users query classic works of Jewish theology, many of which have yet to be translated into English.
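One way to sketch that bilingual setup: index an embedding for both the Hebrew source and its English translation, with both entries pointing at the same document, and rank by cosine similarity at query time. The vectors below are dummy 3-d stand-ins (no API calls), purely to show the ranking logic; document keys and dimensions are illustrative:

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Dummy vectors standing in for real embeddings; both index entries
# resolve to the same underlying Hebrew document.
index = {
    ("doc1", "he"): [0.9, 0.1, 0.0],  # Hebrew source text
    ("doc1", "en"): [0.1, 0.9, 0.1],  # its English translation
}

query_vec = [0.0, 1.0, 0.2]  # stands in for an embedded English query
best = max(index, key=lambda k: cosine_sim(index[k], query_vec))
print(best)  # ('doc1', 'en') -- the English-to-English match wins
```

Whichever entry wins, you return the same document, so the English translation acts purely as a retrieval bridge.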

Pretty darned close here! And this is still using text-embedding-ada-002!

This is a query of my vector store object for that document:

I believe that translation was done by either gpt-3.5-turbo-16k or gemini-pro.