Search query language hints

Hello! We are building a search that accepts multilingual input (a query can be in any language) over English texts. We use the text-embedding-3-large model to compute the embeddings.
Usually it works quite well: if I query with several words in Swedish, the search result is correct. The problem appears when I search in Swedish with a single word that looks like an English word (a “false friend”). For example, with glass (“ice cream” in Swedish, “material or drinkware” in English), the model assumes the word is English and the result is not correct. How can I hint the language to the model without changing the meaning of the query?
I tested several options:

  • [LANG:SV] glass
  • this sentence is in swedish: glass
  • svenska: glass
  • <p lang="sv">glass</p>
  • … etc

None of these work. The returned results relate to Sweden or the Swedish language, but not to the search word itself (“glass” in my example).


Hi!
I gather that you are dealing with an edge case here. Assuming that you know the language of the single word search term, have you tried a prompt in that language?

From your failure cases I see that you used a hint written in English, such as “this sentence is in swedish: glass”. What if you switched the whole instruction to the language of the query?

Otherwise I am also surprised that the model does not pick up on the instructions. Which model are you using?


I tried writing the same phrase in Swedish, but the results were no better.
I use the text-embedding-3-large model.

Embeddings are based on the semantics of an AI language model; this is not a word search with a “language” parameter.

The embeddings are built from reward model training on large data, just like the AI that talks to you, and from the model's understanding of large passages of text as a whole. Embeddings are the internal concept formation that is part of how the model operates.

Ask an AI to complete an input:

min glass, the best way to do it is to use a timer. Here’s
min glass är tom och jag är ensam

You can see it takes context to activate “aha, Swedish”.

If we move this up to full sentences and run an embeddings similarity search on them, we can compare each phrase against the first one.

— text-embedding-3-large
1.00000 - Kall glass är gott en varm dag
0.53430 - Barn älskar glass!
0.45912 - Cold ice cream is delicious on…
0.28334 - He poured the cream into a glass
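
For anyone who wants to reproduce a ranking like the one above, here is a minimal sketch of the scoring step. It assumes you already have one embedding vector per phrase (for example from the embeddings endpoint with text-embedding-3-large); the function names `cosine_similarity` and `rank_by_similarity` are my own, and the tiny vectors in the usage note are toy values, not real model output.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    candidates: Sequence[Tuple[str, Sequence[float]]],
) -> List[Tuple[float, str]]:
    """Score every (text, vector) candidate against the query, best match first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Usage with toy 2-dimensional vectors: `rank_by_similarity([1.0, 0.0], [("a", [0.0, 1.0]), ("b", [1.0, 0.0])])` puts "b" first, since it points in the same direction as the query.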

After the query phrase itself, the best match is the other Swedish sentence rather than the English translation of the same phrase, so language identification is paramount. This can be done by AI augmentation, such as “use this {Swedish} word in a {Swedish} language sentence, where the word is the focus.”
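
A rough sketch of that augmentation step, under some assumptions: the query language is already known, only single-word queries get expanded, and `complete` is a hypothetical callable that sends a prompt to whatever chat model you use and returns its text. None of these names come from a real library.

```python
from typing import Callable


def build_expansion_prompt(word: str, language: str) -> str:
    """Fill the prompt template with the word and its (already known) language."""
    return (
        f"Use this {language} word in a {language} language sentence, "
        f"where the word is the focus: {word}"
    )


def prepare_query(query: str, language: str, complete: Callable[[str], str]) -> str:
    """Return the text to embed: the query itself, or an expanded sentence.

    Multi-word queries are returned unchanged, because they already carry
    enough context. `complete` is a hypothetical stand-in for an LLM call.
    """
    words = query.split()
    if len(words) != 1:
        return query
    return complete(build_expansion_prompt(words[0], language)).strip()
```

You would then embed the returned sentence instead of the bare word. Whether the expanded sentence actually retrieves better English results would still need testing on your own data.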
