I’d like to add document search with text-embedding-ada-002, but I need support for English and German, and ideally also Spanish, French, Italian and Portuguese. I couldn’t find any information about this, but does the model support languages other than English?
GPT models were trained on a massive set of Internet data - not just English. I’d imagine that GPT-4 works for almost every language out there, except for very obscure or lost languages. Ada is obviously far less capable, so it will be less accurate, but yes, it’ll work for other languages too.
We built a large embedding database using French, English, German, Spanish and Portuguese sources for an academic research paper.
The embedding worked well in multiple languages
However, we kept track of the source language for each piece of text we embedded
Then when we ran the final query, we asked the question in the same language. We found that if you embed in one language and query in another, the dot products are a bit skewed. But if you ask in the same language, the numbers come into line with each other.
In our case, we had a mix of source documents. So when we asked the final question(s), we converted the question into the 5 languages we knew we had. Then we ran the dot products over each of the sources that were in the matching languages (I hope that makes sense)
We took the top matches from each pass (ie semantic search in the native languages), and combined them into a single set (ie a mixed language result set). Then we sorted by the dot products to get the final top hits. This often resulted in a mix of languages
Once we did this, we sent the final query to GPT-4 (or GPT-3) and asked the question in English. Even though the sources were in mixed languages, the model managed to give us a combined answer from all the selected texts.
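In case a concrete picture helps, here is a rough Python sketch of that per-language retrieval pass. It is illustrative only: the names (corpus, translate_query, the language tags) are placeholders I’ve made up for the example, and it assumes the current OpenAI Python client rather than what we actually ran.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
LANGUAGES = ["en", "de", "fr", "es", "pt"]

def embed(text: str) -> np.ndarray:
    # text-embedding-ada-002 returns unit-length vectors, so a dot product
    # acts as a cosine similarity.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def translate_query(question: str, lang: str) -> str:
    # Placeholder translation step (the original workflow used Curie here;
    # any translation model will do).
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Translate into {lang}: {question}"}],
    )
    return resp.choices[0].message.content

def search(question: str, corpus: list[dict], per_lang: int = 5) -> list[dict]:
    """corpus items look like {"text": ..., "lang": "de", "embedding": np.ndarray}."""
    combined = []
    for lang in LANGUAGES:
        q_vec = embed(translate_query(question, lang))
        # Only score chunks whose source language matches the query language.
        scored = [
            {**chunk, "score": float(np.dot(q_vec, chunk["embedding"]))}
            for chunk in corpus if chunk["lang"] == lang
        ]
        # Keep the top matches from this language's pass.
        combined.extend(sorted(scored, key=lambda h: h["score"], reverse=True)[:per_lang])
    # Sort the mixed-language set by dot product to get the final top hits.
    return sorted(combined, key=lambda h: h["score"], reverse=True)
```

The top hits, which may be a mix of languages, then go to GPT-4 as context with the question asked in English.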
Ask questions if that didn’t make sense or you need clarification
This was a good insight. Thank you.
If I understand correctly, you have all the languages in the same namespace?
By skewed, what do you mean? If, for example, I had a database with English, German, and Portuguese, would a German query always score higher against other German text, or is it possible that it would rank other languages higher?
What other approaches have you tried? Do you scrub the queries?
If you take the dot product of an English embedding vs a German embedding, the value will be lower than the dot product of the same query computed English vs English.
Note: The following example is made up and the numbers are not real. But they are designed to explain the issue/resolution:
In other words if I have the following:
A: How are you (English)
B: Wie geht es dir (Same text - but German)
(They are the same thing - but in two different languages)
Now if I run a semantic search against these with an English query (e.g. “give me a greeting”) and take the dot products:
A might score 0.84 (English vs English)
B might score 0.78 (English vs German)
Now if I calculate the dot products against the German query (e.g. “gib mir einen Gruß”):
A will now score 0.78 (German vs English)
B will score 0.84 (German vs German)
So, as you can see, asking for a greeting in English gets a good hit for A (but not B)
and asking for a greeting in German gets a good hit for B (and not A)
But if we combine them, taking A from the English search and B from the German search, both end up with the same score.
Note: While the scores won’t be exactly equal, the differences between languages are small enough not to matter, and the results are SIGNIFICANTLY better than querying across languages. It isn’t perfect, but it gives good enough results.
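If you want to see the effect on real numbers, a quick check like the one below shows the pattern. The exact scores will differ from the made-up 0.84/0.78 figures above; embed() is just a small helper around ada-002.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # ada-002 vectors are unit-length, so a dot product is a cosine similarity.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

docs = {
    "A": embed("How are you"),        # English
    "B": embed("Wie geht es dir"),    # the same text in German
}
queries = {
    "en": embed("give me a greeting"),
    "de": embed("gib mir einen Gruß"),
}

for q_lang, q_vec in queries.items():
    for doc_id, d_vec in docs.items():
        print(q_lang, doc_id, round(float(np.dot(q_vec, d_vec)), 3))
# Expected pattern: the same-language pairs (en vs A, de vs B) score higher
# than the cross-language pairs, as in the made-up numbers above.
```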
Extra Info:
PS: When you mentioned namespaces, all of our texts were about Roman history, gathered from about 300 sources. They were combined into a single dataset and embedded as a single corpus. The queries were run in English, but the embedding checks were done in the 5 native languages (behind the scenes). We got Curie to do the translations of the query (not the embedding) as part of the workflow.
We enhanced our workflow by running multiple passes (20 in all) to query approximately 50,000 tokens’ worth of embedded data for each question. With GPT-4 we increased this to roughly 140,000 tokens.
Our queries asked GPT to update the answer from the previous pass, based on the new context supplied in each subsequent query (we did this 20 times). The quality of the answers was of academic standard.
We also constrained the model so it would only use citations included in the embedded sources and would not hallucinate references.
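That multi-pass refinement, including the citation constraint, looks roughly like the sketch below. The prompt wording and the shape of context_batches are made up for illustration; only the overall pattern of updating the previous answer with fresh context reflects what we actually did.

```python
from openai import OpenAI

client = OpenAI()

def refine_answer(question: str, context_batches: list[str], model: str = "gpt-4") -> str:
    # context_batches: the retrieved chunks split into (e.g.) 20 batches.
    answer = ""
    for batch in context_batches:
        prompt = (
            f"Question: {question}\n\n"
            f"Current draft answer:\n{answer or '(none yet)'}\n\n"
            f"New context:\n{batch}\n\n"
            "Update the draft answer using the new context. "
            "Only cite sources that appear in the supplied context."
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content
    return answer
```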