I’d like to add document search with text-embedding-ada-002 but need support for English, German and ideally also Spanish, French, Italian and Portuguese. I didn’t find any information about that but does the model support languages other than English?
GPT models were trained on a massive set of Internet data - not just English. I’d imagine that GPT-4 works for almost every language out there, except for very obscure or lost languages. Ada is obviously far less capable, so it will be less accurate, but yes, it’ll work for other languages too.
We built a huge embedding database using French, English, German, Spanish and Portuguese for an academic research paper.
The embedding worked well in multiple languages
However, we kept track of the source language for each piece of text we embedded
Then when we ran the final query, we asked the question in the same language. We found that if you embed in one language and query in another, the dot products are a bit skewed. But if you ask in the same language, the numbers come into line with each other.
In our case, we had a mix of source documents. So when we asked the final question(s), we converted the question into the 5 languages we knew we had. Then we ran the dot products over each of the sources that were in the matching languages (I hope that makes sense)
We took the top matches from each pass (ie semantic search in the native languages), and combined them into a single set (ie a mixed language result set). Then we sorted by the dot products to get the final top hits. This often resulted in a mix of languages
Once we did this, we sent the final query to GPT-4 (or 3) and asked the question in English. Even though the sources were in mixed languages, GPT-3 managed to give us a combined answer from all the selected texts.
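The per-language search and merge described above could be sketched roughly like this. This is only a sketch of the workflow as I understand it: `embed` is a stand-in for a call to text-embedding-ada-002, the query translations are assumed to be produced beforehand (e.g. via a translation model), and the function names are hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity; for unit-normalised ada-002 vectors this equals the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search_per_language(query_by_lang, corpus, embed, top_k=5):
    """Score each stored chunk only against the query translated into that
    chunk's own language, then merge the per-language result sets into one
    mixed-language ranking (hypothetical helper, sketching the post above).

    query_by_lang: {"en": "...", "de": "...", ...} - the same question per language
    corpus: list of {"text": ..., "lang": ...} dicts with the source language tracked
    embed: callable returning an embedding vector for a string
    """
    hits = []
    for lang, query_text in query_by_lang.items():
        q = embed(query_text)                      # query in this language
        for chunk in corpus:
            if chunk["lang"] != lang:              # same-language matching only
                continue
            score = cosine(q, embed(chunk["text"]))
            hits.append({**chunk, "score": score})
    # Combine all passes and sort by score - the top hits can mix languages.
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:top_k]
```

The key design point is that a chunk is only ever scored against the query in its own language, which avoids the cross-language skew described below.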
Ask questions if that didn’t make sense or you need clarification
This was a good insight. Thank you.
If I understand correctly, you have all the languages in the same namespace?
By skewed, what do you mean? If, for example, I had a database with English, German, and Portuguese, would a German request always score higher against other German text, or is it possible that it will rank other languages higher?
What other approaches have you tried? Do you scrub the queries?
If you do a dot product of an English embedding vs a German embedding, the value will be lower than the dot product of the same query English vs English.
Note: The following example is made up and the numbers are not real. But they are designed to explain the issue/resolution:
In other words if I have the following:
A: How are you (English)
B: Wie geht es dir (Same text - but German)
(They are the same thing - but in two different languages)
Now if I do a semantic search against this and compute the dot product against the English text (e.g. “give me a greeting”)
A might score 0.84 (English vs English)
B might score 0.78 (English vs German)
Now if I calculate the dot product against the German (e.g. “gib mir einen Gruß” - German for “give me a greeting”)
A will now score 0.78 (German vs English)
B will score 0.84 (German vs German)
So, as you can see, asking for a greeting in English gets a good hit for A (but not B)
and asking for a greeting in German gets a good hit for B (and not A)
But if we combine them (A from the English search, B from the German search), they have the same values.
Note: While this is not exactly true, the differences between languages are small enough not to matter, and the results are SIGNIFICANTLY better than before. Even though this is not perfect, it gives good enough results.
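The combine step, using the illustrative numbers from the example above (made up, not real ada-002 outputs), amounts to:

```python
# Illustrative scores from the example above (made up, not real ada-002 values).
english_pass = {"A": 0.84, "B": 0.78}   # English query scored against texts A and B
german_pass  = {"A": 0.78, "B": 0.84}   # German query scored against texts A and B

# Keep each text's score from the pass where query language == text language,
# so neither language is penalised by cross-language skew.
combined = {"A": english_pass["A"], "B": german_pass["B"]}
ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

After combining, A and B tie at 0.84, which is the “same values” result described above.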
PS: When you mentioned namespace - all of our texts were about Roman history, gathered from about 300 sources. They were combined into a single dataset and embedded as a single corpus. The queries were run in English, but the embedding checks were done in the 5 native languages (behind the scenes). We got Curie to do the translations of the query (not of the embedded text) as part of the workflow.
We enhanced our workflow by running multiple passes (20 in all) to query approx 50,000 tokens’ worth of embedded data for each question. On GPT-4 we increased this to roughly 140,000 tokens.
Our queries asked GPT to update the answer from the previous pass, based on the new context supplied in each subsequent query (we did this 20 times). The quality of the answers was of academic standard.
We also limited the AI so it would only use citations included in the source embedding and would not hallucinate.
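A minimal sketch of that multi-pass refinement loop might look like the following. This is an assumption about the structure, not the authors’ actual code: `ask_model` is a hypothetical callable wrapping a chat-completion request, and the prompt wording (including the citation restriction) is illustrative:

```python
def refine_answer(question, context_batches, ask_model):
    """Run one pass per batch of retrieved context, asking the model to
    update its previous answer with the new material (hypothetical helper;
    the thread describes 20 such passes over the retrieved embeddings)."""
    answer = ""
    for batch in context_batches:
        prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answer or '(none yet)'}\n"
            f"New context:\n{batch}\n"
            "Update the answer using ONLY citations present in the context. "
            "Do not invent citations."
        )
        answer = ask_model(prompt)  # e.g. a GPT-4 chat-completion call
    return answer
```

Each pass sees the previous answer plus a fresh slice of context, so the final answer accumulates evidence from all passes without exceeding the context window in any single call.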
Was this using text-embedding-ada-002 specifically for creating the embeddings of the multi-language database?
It was. The language difference appeared when embedding with ada-002.
Excellent work you have performed there. May I ask: did the dot products increase if you embedded something like “Please translate this from German to English: ‘the question’” and adjusted that to match each case? I’m wondering if the embedding would give a greater similarity when the question imparts some meaning to it… if that makes any sense to you.
If you translate the embedded text to English before embedding it, the dot product will be better. But the overhead of doing this is huge. So I embedded in German and then translated the question from English to German, instead of having to translate all the original text.
When I got the match (German text vs German question), I asked the final query in English using the German text as context, and it works fine (i.e. the final query doesn’t all have to be in one language - only the embedded search and dot product are affected).
I hope that makes sense
Ok, yep. I just wondered if adding the word “translation” in there somehow made the cross-language dot products higher… a thing I may well go and test now.
Sorry but I can’t answer that. Let us know how you get on with your tests though
I am encountering issues embedding Spanish and Portuguese - the results only cluster into two groups. Any suggestions about preprocessing the data before calling ada?
One should strip all carriage returns, as these seem to mess with the models in various ways.
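A minimal preprocessing helper for that advice might look like this (the function name and the choice to also collapse runs of whitespace are my assumptions; the thread only says to strip carriage returns):

```python
def clean_for_embedding(text: str) -> str:
    """Normalise whitespace before embedding: replace carriage returns and
    newlines with spaces, then collapse repeated whitespace to single spaces."""
    flattened = (
        text.replace("\r\n", " ")
            .replace("\r", " ")
            .replace("\n", " ")
    )
    return " ".join(flattened.split())
```

Run this over each chunk before sending it to the embeddings endpoint, so line-break artifacts from the source documents cannot skew the vectors.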
I think the problem is ada’s limited fluency in distinguishing two close languages. It is a model that has 1/500th the parameters of davinci or ChatGPT.
That first dot at just under 0.4B parameters is ada. While this graph shows a translation task for language inference by GPT-3, you can see that in all cases one of the languages is English. Curie or Davinci may have greater ability for this particular case of two close languages needing distinction (although it comes with a much higher amount of data, up to 12000), although, counterintuitively, the larger model doesn’t always perform best.
If all the passages are large, there are also techniques to chunk the documents and obtain multiple embeddings for each to synthesize a result.
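One way to sketch that chunk-and-combine technique (the window sizes and the mean-pooling choice are illustrative assumptions, not something the thread specifies):

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split a long passage into overlapping word-window chunks so each
    fits comfortably in an embedding call (overlap must be < max_words)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

def mean_pool(vectors):
    """One way to synthesize a single document vector from several chunk
    embeddings: average them component-wise (an assumption - you could also
    keep the per-chunk vectors and take the best-scoring chunk per document)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]
```

You would embed each chunk separately, then either pool the vectors as above or score chunks individually and report the best match per document.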
If you really are just wanting to classify, you might see how simply asking gpt-3.5-turbo performs.
I created two embeddings using ada for two words in Portuguese: cimento (cement) and sorvete (ice cream).
The cosine similarity between them was 0.8, which is clearly wrong.