If you take the dot product of an English query embedding against a German text embedding,
the value will be lower than
the dot product of the same query against the equivalent English text.
Note: The following example is made up and the numbers are not real, but they are designed to illustrate the issue and its resolution:
In other words, if I have the following:
A: How are you (English)
B: Wie geht es dir (Same text - but German)
(They are the same thing - but in two different languages)
Now, if I do a semantic search against these and take the dot product with the English query (e.g. “give me a greeting”):
A might score 0.84 (English vs English)
B might score 0.78 (English vs German)
Now, if I calculate the dot product with the German query (e.g. “gib mir einen Gruß”, i.e. “give me a greeting” in German):
A will now score 0.78 (German vs English)
B will score 0.84 (German vs German)
So, as you can see, asking for a greeting in English gets a good hit for A (but not B)
and asking for a greeting in German gets a good hit for B (and not A)
But if we combine them, taking
A's score from the English search and
B's score from the German search,
both end up with the same (high) value.
Note: While the scores are not exactly identical in practice, the remaining differences between languages are small enough not to matter - and the results are SIGNIFICANTLY better than the single-language results above. Even though this is not perfect, it gives good enough results.
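For anyone who wants to try this, here is a minimal sketch of the scoring-and-combining step. It assumes the current OpenAI Python client and the text-embedding-3-small model, which are stand-ins rather than what was actually used here, and the real scores it prints will not match the made-up numbers above:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Embed text and L2-normalise, so a dot product acts as cosine similarity."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    vec = np.array(resp.data[0].embedding)
    return vec / np.linalg.norm(vec)

documents = {
    "A": "How are you",       # English
    "B": "Wie geht es dir",   # German, same meaning
}
doc_vectors = {doc_id: embed(text) for doc_id, text in documents.items()}

# The same query in both languages (translated ahead of time).
queries = {"en": "give me a greeting", "de": "gib mir einen Gruß"}
query_vectors = {lang: embed(q) for lang, q in queries.items()}

# Score each document against each language version of the query and keep
# the best score per document -- the "combine them" step described above.
for doc_id, doc_vec in doc_vectors.items():
    scores = {lang: float(q_vec @ doc_vec) for lang, q_vec in query_vectors.items()}
    print(doc_id, scores, "combined:", round(max(scores.values()), 2))
```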
Extra Info:
PS: Regarding the namespace you mentioned: all of our texts were about Roman history and were gathered from about 300 sources. They were combined into a single dataset and embedded as a single corpus. The queries were run in English - but the embedding checks were done in the 5 native languages (behind the scenes). We used Curie to translate the query (not the embeddings) as part of the workflow.
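The translate-then-embed part of such a workflow could look roughly like this. Curie is deprecated, so a chat model stands in for the translation step here, and the language list and example query are placeholders (the 5 languages are not named above):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder list: the actual 5 native languages are not specified in the post.
LANGUAGES = ["German", "French", "Italian", "Spanish", "Latin"]

def translate_query(query: str, language: str) -> str:
    """Translate the English query before the per-language embedding lookup.
    The original workflow used Curie for this; a chat model is shown only
    because the older completion models are deprecated."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Translate the following into {language}. "
                              f"Reply with the translation only:\n{query}"}],
    )
    return resp.choices[0].message.content.strip()

english_query = "How was the Roman Senate organised?"  # illustrative query
query_variants = {"English": english_query}
query_variants.update({lang: translate_query(english_query, lang) for lang in LANGUAGES})
# Each variant is then embedded and searched against the single corpus,
# and the per-document scores are combined as described above.
```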
We enhanced our workflow by running multiple passes (20 in all), querying approx. 50,000 tokens' worth of embedded data for each question. When we moved to GPT-4 we increased this to roughly 140,000 tokens.
Each pass asked GPT to update the answer from the previous pass, based on the new context supplied in that pass (we did this 20 times). The quality of the answers was of academic standard.
We also constrained the model so it would only use citations included in the embedded source material and would not hallucinate.
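A rough sketch of that multi-pass refinement loop, including the "citations only from the supplied context" constraint. The prompt wording, batch structure and helper name are illustrative, not the exact prompts used:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are answering questions about Roman history. "
    "Improve the draft answer using ONLY the context provided. "
    "Cite only sources that appear in the context; if the context does not "
    "support a claim, leave it out rather than guessing."
)

def refine_answer(question: str, context_batches: list[str], passes: int = 20) -> str:
    """Run multiple passes, each updating the previous answer with a new
    batch of retrieved context (a sketch of the workflow described above)."""
    answer = ""
    for i in range(min(passes, len(context_batches))):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content":
                    f"Question: {question}\n\n"
                    f"Previous answer (may be empty):\n{answer}\n\n"
                    f"New context (pass {i + 1}):\n{context_batches[i]}\n\n"
                    "Return the updated answer with citations."},
            ],
        )
        answer = resp.choices[0].message.content
    return answer
```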