So your embedded data is in English, you send your query in Arabic, and you get your result back in Arabic, right? At what point do you translate back to English? Or is the problem already in the Arabic response?
Yeah, I tried, but since semantic search ranks by cosine similarity, it returns irrelevant or seemingly random results. In other words, because the provided data is in English, it does not work with Arabic queries.
For example, the query in Arabic asks "Who is Elon Musk?"
and I ask for the top 2 matches.
It returns the following:
Neil Alden Armstrong was an American astronaut and aeronautical engineer
Stephen William Hawking was an English theoretical physicist, cosmologist, and author
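One thing worth keeping in mind here: a top-k search always returns k results, however weak the match, so a query whose embedding lands far from everything still comes back with two "matches". Below is a minimal sketch of that behaviour. The 3-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `top_k` and `min_score` are hypothetical names, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=2, min_score=None):
    """Return up to k (score, text) pairs, best first.

    If min_score is given, weaker matches are dropped instead of
    being returned just to fill the k slots.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if min_score is not None:
        scored = [pair for pair in scored if pair[0] >= min_score]
    return scored[:k]

# Toy vectors, invented for this example only.
docs = [
    ("Neil Alden Armstrong was an American astronaut and aeronautical engineer",
     [0.9, 0.1, 0.2]),
    ("Stephen William Hawking was an English theoretical physicist, cosmologist, and author",
     [0.7, 0.3, 0.4]),
]
query_vec = [0.1, 0.9, 0.1]  # stands in for a query that matches neither document well

# Without a threshold, both documents come back even though neither is close.
for score, text in top_k(query_vec, docs, k=2):
    print(f"{score:.2f}  {text}")

# With a threshold, the weak matches are filtered out.
print(top_k(query_vec, docs, k=2, min_score=0.8))
```

A score threshold does not fix cross-language retrieval, but it at least makes "nothing relevant found" visible instead of returning unrelated people.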
I have a similar setup to yours, although my embedded data is mixed-language (English and Japanese). If I query in English, it retrieves even the Japanese data accurately. If I query in Japanese, the result is the same. However, I do notice that when I query in English and the data is in Japanese, some proper nouns are not translated, which is acceptable for me.
The embeddings engine is rather “topic based”: it surfaces broad thematic similarities such as “space man” rather than matching a specific name.
This seems like a good case for a “hypothetical answer embedding”: let the AI answer from what it already knows, then run the embedding semantic search on that answer, which is never shown to the user. Then take the retrieved results and let the AI answer for real.
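A rough sketch of that flow, in case it helps. Everything here is an assumption for illustration: `hyde_search` is a hypothetical helper, and `generate`, `embed`, and `search` are placeholders for whatever chat model, embedding model, and vector store you actually use. The stubs below only exist so the sketch runs without any API calls.

```python
from typing import Callable, List

def hyde_search(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[str]],
    k: int = 2,
) -> List[str]:
    """Embed a model-written draft answer instead of the raw question."""
    # Step 1: draft an answer in the language of the knowledge base (English here).
    # This draft is used only for retrieval and is not shown to the user.
    hypothetical_answer = generate("Answer briefly, in English: " + question)
    # Step 2: search with the embedding of the draft answer.
    return search(embed(hypothetical_answer), k)

# --- Stand-in components (hypothetical, for demonstration only) ---
VOCAB = ["astronaut", "physicist"]
DOCS = [
    "Neil Alden Armstrong was an American astronaut and aeronautical engineer",
    "Stephen William Hawking was an English theoretical physicist, cosmologist, and author",
]

def fake_generate(prompt: str) -> str:
    # A real implementation would call a chat model with the prompt.
    return "He was an American astronaut who flew to the Moon."

def fake_embed(text: str) -> List[float]:
    # Toy "embedding": counts of a few keywords. Real embeddings are dense vectors.
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]

def fake_search(vector: List[float], k: int) -> List[str]:
    # Rank documents by dot product with the query vector.
    def score(doc: str) -> float:
        return sum(x * y for x, y in zip(vector, fake_embed(doc)))
    return sorted(DOCS, key=score, reverse=True)[:k]

# Arabic for "Who is Neil Armstrong?"
results = hyde_search("من هو نيل أرمسترونغ؟", fake_generate, fake_embed, fake_search, k=1)
print(results)
```

The retrieved passages would then go into the final prompt together with the original Arabic question, so the model can answer for real in the user's language.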
For English query->English knowledge, this company has a very promising new embedding offering. The only caveat with embeddings other than OpenAI's is that they have a context length of 512, so the chunking strategy needs to be rethought.
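For what re-thinking the chunking might look like, here is a minimal sketch that splits text into overlapping windows that fit a 512 limit. It assumes the limit is counted in tokens and approximates tokens with whitespace-separated words, which is only a rough proxy; a real setup should count with the embedding model's own tokenizer. `chunk_text` and its parameters are hypothetical names.

```python
def chunk_text(text, max_tokens=512, overlap=64):
    """Split text into overlapping chunks of at most max_tokens words.

    Words stand in for tokens here; actual tokenizers usually produce
    more tokens than words, so leave some headroom in practice.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Example: a 1000-word document split for a 512 limit.
sample = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_text(sample)
print(len(pieces), [len(p.split()) for p in pieces])
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of embedding some text twice.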