I am using English embedded data for RAG to retrieve texts similar to a user query, but there is a problem: when I use another language, i.e. Arabic, and translate it into English before finding similar text, it does not work well.
For example: "gross" and "total" come out as the same word when translating from Arabic.
So your embedded data is English. Then you send your query in Arabic? Then you get your result in Arabic, right? So when do you translate back to English? Or is the problem already in the Arabic response?
I first translate the Arabic into English and then send it for semantic search, but the translation itself does not work well.
i.e. I query: gross income
it translates to: total income
Ah, have you tried not translating the query to English?
Yeah, I tried, but since semantic search uses cosine similarity, it does not give relevant results; it gives random ones. Or we could say that because the provided data is in English, it does not work with Arabic queries.
for example: the query, in Arabic, says "who is Elon Musk?"
and I ask for the top 2 matches
it gives the following:
- Neil Alden Armstrong was an American astronaut and aeronautical engineer
- Stephen William Hawking was an English theoretical physicist, cosmologist, and author
In other words, totally random.
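The "random matches" behavior above can be reproduced with plain cosine-similarity retrieval. Below is a minimal sketch; the vectors are hand-made toy values standing in for real embeddings (a real setup would call an embedding API), chosen to illustrate how an out-of-distribution query vector produces an arbitrary ranking:

```python
import numpy as np

# Toy stand-in for an embeddings store: real setups would call an
# embedding API. These vectors are hand-made for illustration only.
EMBEDDINGS = {
    "Neil Armstrong was an American astronaut": np.array([0.9, 0.1, 0.05]),
    "Stephen Hawking was an English physicist": np.array([0.8, 0.2, 0.1]),
    "Elon Musk is a business magnate":          np.array([0.7, 0.3, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, k=2):
    # Rank every stored passage by cosine similarity to the query vector.
    scored = sorted(EMBEDDINGS.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]

# A hypothetical embedding of an Arabic query: it lands far from every
# English passage, so the ranking among them is effectively arbitrary
# and the relevant Elon Musk passage can still be missed.
arabic_query_vec = np.array([0.0, 0.0, 1.0])
print(top_k(arabic_query_vec))
```

With an English-only embedding model, an Arabic query vector has low similarity to everything, so whichever passages happen to score slightly higher look like random picks.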
I have a similar setup to yours, although my embedded data is mixed-language (English and Japanese). If I query in English, it can retrieve even the Japanese data accurately. If I query in Japanese, the result is the same. However, I do notice that if I query in English and the data is in Japanese, some proper nouns are not translated, which is acceptable for me.
Can you please tell me which semantic search you used for that?
The embeddings engine is rather “topic based”. It will highlight similarities such as “space man”.
This seems like a good case for a “hypothetical answer embedding”: let the AI answer with what it knows, then run an embedding semantic search on the AI’s answer (which is not shown to the user). Then take your results and let the AI answer for real.
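The hypothetical-answer idea can be sketched as below. `generate_answer` and `embed` are placeholders for your LLM and embedding calls, not real APIs; the point is only the retrieval flow:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hyde_search(query, generate_answer, embed, corpus, k=2):
    """Hypothetical-answer retrieval: instead of embedding the raw
    (possibly Arabic) query, embed a draft answer written in the
    corpus language and rank passages against that.

    generate_answer(query) -> str   draft answer, never shown to the user
    embed(text) -> vector           embedding call
    corpus: list of (text, vector) pairs
    """
    draft = generate_answer(query)  # e.g. an English paragraph about the topic
    qvec = embed(draft)
    ranked = sorted(corpus, key=lambda item: cosine(qvec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Because the draft answer is in the same language as the knowledge base, the similarity comparison happens English-to-English even when the user's query is Arabic.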
For English query -> English knowledge, this company has a very promising new embedding offering. The only caveat with embeddings other than OpenAI’s is that they have a 512-token context, so the chunking strategy needs to be rethought.
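Re-chunking for a 512-token window might look like the sketch below. The whitespace split is a crude stand-in for a real tokenizer (counts should come from the embedding model's own tokenizer), and the overlap size is an arbitrary choice:

```python
def chunk_text(text, max_tokens=512, overlap=64):
    # Crude whitespace "tokenizer" for illustration; real pipelines
    # should count tokens with the embedding model's tokenizer.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        # Overlap consecutive chunks so a sentence split at a boundary
        # still appears whole in at least one chunk.
        start = end - overlap
    return chunks
```

Each chunk stays under the model's context limit, and the overlap keeps boundary sentences retrievable.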
Just simple cosine similarity. I am actually surprised that I can get a result; I also expected it couldn’t find anything, given the language difference.
That’s awesome. Thanks for the help. I think managing the data may work better.