Embedded Data for chat bot

M.AB · November 6, 2023, 7:46am

I am using English embedded data for RAG, to retrieve texts from my data that are similar to the user's query. The problem is this:

When the query is in another language, e.g. Arabic, and I translate it into English before searching for similar text, it does not work well.

For example: "gross" and "total" translate to the same word in Arabic.

supershaneski · November 6, 2023, 8:02am

So your embedded data is in English. Then you send your query in Arabic? Then you get your result in Arabic, right? So when do you translate back to English? Or is the problem already in the Arabic response?

M.AB · November 6, 2023, 8:06am

I first translate the Arabic into English and then send it for semantic search, but the translation step is not working well.

e.g. I query: Gross income
It translates to: Total income

supershaneski · November 6, 2023, 8:22am

Ah, have you tried not translating the query to English?

M.AB · November 6, 2023, 8:29am

Yeah, I tried, but since semantic search ranks by cosine similarity, it does not return relevant results, or it returns random ones. In other words, because the provided data is in English, it does not work with Arabic queries.

For example: the query in Arabic says "Who is Elon Musk?"
and I ask for the top 2 matches.
It gives the following:

Neil Alden Armstrong was an American astronaut and aeronautical engineer
Stephen William Hawking was an English theoretical physicist, cosmologist, and author

Meaning the results are totally random.

supershaneski · November 6, 2023, 8:33am

I have a similar setup to yours, although my embedded data is mixed-language (English and Japanese). If I query in English, it can retrieve even the Japanese data accurately. If I query in Japanese, the result is the same. However, I do notice that if I query in English and the data is in Japanese, some proper nouns are not translated, which is acceptable for me.

M.AB · November 6, 2023, 8:35am

Can you please tell me which semantic search you used for that?

_j · November 6, 2023, 8:36am

The embeddings engine is rather “topic based”. It will highlight similarities such as “space man”.

This seems like a good case for a “hypothetical answer embedding” - let the AI answer with what it knows, and then do an embedding semantic search with the AI’s answer not shown to the user. Then get your results and let the AI answer for real.
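A minimal sketch of that "hypothetical answer embedding" flow, in Python. Everything here is an assumption for illustration: `hyde_search`, `fake_answer`, and `toy_embed` are hypothetical names, the keyword-count "embedding" is a stand-in for a real embeddings API call, and the canned answer is a stand-in for a real chat-completion call. The point is only the order of the steps: generate an answer, embed the answer instead of the raw query, then rank the stored texts.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    generate_answer: Callable[[str], str],  # hypothetical: wraps a chat model
    embed: Callable[[str], Vector],         # hypothetical: wraps an embeddings model
    docs: List[str],
    doc_vectors: List[Vector],
    top_k: int = 2,
) -> List[str]:
    # 1. Let the model draft an answer in the language of the stored data.
    hypothetical = generate_answer(query)
    # 2. Embed the draft answer, not the original query.
    query_vector = embed(hypothetical)
    # 3. Rank stored texts by similarity to the draft answer.
    ranked = sorted(
        zip(docs, doc_vectors),
        key=lambda pair: cosine(query_vector, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:top_k]]


# Toy stand-ins so the sketch runs without any API key.
VOCAB = ("astronaut", "physicist", "tesla")


def toy_embed(text: str) -> List[float]:
    """Counts a few keywords; a real embedding model replaces this."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def fake_answer(query: str) -> str:
    """Canned English draft answer; a real chat model replaces this."""
    return "Elon Musk is a businessman who leads Tesla and SpaceX."


DOCS = [
    "Neil Alden Armstrong was an American astronaut and aeronautical engineer",
    "Stephen William Hawking was an English theoretical physicist, cosmologist, and author",
    "Elon Musk is the chief executive of Tesla and SpaceX",
]
DOC_VECTORS = [toy_embed(doc) for doc in DOCS]
```

With the real calls plugged in, the draft answer is never shown to the user. It only steers the search, and the retrieved texts are then passed to the model for the final answer.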

For English query->English knowledge, this company has a very promising new embedding offering. The only caveat with embeddings other than OpenAI's is that they have a 512-token context, so the chunking strategy needs to be rethought.
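One way to rethink chunking for a small context window is fixed-size chunks with some overlap. The sketch below is only an approximation under stated assumptions: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words rather than real tokens, so the limit should be set comfortably below the model's actual token limit.

```python
from typing import List


def chunk_words(text: str, max_tokens: int = 512, overlap: int = 64) -> List[str]:
    """Split text into overlapping chunks of at most max_tokens words.

    Words are a rough proxy for tokens; a real tokenizer will count differently.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.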

supershaneski · November 6, 2023, 8:36am

Just simple cosine similarity. I am actually surprised that I get a result. I expected it would not find anything, given the language difference.
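For reference, a plain cosine-similarity top-k search can be sketched as below. The names `cosine_similarity` and `top_k_matches` are hypothetical, and the vectors are assumed to come from whichever embedding model is in use; this is not the poster's actual code.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """dot(a, b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(
    query_vector: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],  # (text, embedding) pairs
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs with the highest similarity, best first."""
    scored = (
        (cosine_similarity(query_vector, vector), text) for text, vector in corpus
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

Because the search always returns the k highest scores, it returns something even when nothing is relevant. Checking the scores against a minimum threshold helps tell a real match from a "random" one.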

M.AB · November 6, 2023, 8:44am

That’s awesome. Thanks for the help. I think managing the data may work better.
