What is the best embedding model for Arabic datasets?

Dear all,
What is the best embedding model for Arabic datasets? The current answers I get from my “chat with your website” LLM application are not correct.

I am currently using:
1- “text-embedding-ada-002” as the embedding model
2- Pinecone as the vector store, with cosine similarity to retrieve the best context for answering the query
3- ‘gpt-3.5-turbo-instruct’ as the model that answers the query from the retrieved context (a simplified sketch of this pipeline is below)
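
Roughly, the pipeline looks like this (simplified; assuming the openai v1.x and pinecone Python clients — the index name, metadata field, and prompt are placeholders):

```python
# Simplified version of my pipeline. Index and field names are placeholders.
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("website-chunks")  # placeholder index name

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

query = "..."  # the user's (Arabic) question

# 1 + 2) Embed the query and fetch the 3 most similar chunks by cosine similarity.
results = index.query(vector=embed(query), top_k=3, include_metadata=True)
context = "\n\n".join(m.metadata["text"] for m in results.matches)

# 3) Ask gpt-3.5-turbo-instruct to answer from the retrieved context.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=f"Answer the question using only this context:\n{context}\n\nQuestion: {query}\nAnswer:",
    max_tokens=300,
)
print(completion.choices[0].text)
```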

The exact problem is the following (the text is in Arabic, but I will explain it in English):
I asked: “In the Personal Data Protection Law, what are the details of term no. 9?”
Based on cosine similarity, the top three chunks are term no. 19, term no. 29, and term no. 39, so the answer from GPT is wrong because the overall context is wrong.

By cosine similarity score, the right chunk, which is term no. 9, only ranks sixth, and I am using the answer from the first 3 chunks.

Is there a tokenizer that is Arabic-oriented?
Is there an embedding model other than ada-002 that is also Arabic-oriented?
If yes, what is it, and how can I get access to its API?

If changing the embedding/tokenization model is not the right solution for this problem, can you please propose any other solutions?

Regards.
Omran Badarneh

This is happening because term number 19 has the highest cosine similarity to the question you’re asking; using a different embedding model won’t change this behavior.

Think of embeddings as an abstraction layer, representing the context of “some text” as a set of numbers. You can use this to compare texts and find similar ones, or ones connected through similar context, but it’s not going to match exact words.
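
You can see this for yourself with a quick experiment (a minimal sketch, assuming the openai v1.x Python library; the two test strings are just illustrations):

```python
# Embed two near-identical strings and compare them with cosine similarity.
from openai import OpenAI
import math

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# "term no. 9" vs "term no. 19": almost the same words, almost the same context,
# so the score comes out very high -- embeddings capture meaning, not exact digits.
print(cosine(embed("details of term no. 9"), embed("details of term no. 19")))
```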

I’d recommend adding a “full text search” function to your code and using that for questions like these :laughing:

Thanks, N2U.
Can you please help with how to add “full text search”? Just high-level steps; I will drill down into the details myself.

You’re welcome!

There are already a few tutorials out there that apply here, since you’re using Pinecone as your database :laughing:

What I’m calling “full text search” is named “keyword search” in Pinecone, and the method you’re already using is named “semantic search”. You can combine the two into something called “hybrid search” for even better results:

And here’s an example of how to code such a thing:
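
Something along these lines (a sketch, assuming the pinecone and pinecone-text Python libraries, plus an index created with the dotproduct metric, which sparse-dense queries require; the index name and corpus are placeholders):

```python
# Hybrid (dense + sparse) search sketch. Your chunks must also have been
# upserted with sparse values for the keyword side to contribute.
from openai import OpenAI
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("website-chunks-hybrid")  # placeholder index name

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

# Fit the BM25 encoder on your own chunk corpus so term statistics match it.
bm25 = BM25Encoder()
bm25.fit(["... your Arabic document chunks ..."])

def hybrid_scale(dense, sparse, alpha: float):
    """alpha=1.0 is pure semantic search, alpha=0.0 is pure keyword search."""
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return [v * alpha for v in dense], scaled_sparse

query = "..."  # e.g. the Arabic question about term no. 9
dense, sparse = hybrid_scale(embed(query), bm25.encode_queries(query), alpha=0.5)

results = index.query(
    vector=dense,
    sparse_vector=sparse,
    top_k=3,
    include_metadata=True,
)
for match in results.matches:
    print(match.score, match.metadata["text"])
```

The keyword side should catch the literal “9” even when the embeddings rank 19/29/39 higher, and you can tune alpha toward 0 if exact terms matter more than meaning for your queries.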

Thanks a lot, N2U, for your usual support.
