What is the best embedding model for Arabic datasets?

Dear all,
What is the best embedding model for Arabic datasets? The current answers I get from my “chat with your website” LLM application are not correct.

I am currently using:
1- “text-embedding-ada-002” as the embedding model
2- Pinecone as the vector store, with cosine similarity to retrieve the best context for answering the query
3- ‘gpt-3.5-turbo-instruct’ as the model that answers the query from the retrieved context (a simplified sketch of this pipeline is below)
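
Roughly, the pipeline looks like this (simplified; assuming the openai v1.x and pinecone Python clients — the index name, metadata field, and prompt are placeholders):

```python
# Simplified version of my pipeline. Index and field names are placeholders.
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("website-chunks")  # placeholder index name

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

query = "..."  # the user's (Arabic) question

# 1 + 2) Embed the query and fetch the 3 most similar chunks by cosine similarity.
results = index.query(vector=embed(query), top_k=3, include_metadata=True)
context = "\n\n".join(m.metadata["text"] for m in results.matches)

# 3) Ask gpt-3.5-turbo-instruct to answer from the retrieved context.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=f"Answer the question using only this context:\n{context}\n\nQuestion: {query}\nAnswer:",
    max_tokens=300,
)
print(completion.choices[0].text)
```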

The exact problem is the following (the text is in Arabic, but I will explain it in English):
I asked: “In the Personal Data Protection Law, what are the details of term no. 9?”
Based on cosine similarity, the top three chunks are term no. 19, term no. 29, and term no. 39, so the answer from GPT is wrong because the overall context is wrong.

By cosine similarity score, the right chunk, which is term no. 9, only ranks sixth, and I am using the answer from the first 3 chunks.

Is there a tokenizer that is Arabic-oriented?
Is there an embedding model other than ada-002 that is also Arabic-oriented?
If yes, what is it, and how can I get access to its API?

If changing the embedding/tokenization model is not the right solution for this problem, can you please propose any other solutions?

Regards.
Omran Badarneh

This is happening because term number 19 has the highest cosine similarity to the question you’re asking; using a different embedding model won’t change this behavior.

Think of embeddings as an abstraction layer, representing the context of “some text” as a set of numbers. You can use this to compare texts and find similar ones, or ones connected through similar context, but it’s not going to match exact words.
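
You can see this for yourself with a quick experiment (a minimal sketch, assuming the openai v1.x Python library; the two test strings are just illustrations):

```python
# Embed two near-identical strings and compare them with cosine similarity.
from openai import OpenAI
import math

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# "term no. 9" vs "term no. 19": almost the same words, almost the same context,
# so the score comes out very high -- embeddings capture meaning, not exact digits.
print(cosine(embed("details of term no. 9"), embed("details of term no. 19")))
```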

I’d recommend adding a “full text search” function to your code and using that for questions like these :laughing:

Thanks, N2U.
Can you please help with how to add “full text search”? Just high-level steps; I will drill down into the details myself.

You’re welcome!

There are already a few tutorials out there that apply here, since you’re using Pinecone as your database :laughing:

What I’m calling “full text search” is named “keyword search” in Pinecone, and the method you’re already using is named “semantic search”. You can combine the two into something called “hybrid search” for even better results:

And here’s an example of how to code such a thing:
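
Something along these lines (a sketch, assuming the pinecone and pinecone-text Python libraries, plus an index created with the dotproduct metric, which sparse-dense queries require; the index name and corpus are placeholders):

```python
# Hybrid (dense + sparse) search sketch. Your chunks must also have been
# upserted with sparse values for the keyword side to contribute.
from openai import OpenAI
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("website-chunks-hybrid")  # placeholder index name

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

# Fit the BM25 encoder on your own chunk corpus so term statistics match it.
bm25 = BM25Encoder()
bm25.fit(["... your Arabic document chunks ..."])

def hybrid_scale(dense, sparse, alpha: float):
    """alpha=1.0 is pure semantic search, alpha=0.0 is pure keyword search."""
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return [v * alpha for v in dense], scaled_sparse

query = "..."  # e.g. the Arabic question about term no. 9
dense, sparse = hybrid_scale(embed(query), bm25.encode_queries(query), alpha=0.5)

results = index.query(
    vector=dense,
    sparse_vector=sparse,
    top_k=3,
    include_metadata=True,
)
for match in results.matches:
    print(match.score, match.metadata["text"])
```

The keyword side should catch the literal “9” even when the embeddings rank 19/29/39 higher, and you can tune alpha toward 0 if exact terms matter more than meaning for your queries.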

Thanks a lot, N2U, for your usual support.
