Embedding in Arabic language

I am trying to create embeddings for the Arabic language. When I add Arabic text to my data file it keeps giving me errors, but when I only have English text the embeddings work perfectly fine. Is there a way to fix this? I tried separating the data so that each language is in a standalone file, but that did not work. I also tried encoding “utf-8”, but that did not help either.

3 Likes

Hi, welcome to the community. It would be helpful if you could share the exact error messages you’re receiving.

The error is kind of weird and long. I am using LangChain to do this.

Traceback (most recent call last):
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 41, in load
    text = f.read()
           ^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 43352: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\ameen\Downloads\chatgpt-retrieval-main\gpt.py", line 35, in <module>
    index = VectorstoreIndexCreator().from_loaders([loader])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\indexes\vectorstore.py", line 81, in from_loaders
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 54, in load
    raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading data/data.txt

When my data.txt has English text it works fine, but when I add an Arabic sentence this error happens.

This is actually not OpenAI’s fault. The traceback shows that LangChain’s text loader is opening your file with cp1252, the Windows default encoding, which can’t decode the bytes of UTF-8-encoded Arabic text.

1 Like

Do you have any suggestions on this? Is there a library to use for Arabic?

Add encoding='utf-8' to the open statement, like this:

with open('data/data.txt', 'r', encoding='utf-8') as f:
    text = f.read()
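To see why the explicit encoding matters, here is a minimal sketch (pure Python, writing a temporary file that stands in for data.txt). UTF-8-encoded Arabic text contains bytes such as 0x81 that the cp1252 codec leaves undefined, which is exactly the UnicodeDecodeError in the traceback above:

```python
import tempfile
from pathlib import Path

# "ملف" (Arabic for "file") encodes in UTF-8 to bytes that include 0x81,
# a byte the cp1252 codec leaves undefined.
arabic = "ملف"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    path.write_text(arabic, encoding="utf-8")

    # Reading back with cp1252 (the Windows default) raises the same
    # "charmap codec can't decode byte 0x81" error as in the traceback.
    try:
        path.read_text(encoding="cp1252")
        print("cp1252 decoded it")
    except UnicodeDecodeError as err:
        print("cp1252 failed:", err.reason)

    # An explicit utf-8 encoding round-trips the text correctly.
    print(path.read_text(encoding="utf-8") == arabic)  # True
```

Since the failing open() call is inside LangChain’s TextLoader rather than in your own code, the practical fix is probably to pass the encoding through the loader, e.g. TextLoader('data/data.txt', encoding='utf-8') — recent LangChain versions accept an encoding parameter, but check the version you have installed.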

1 Like

I have the same problem. I tried it on the Microsoft Azure platform and got errors, plus a huge bill, even though I got no response at all.