Embedding in Arabic language

I am trying to create embeddings for the Arabic language. When I add Arabic text to my data file it keeps giving me errors, but when I only have English text the embeddings work perfectly fine. Is there a way to fix this? I tried separating the data so that each language is in a standalone file, but that did not work. I also tried encoding “utf-8”, but that did not help either.

3 Likes

Hi, welcome to the community. It would be helpful if you could share the exact error messages you’re receiving.

The error is kind of weird and long. I am using LangChain to do this.

Traceback (most recent call last):
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 41, in load
    text = f.read()
           ^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 43352: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\ameen\Downloads\chatgpt-retrieval-main\gpt.py", line 35, in <module>
    index = VectorstoreIndexCreator().from_loaders([loader])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\indexes\vectorstore.py", line 81, in from_loaders
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 54, in load
    raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading data/data.txt

When my data.txt has English text it works fine, but when I add an Arabic sentence this error happens.

This is actually not OpenAI’s fault. The traceback shows that LangChain’s text loader is opening your file with cp1252, the Windows default encoding, which can’t decode the bytes of UTF-8-encoded Arabic text.

1 Like

Do you have any suggestions on this? Is there a library to use for Arabic?

Add encoding='utf-8' to the open statement, like this:

with open('data/data.txt', 'r', encoding='utf-8') as f:
    text = f.read()
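To see why the explicit encoding matters, here is a minimal sketch (pure Python, writing a temporary file that stands in for data.txt). UTF-8-encoded Arabic text contains bytes such as 0x81 that the cp1252 codec leaves undefined, which is exactly the UnicodeDecodeError in the traceback above:

```python
import tempfile
from pathlib import Path

# "ملف" (Arabic for "file") encodes in UTF-8 to bytes that include 0x81,
# a byte the cp1252 codec leaves undefined.
arabic = "ملف"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    path.write_text(arabic, encoding="utf-8")

    # Reading back with cp1252 (the Windows default) raises the same
    # "charmap codec can't decode byte 0x81" error as in the traceback.
    try:
        path.read_text(encoding="cp1252")
        print("cp1252 decoded it")
    except UnicodeDecodeError as err:
        print("cp1252 failed:", err.reason)

    # An explicit utf-8 encoding round-trips the text correctly.
    print(path.read_text(encoding="utf-8") == arabic)  # True
```

Since the failing open() call is inside LangChain’s TextLoader rather than in your own code, the practical fix is probably to pass the encoding through the loader, e.g. TextLoader('data/data.txt', encoding='utf-8') — recent LangChain versions accept an encoding parameter, but check the version you have installed.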

1 Like

I have the same problem. I tried it on the Microsoft Azure platform and got errors, plus a huge bill, even though I got no response at all.