Embedding in Arabic language

I am trying to create embeddings for Arabic. When I add Arabic text to my data file it keeps giving me errors, but when I only have English text the embeddings work perfectly fine. Is there a way to fix this? I tried separating the data so that each language has its own file, but that did not work. I also tried encoding the file as “utf-8”, but that did not help either.


Hi, welcome to the community. It would be helpful if you could share the exact error messages you’re receiving.

The error is kind of weird and long. I am using LangChain to do this.

Traceback (most recent call last):
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 41, in load
    text = f.read()
           ^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 43352: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\ameen\Downloads\chatgpt-retrieval-main\gpt.py", line 35, in <module>
    index = VectorstoreIndexCreator().from_loaders([loader])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\indexes\vectorstore.py", line 81, in from_loaders
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 54, in load
    raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading data/data.txt

When my data.txt has English text it works fine, but when I add an Arabic sentence this error happens.

This is actually not OpenAI’s fault, and it isn’t that a library lacks Arabic support. The traceback shows that the file is being read with cp1252, the default encoding on Windows, which cannot decode byte 0x81 — a byte that occurs in the UTF-8 encoding of some Arabic letters.
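To see why the traceback complains about byte 0x81 specifically: the Arabic letter feh (ف, U+0641) encodes in UTF-8 to the two bytes D9 81, and 0x81 has no mapping in cp1252. A minimal sketch reproducing the failure (the word used here is illustrative, any Arabic text containing ف will do):

```python
# The Arabic word below contains feh (U+0641), whose UTF-8 encoding
# includes byte 0x81 -- a byte that cp1252 leaves undefined.
data = "عفوا".encode("utf-8")

try:
    data.decode("cp1252")  # how the file was being read on Windows
    raised = False
except UnicodeDecodeError as e:
    raised = True
    print("cp1252 failed on byte", hex(data[e.start]))  # 0x81

assert raised
assert data.decode("utf-8") == "عفوا"  # decoding as UTF-8 works fine
```

English text only uses single-byte ASCII characters, which cp1252 decodes without error, which is why the problem only appears once Arabic text is added.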

Do you have any suggestions on this? Is there a library to use for Arabic?

Add encoding='utf-8' to the open statement, like this:
with open('data/data.txt', 'r', encoding='utf-8') as f:
    text = f.read()