I am trying to create embeddings for Arabic text. When I add Arabic text to my data file, I keep getting errors, but when the file contains only English text the embeddings work perfectly fine. Is there a way to fix this? I tried separating the data so that each language is in a standalone file, but that did not work. I also tried encoding as “utf-8”, but that did not help either.
Hi, welcome to the community. It would be helpful if you could share the exact error messages you’re receiving.
The error is kind of weird and long. I am using LangChain to do this.
Traceback (most recent call last):
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 41, in load
    text = f.read()
           ^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 43352: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\ameen\Downloads\chatgpt-retrieval-main\gpt.py", line 35, in <module>
    index = VectorstoreIndexCreator().from_loaders([loader])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\indexes\vectorstore.py", line 81, in from_loaders
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "C:\Users\ameen\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\text.py", line 54, in load
    raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading data/data.txt
When my data.txt contains only English text it works fine, but when I add an Arabic sentence this error happens.
This is actually not OpenAI’s fault. Judging by the traceback, LangChain’s text loader is reading the file with the Windows default cp1252 codec, which cannot decode the UTF-8 bytes of Arabic characters.
Do you have a suggestion for this? Is there a library to use for Arabic?
Add encoding='utf-8' to the open statement, like this:
with open('data/data.txt', 'r', encoding='utf-8') as f:
    text = f.read()
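To illustrate why the explicit encoding matters, here is a minimal sketch using only the standard library. It writes a short Arabic sample to a temporary UTF-8 file, then reads it back once as cp1252 (the codec named in the traceback) and once as UTF-8. The helper name read_text and the sample sentence are made up for this illustration and are not from the original script.

```python
import os
import tempfile

# Hypothetical sample: one English line and one Arabic line.
sample = "Hello\nكيف حالك"


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding it with an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")

    # Save the file as UTF-8, as most editors do for Arabic text.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Decoding as cp1252 mimics what happens on Windows when no
    # encoding is passed: some UTF-8 bytes (such as 0x81) have no
    # cp1252 mapping, so the read raises UnicodeDecodeError.
    try:
        read_text(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding as UTF-8 returns the original text unchanged.
    utf8_text = read_text(path)

print("cp1252 failed:", cp1252_failed)
print("utf-8 round trip ok:", utf8_text == sample)
```

Since the traceback shows LangChain's loader opening the file itself, the encoding likely has to be passed to the loader rather than to your own open call. Assuming your installed version's TextLoader accepts an encoding argument, something like TextLoader('data/data.txt', encoding='utf-8') should have the same effect; check the signature in your version before relying on it.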
I have the same problem. I tried it on the Microsoft Azure platform and got errors, plus a huge bill even though I got no response at all.