Can an Assistant read the contents of non-English PDF files in a vector store?

Hi Everyone,

I have been trying to build an Assistant that reads PDF files from a vector store and answers questions about the contents of those files. The files are not in English; at the moment I am only feeding it Bengali-language files. The problem I am facing is that whenever I instruct the Assistant to answer questions only from the files, it says it doesn’t know the answer. But if I remove file search, it gives me a fairly accurate answer from its existing knowledge.

The files I uploaded are a few physics textbook chapters.

Anyone faced a similar issue? Any suggested solution? Thanks!

Hi, what you describe is not something I have experienced to that degree. Still, others have had problems with this knowledge retrieval tool, so it is not unexpected.

Unfortunately, you can’t directly place new instructions for the file search ability.

You can, however, help it by giving the AI an instruction on exactly how to produce its query to the msearch method of myfiles_browser. Ideally the language would be as precise as: “The files your developer uploaded are a few physics textbook chapters in the vector database of myfiles_browser. They are in Bengali; therefore, your query must be exclusively in Bengali, where you discuss the provisional answer in Bengali to activate semantic search on similar documents, with the actual answer to be provided if available.”
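As a rough sketch of where such an instruction would go: the snippet below only builds the arguments for creating an Assistant with file search attached. It assumes the official `openai` Python package and its beta Assistants endpoint; the model name `gpt-4o`, the helper name `build_assistant_kwargs`, and the vector store ID are placeholders of mine, not values from this thread.

```python
# Hypothetical helper: assembles the keyword arguments for an Assistant
# that is told to phrase its file-search queries in the documents' language.

def build_assistant_kwargs(vector_store_id: str, language: str = "Bengali") -> dict:
    instructions = (
        "The files your developer uploaded are a few physics textbook chapters "
        f"in the vector database of myfiles_browser. They are in {language}; "
        f"therefore, your search query must be written exclusively in {language}. "
        "Answer only from the retrieved file contents."
    )
    return {
        "model": "gpt-4o",  # assumption: any model that supports file search
        "instructions": instructions,
        "tools": [{"type": "file_search"}],
        "tool_resources": {"file_search": {"vector_store_ids": [vector_store_id]}},
    }


if __name__ == "__main__":
    kwargs = build_assistant_kwargs("vs_example123")  # placeholder ID
    print(kwargs["instructions"])
    # With an API key configured, the Assistant could then be created with:
    #   from openai import OpenAI
    #   assistant = OpenAI().beta.assistants.create(**kwargs)
```

Whether the model actually follows the query-language instruction is not something this guarantees; it only shows where the text is placed.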

This is not a guaranteed solution, because you aren’t allowed to truly be the “developer” of this tool, but it may guide the assistant’s AI. It also depends on the multilingual ability of the embeddings model (text-embedding-3-large) used to produce the vector values, and on the extraction ability of whatever software pulls the text out of those documents. If they are PDFs, they must contain searchable text, not just scanned pictures. It may help even more if the only language the AI ever reads is the one it is meant to produce.

The text in the PDF I provided was not ‘selectable’; it was more like an image. So the Assistant was not able to extract the text and responded negatively. Once I provided a ‘selectable’ PDF, it worked just fine!
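For anyone who wants to check this before uploading, here is a minimal sketch of a "does this PDF have a text layer" test. The two heuristics use only the standard library; the extraction step assumes the third-party `pypdf` package is installed. The function names and the 50-characters-per-page threshold are my own choices, not anything the Assistants tooling defines.

```python
from typing import Iterable

BENGALI_START, BENGALI_END = 0x0980, 0x09FF  # Unicode Bengali block


def looks_searchable(page_texts: Iterable[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic: True if pages yield, on average, a reasonable amount of text."""
    pages = list(page_texts)
    if not pages:
        return False
    total = sum(len(text.strip()) for text in pages)
    return total / len(pages) >= min_chars_per_page


def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Bengali block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bengali = sum(1 for c in chars if BENGALI_START <= ord(c) <= BENGALI_END)
    return bengali / len(chars)


def extract_page_texts(path: str) -> list[str]:
    """Extract text per page. Assumes pypdf is installed (pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so the helpers above work without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        texts = extract_page_texts(sys.argv[1])
        print("searchable text layer:", looks_searchable(texts))
        print("Bengali character ratio:", round(bengali_ratio(" ".join(texts)), 2))
```

A scanned, image-only PDF typically yields empty or near-empty strings here, which matches the symptom above; such a file would need OCR before it is uploaded. A low Bengali ratio on a Bengali document can also hint that the embedded fonts did not extract as proper Unicode.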