Can an Assistant read the contents of non-English pdf files in a vector store?

rafiul.nakib · September 19, 2024, 3:32am

Hi Everyone,

I have been trying to build an Assistant that reads PDF files from a vector store and answers questions about the contents of those files. The files are not in English; at the moment I am only feeding it Bengali-language files. The problem I am facing is that whenever I instruct the Assistant to answer questions only from the files, it says it doesn’t know the answer. But if I remove file search, it gives me a fairly accurate answer from its existing knowledge.

The files I uploaded are a few physics textbook chapters.

Anyone faced a similar issue? Any suggested solution? Thanks!

_j · September 19, 2024, 6:05am

Hi, what you describe is not something I have seen to that degree. Still, others have had problems with this knowledge retrieval tool, so it is not unexpected.

Unfortunately, you can’t directly place new instructions for the file search ability.

You can, however, help it by giving the AI instructions on exactly how to write its query to the msearch method of myfiles_browser. Ideally, use language as precise as: “The files your developer uploaded are a few physics textbook chapters in the vector database of myfiles_browser. They are in Bengali; therefore, your query must be exclusively in Bengali, where you discuss a provisional answer in Bengali to activate semantic search on similar documents, with the actual answer to be provided if available.”

This is not a guaranteed solution, because you aren’t allowed to truly be the “developer” of this tool. It also depends on the multilingual ability of the embeddings model (text-embedding-3-large) used to compute the vectors, and on whatever software extracts the text from your documents: a PDF must contain searchable text, not just scanned pictures. Still, the instruction may guide the assistant’s AI, even more so if the only language it ever reads is the one it should produce.
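As a rough illustration of the instruction approach above, here is a minimal sketch that only builds the parameters for an Assistant with file search enabled. The helper name `build_assistant_params`, the model name `gpt-4o`, and the vector store ID `vs_example123` are placeholders assumed for this example, not values from this thread. The parameter shape follows the Assistants API v2 as I understand it, so check it against the current API reference before relying on it.

```python
def build_assistant_params(vector_store_id: str,
                           language: str = "Bengali",
                           model: str = "gpt-4o") -> dict:
    """Build keyword arguments for creating an Assistant with file search.

    The instruction text adapts the wording suggested above; the model
    name is an assumption, not something confirmed in this thread.
    """
    instructions = (
        "The files your developer uploaded are a few physics textbook "
        f"chapters in the vector store. They are in {language}; therefore, "
        f"every file search query you write must be exclusively in {language}. "
        f"Phrase the query as a provisional answer in {language} so that "
        "semantic search matches passages containing the actual answer. "
        "Answer only from the retrieved passages."
    )
    return {
        "model": model,
        "instructions": instructions,
        "tools": [{"type": "file_search"}],
        "tool_resources": {
            "file_search": {"vector_store_ids": [vector_store_id]}
        },
    }


# "vs_example123" is a hypothetical vector store ID.
params = build_assistant_params("vs_example123")
```

With the official `openai` Python package installed and an API key configured, these parameters would presumably be passed as `client.beta.assistants.create(**params)`. None of this helps if the uploaded PDFs contain no extractable text.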

rafiul.nakib · September 25, 2024, 3:42am

The text in the PDF I provided was not ‘selectable’; it was more like an image. So the Assistant was not able to extract the text and therefore responded negatively. Once I provided a ‘selectable’ PDF, it worked just fine!
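For anyone who wants to check this before uploading, here is a minimal sketch of a pre-upload test for a text layer. It assumes the third-party `pypdf` package for extraction; the helper names and the thresholds (20 characters per page, half of the pages) are arbitrary choices for illustration, not anything the file search tool itself uses.

```python
def extract_page_texts(path: str) -> list[str]:
    """Return the extractable text of each page (requires `pip install pypdf`)."""
    from pypdf import PdfReader  # third-party; imported here so the helpers below work without it
    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def looks_searchable(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """True if at least half of the pages yield a meaningful amount of text.

    Scanned, image-only PDFs usually yield empty or near-empty strings.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= 0.5


def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Bengali Unicode block (U+0980 to U+09FF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bengali = sum(1 for c in chars if "\u0980" <= c <= "\u09ff")
    return bengali / len(chars)
```

A typical use would be `texts = extract_page_texts("chapter1.pdf")` (a hypothetical file name), then checking `looks_searchable(texts)` and `bengali_ratio(" ".join(texts))`. If the first is false, the file likely needs OCR or a re-export with a real text layer. A low Bengali ratio on a Bengali document can point to a font encoding problem in the extracted text.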
