Vision within file_search: possible? good?

Hi. I’m trying to build an agent/assistant that recognizes paintings (and images in general) when given access to them via RAG/file_search. Using the Assistants API, I created a vector store with 15 files; each file contains an image of a painting plus some text about it.

When I upload the same image to the assistant and ask about it, the results aren’t great, even though I’ve instructed the assistant that it has access to a specific collection of paintings the user is going to ask about. Sometimes it identifies the painting correctly; sometimes it confuses it with a different one. These are paintings the model never gets right without the uploaded files, so something is partially working. What I can’t tell is whether the file_search tool is actually using vision to compare the images, or whether it’s just reading the titles and artists in each file’s text and guessing well because the pool of possible answers is small. The behavior is consistent across different paintings and LLM models.

Before I experiment further with the Responses API, different file formats, or a different vector-store organization, it would help to know: does file_search/RAG work for images as well? Does it perform vision? And do you know of any studies on its efficacy for this kind of task?
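For reference, this is roughly how I set things up (a minimal sketch: the store name, file path, model choice, and painting text are placeholders, and the upload part only runs when an API key is configured):

```python
import os

def painting_doc(title: str, artist: str, notes: str) -> str:
    # file_search chunks and embeds this text, so the title and artist
    # must appear verbatim for retrieval to match on them.
    return f"Title: {title}\nArtist: {artist}\n\n{notes}\n"

doc = painting_doc(
    "The Starry Night",
    "Vincent van Gogh",
    "Swirling night sky over Saint-Remy, painted in 1889.",
)

# Only talk to the API when credentials are available.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    store = client.vector_stores.create(name="paintings")  # hypothetical name
    with open("starry_night.pdf", "rb") as f:  # placeholder file (image + text)
        client.vector_stores.files.upload_and_poll(
            vector_store_id=store.id, file=f
        )
    assistant = client.beta.assistants.create(
        model="gpt-4o",
        instructions="You identify paintings from a fixed collection.",
        tools=[{"type": "file_search"}],
        tool_resources={"file_search": {"vector_store_ids": [store.id]}},
    )
```

The open question is whether, at query time, file_search matches anything other than this text.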

Google Lens seems to work better at identifying full paintings, but it fails at identifying crops of them. Thanks!