Using vision in Assistants and vector databases

Hello, I am working with OpenAI Assistants, and they don't seem to disclose how their RAG works. I was wondering whether images are passed as context to the LLM, or only text. I am asking because I am doing retrieval from a bank of PDFs that contain schemas, and passing these schemas as images to ChatGPT vision helps get a better answer. So I thought to myself: would OpenAI Assistants do that?

Thanks in advance.

You can pass images via the Assistants API. The documentation describes this here: https://platform.openai.com/docs/assistants/deep-dive/creating-image-input-content

Note, though, that you would have to upload the schemas as separate image files with the purpose vision. What is currently not possible is to upload a PDF file and have the Assistant process both the text and the images at the same time.
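To illustrate, here is a minimal sketch using the Python SDK of what that separate upload looks like: the image is uploaded with purpose="vision" and then attached to a thread message next to a text prompt. The file name and prompt are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Upload the schema image separately, with the "vision" purpose
schema_image = client.files.create(
    file=open("schema_page_3.png", "rb"),  # placeholder file name
    purpose="vision",
)

# Attach the image to a thread message alongside a text question
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=[
        {"type": "text", "text": "Explain the relationships shown in this schema."},
        {"type": "image_file", "image_file": {"file_id": schema_image.id}},
    ],
)
```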


I think (I am not sure) that what you are trying to do is input an image and then get the same image back out of GPT-4o. It will not do that right now, because it renders every image; it never copies an image like a scan. You can scan an image in, but if you ask for that image back with different words or something, it does not use the same image. It has to render a new one, and it can't seem to do the same exact thing twice. If I am way off, then just ignore me lol.

Not exactly what I am trying to do. I have a bank of images that I need to extract information from. When I try with ChatGPT vision it works, but if I pass an image through an OCR and then give the resulting text to ChatGPT, it doesn't work. So I am wondering whether Assistants would retrieve images along with some text.
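For reference, this is roughly what the working vision approach looks like as a Chat Completions call with a base64-encoded image; the file path, model name, and prompt below are placeholders, not anything specific from this thread.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Read and base64-encode one image from the bank (placeholder path)
with open("schema_page_3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Ask a vision-capable model to extract information directly from the image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the entities and relationships shown in this schema."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```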