Integrating Vision with the Assistants API

Hi all,

Hope you can help me with this. I have a working assistant built with the Assistants API in Node.js, and I’m looking to enhance the user experience by letting users add images within the same conversation. The assistant will analyze these images and seamlessly continue the conversation based on the analysis.

Does anyone have experience/code to share with me? Would be great!

My PR (albeit in RoR) might help:

Thanks @merefield

Can you maybe explain the logic steps?

  1. Send the image to the backend
  2. Keep the Vision function separate
  3. When the backend route receives an image from the frontend, trigger the Vision function and analyze the data
  4. Stream the result back to the frontend (rough sketch of what I mean below)

Is this also your approach?
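
For context, here’s a minimal sketch of what I mean in TypeScript, assuming an Express backend with multer for the upload and the official OpenAI Node SDK; the route, field and model names are just placeholders:

```ts
// Step numbers below refer to the list above. Assumptions: Express + multer
// for the upload, and the OpenAI Node SDK; gpt-4o stands in for whichever
// vision-capable model is used.
import express from "express";
import multer from "multer";
import OpenAI from "openai";

const app = express();
const upload = multer({ storage: multer.memoryStorage() });
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Step 1: the frontend POSTs the image (plus an optional question) here.
app.post("/vision", upload.single("image"), async (req, res) => {
  if (!req.file) {
    res.status(400).json({ error: "no image uploaded" });
    return;
  }

  // Steps 2-3: hand the image to a separate Vision call; here it is a
  // streamed Chat Completions request with the image inlined as a data URL.
  const dataUrl = `data:${req.file.mimetype};base64,${req.file.buffer.toString("base64")}`;
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    stream: true,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: (req.body?.question as string) ?? "Describe this image." },
          { type: "image_url", image_url: { url: dataUrl } },
        ],
      },
    ],
  });

  // Step 4: stream the analysis back to the frontend as plain text chunks.
  res.setHeader("Content-Type", "text/plain; charset=utf-8");
  for await (const chunk of stream) {
    res.write(chunk.choices[0]?.delta?.content ?? "");
  }
  res.end();
});

app.listen(3000);
```

The frontend would POST a multipart form with an `image` file (and optionally a `question` field) and read the streamed text as it arrives.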

I’d appreciate it if you could please read the code. It should be very readable and clear.

I have a little more time now. Apologies if I was a bit blunt, but I would have appreciated it if your question had demonstrated that you had read my code. I wasn’t convinced! It would have saved me time if you had been more specific in your question.

No, that’s not the flow with my (Discourse) Chatbot. It works like this (there’s a rough sketch after the list):

  1. At some point in the conversation, an image is uploaded to the forum (in this case, Discourse).
  2. The User asks a question about the image.
  3. Send the list of functions (which includes the local Vision function) to the Chat Completions model, along with the query the User just made.
  4. I believe the LLM works out that it should call the local Vision function because it identifies a strong semantic relationship between the function definition and the User’s query (this all happens on OpenAI’s side).
  5. The LLM responds with a function call, optionally including a short phrase that represents what the User asked about the image.
  6. My local code handles the response and calls the Vision function.
  7. The Vision function unpacks the query parameter, finds the image from the current conversation on the forum, gets its URL from the Uploads Rails model (a table of uploads), and then sends the query and the image URL to the GPT-4 Vision model (or whatever is set in settings for Vision). NB: the image must be public!
  8. The response is packaged up and sent back to the Chat Completions model as the answer to the function call.
  9. The LLM then responds to the User with its repackaging of the answer.
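
Purely to illustrate (my plugin is Ruby on Rails, so this is not my actual code), those steps map roughly onto something like this in TypeScript with the OpenAI Node SDK; the `vision` tool definition, the `lookupLatestImageUrl` helper and the model names are placeholders:

```ts
// A rough TypeScript re-sketch of the nine steps above. The real plugin is
// Ruby on Rails; the "vision" tool, lookupLatestImageUrl helper and model
// names below are illustrative placeholders, not the actual implementation.
import OpenAI from "openai";

const openai = new OpenAI();

// Step 3: the local Vision function is advertised to the Chat Completions
// model as a tool, alongside the conversation so far.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "vision",
      description:
        "Answer a question about the most recent image the user uploaded in this conversation.",
      parameters: {
        type: "object",
        properties: {
          query: {
            type: "string",
            description: "Short phrase describing what the user asked about the image.",
          },
        },
        required: ["query"],
      },
    },
  },
];

// Placeholder for step 7's lookup: the real plugin reads the forum's Uploads
// table; substitute whatever storage your conversation images live in.
async function lookupLatestImageUrl(conversationId: string): Promise<string> {
  return `https://example.com/uploads/${conversationId}/latest.png`; // must be public
}

export async function answer(
  conversationId: string,
  history: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
) {
  // Steps 3-5: the model decides on its own whether to call the vision tool.
  const first = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: history,
    tools,
  });
  const message = first.choices[0].message;
  const call = message.tool_calls?.[0];
  if (!call || call.type !== "function" || call.function.name !== "vision") {
    return message.content;
  }

  // Steps 6-7: unpack the query, find the image URL, and send both to a
  // vision-capable model.
  const { query } = JSON.parse(call.function.arguments) as { query: string };
  const imageUrl = await lookupLatestImageUrl(conversationId);
  const visionAnswer = await openai.chat.completions.create({
    model: "gpt-4o", // or whatever vision model is configured in settings
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: query },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  });

  // Steps 8-9: feed the vision answer back as the tool result so the model
  // can rephrase it for the user.
  const second = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...history,
      message,
      {
        role: "tool",
        tool_call_id: call.id,
        content: visionAnswer.choices[0].message.content ?? "",
      },
    ],
    tools,
  });
  return second.choices[0].message.content;
}
```

The key point is step 4: the model itself decides to call the Vision function based on the tool description, so the local code never has to guess whether the User is asking about an image.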

Hi @merefield,

Appreciate the message! I was just confused by your GitHub repo, so I was curious to hear your approach. Anyway, thanks for the reply :slight_smile:


Yeah, the whole repo would be overwhelming; that’s why I shared the PR. If you look at “Files Changed” on that PR, it may be easier to digest.