Integrating Vision with Assistant API

Hi all,

Hope you can help me with this. I have a functional Assistants API integration built with Node.js, and I’m looking to enhance the user experience by letting users add images within the same conversation. The assistant should analyze these images and seamlessly continue the conversation based on the analysis.

Does anyone have experience/code to share with me? Would be great!


My PR (albeit in RoR) might help:

Thanks @merefield

Can you maybe explain the logic steps?

  1. Send the image to the backend
  2. Keep the Vision call in a separate function
  3. When the backend route receives an image from the frontend, trigger the Vision function and analyze the data
  4. Stream the result back to the frontend (roughly as sketched below)
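
In rough sketch form, this is the shape I have in mind (Node.js pseudocode only; analyzeImageWithVision is a made-up helper name and the run/streaming part is elided):

// Rough sketch: "analyzeImageWithVision" is a hypothetical helper; streaming is elided
router.post("/chat", async (req, res) => {
  const { threadId, userMessage, imageUrl } = req.body;

  let messageForAssistant = userMessage;

  // Steps 2-3: if an image came along, run it through the separate Vision function first
  if (imageUrl) {
    const analysis = await analyzeImageWithVision(imageUrl, userMessage); // hypothetical helper
    messageForAssistant = `${userMessage}\n\nImage analysis: ${analysis}`;
  }

  // Step 4: add the combined message to the thread, run the assistant, and stream the result back
  await openai.beta.threads.messages.create(threadId, {
    role: "user",
    content: messageForAssistant,
  });
  // ...create a run on the thread and stream its events back to the frontend...
  res.status(200).json({ threadId });
});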

Is this also your approach?

I’d appreciate it if you could please read the code. It should be very readable and clear.

I have a little more time now. Apologies if I was a bit blunt, but I would have appreciated it if your question had shown that you’d read my code; I wasn’t convinced it had. Being more specific in your question would have saved me time.

No, that’s not the flow with my (Discourse) Chatbot. It works like this:

  1. At some point in the conversation, an image is uploaded to the forum (in this case, Discourse)
  2. User asks a question about the image.
  3. Send the list of functions to the Chat Completions model (which includes the local Vision function) and include the query the User just made.
  4. I believe the LLM works out that it should call the local Vision function because it identifies a strong semantic relationship between the function definition and the User's query (this is all done on the OpenAI side).
  5. The LLM responds with a function call that optionally includes a short phrase representing what the user asked about the image.
  6. My local code handles the response and calls the Vision function.
  7. The Vision function unpacks the query parameter, finds the image from the current conversation on the forum, gets its URL from the Uploads Rails model (a table of uploads), and then sends the query and the image URL to the GPT-4 Vision model (or whatever model is set in settings for Vision). NB: the image must be public!
  8. The response is packaged up and sent back as the answer to the Chat Completions model.
  9. The LLM then responds to the User with its repackaging of the answer. (A rough Node.js sketch of this loop follows below.)
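
For anyone more comfortable with Node.js, the same loop might look roughly like this. My actual implementation is Ruby/Rails, so treat this as a loose translation of the steps above; findLatestImageUrl and callVisionModel are made-up helper names, and the model name is just an example.

// Loose Node.js translation of the steps above; helpers are placeholders.
const tools = [
  {
    type: "function",
    function: {
      name: "vision",
      description: "Answer a question about the most recent image uploaded in this conversation",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "What the user wants to know about the image" },
        },
        required: ["query"],
      },
    },
  },
];

async function respond(messages) {
  // Steps 3-5: let the model decide whether the Vision function should be called
  const first = await openai.chat.completions.create({
    model: "gpt-4-turbo", // example model name
    messages,
    tools,
  });
  const reply = first.choices[0].message;

  if (!reply.tool_calls) return reply.content;

  // Steps 6-8: run the local Vision function and feed its answer back to the model
  messages.push(reply);
  for (const call of reply.tool_calls) {
    const { query } = JSON.parse(call.function.arguments);
    const imageUrl = await findLatestImageUrl();             // placeholder: look up the upload
    const analysis = await callVisionModel(query, imageUrl); // placeholder: GPT-4 Vision call
    messages.push({ role: "tool", tool_call_id: call.id, content: analysis });
  }

  // Step 9: the model repackages the Vision answer for the user
  const second = await openai.chat.completions.create({ model: "gpt-4-turbo", messages });
  return second.choices[0].message.content;
}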

Hi @merefield ,

Appreciate the message! I was just confused by your GitHub repo, so I was curious to hear your approach. Anyway, thanks for the reply :slight_smile:


Yeah, the whole repo would be overwhelming; that’s why I shared the PR. If you look at “Files Changed” on that PR, it may be easier to digest.

If anyone is interested in this, I got it working.

Let me know if you want to see the code snippet! Next up: text-to-speech :joy:

Off-topic: do you guys think Vision and text-to-speech will be integrated into the Assistants API any time soon?

Hi, can you share the code? I am creating a chatbot where I want the user to upload an image during a conversation. I have a different processing pipeline for images and I do not want to use GPT-4V. I am confused about how I should validate whether the user has uploaded an image or not.

Yes sure, here it is:

// Note: openai, router, threadResponses, sendEventToAllClients, assistantIdToUse
// and analyzeImage are defined elsewhere in this file.
router.post("/assistant-chat", async (req, res) => {
  const { threadId, imageUrl, contextText } = req.body;
  let { userMessage } = req.body;

  try {
    let currentThreadId = threadId;
    // If a threadId is provided and already known, continue; otherwise create a new thread
    if (!currentThreadId || !threadResponses[currentThreadId]) {
      // Create a new thread for the new conversation
      const threadResponse = await openai.beta.threads.create();
      currentThreadId = threadResponse.id;
      console.log("New thread created with ID:", currentThreadId);

      // Initialize storage for this new thread
      threadResponses[currentThreadId] = { events: [], clients: [] };
    } else {
      console.log("Continuing conversation on thread:", currentThreadId);
    }

    // Before sending the user's message to OpenAI, signal the start of a new message
    sendEventToAllClients(currentThreadId, { event: "messageStart", data: {} });

    // Check whether the user submitted an image
    if (imageUrl) {
      const visionResponse = await analyzeImage(imageUrl, contextText);
      userMessage = visionResponse;
    }

    // Add the user's message to the thread
    const messageResponse = await openai.beta.threads.messages.create(currentThreadId, {
      role: "user",
      content: userMessage,
    });
    console.log("User message added to the thread:", messageResponse);

    // Stream the Run using the newly created or existing thread ID
    const stream = openai.beta.threads.runs
      .createAndStream(currentThreadId, {
        assistant_id: assistantIdToUse, // Ensure this variable is correctly defined
      })
      .on("textCreated", (text) => {
        console.log("textCreated event:", text);
        sendEventToAllClients(currentThreadId, { event: "textCreated", data: text });
      })
      .on("textDelta", (textDelta) => {
        // Optionally log textDelta events
        console.log("textDelta event:", textDelta);
        sendEventToAllClients(currentThreadId, { event: "textDelta", data: textDelta });
      })
      .on("toolCallCreated", (toolCall) => {
        console.log("toolCallCreated event:", toolCall);
        sendEventToAllClients(currentThreadId, { event: "toolCallCreated", data: toolCall });
      })
      .on("toolCallDelta", (toolCallDelta) => {
        console.log("toolCallDelta event:", toolCallDelta);
        sendEventToAllClients(currentThreadId, { event: "toolCallDelta", data: toolCallDelta });
      })
      .on("end", () => {
        console.log("Stream ended for threadId:", currentThreadId);
        sendEventToAllClients(currentThreadId, { event: "end", data: null });
      });

    res.status(200).json({ threadId: currentThreadId });
  } catch (error) {
    console.error("Error handling /assistant-chat:", error);
    res.status(500).send("Internal server error");
  }
});
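
On the earlier question about validating whether the user actually uploaded an image: the route above only checks that imageUrl is truthy. A slightly stricter check could look something like this (just a sketch, assuming the frontend sends either a public http(s) URL or a base64 data URL):

// Sketch only: accepts either a public http(s) URL or a data URL like "data:image/png;base64,..."
function looksLikeImage(imageUrl) {
  if (typeof imageUrl !== "string" || imageUrl.length === 0) return false;
  if (imageUrl.startsWith("data:image/")) return true;
  try {
    const parsed = new URL(imageUrl);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false;
  }
}

// In the route, instead of "if (imageUrl)":
// if (looksLikeImage(imageUrl)) { ... }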

And here is the Vision API call:

async function analyzeImage(base64Image, contextText) {
  // Earlier prompt variants, kept for reference:
  // let promptText = "Analyze this image as if you are an experienced interviewer at a top management consulting firm (MBB). Provide detailed feedback on how you would evaluate this in the context of a case interview or other consulting-related assessment.";
  // let promptText = "Analyze this image as if you are an experienced interviewer at a top management consulting firm (MBB). Provide detailed feedback on how you would evaluate this in the context of a case interview or other consulting-related assessment, or in general how you would solve the problem in the picture.";
  let promptText = "Analyze this image in the context of the following conversation:\n\n";

  if (contextText && contextText !== "") {
    promptText += `${contextText}\n\n`;
    promptText +=
      "Provide detailed feedback on how you would evaluate this image in the context of a case interview, management consulting assessment, or problem-solving scenario based on the conversation context. If the image contains mathematical content, please attempt to solve or analyze it.";
  }

  const payload = {
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: promptText,
          },
          {
            type: "image_url",
            image_url: {
              url: base64Image, // a public URL or a base64 data URL
            },
          },
        ],
      },
    ],
    max_tokens: 500,
  };

  const response = await openai.chat.completions.create(payload);
  const imageAnalysis = response.choices[0].message.content;
  return `The following response is from the Vision API. Store it in the chat history so that the user can ask questions related to it: "${imageAnalysis}". Also return the user response as "The image was analyzed, and the following observations were made: ${imageAnalysis}. How would you like to proceed?"`;
  // return `The image was analyzed and the following observations were made: ${imageAnalysis}. How would you like to proceed?`;
}
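
One note on the base64Image parameter: it is passed straight through as image_url.url, and the Chat Completions vision input accepts either a public URL or a data URL. If you read the image file server-side, you would build the data URL first, something like this (sketch; the file path and MIME type are just examples):

const fs = require("fs");

// Sketch: turn a local image file into a data URL the vision model accepts
function toDataUrl(filePath, mimeType = "image/jpeg") {
  const base64 = fs.readFileSync(filePath).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Example usage with the analyzeImage function above (path is illustrative):
// const analysis = await analyzeImage(toDataUrl("./uploads/chart.png", "image/png"), contextText);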