Integrating Vision with the Assistants API

Hi all,

Hope you can help me with this. I have a working assistant built with the Assistants API in Node.js, and I’m looking to enhance the user experience by letting users add images within the same conversation. The assistant will analyze these images and seamlessly continue the conversation based on the analysis.

Does anyone have experience/code to share with me? Would be great!

My PR (albeit in RoR) might help:

Thanks @merefield

Can you maybe explain the logic steps?

  1. Send the image to the backend
  2. Keep the Vision function separate
  3. When the backend route receives an image from the frontend, trigger the Vision function and analyze the data
  4. Stream the result back to the frontend (rough sketch of what I mean below)

Is this also your approach?
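
For context, here’s a minimal sketch of what I mean in TypeScript, assuming an Express backend with multer for the upload and the official OpenAI Node SDK; the route, field and model names are just placeholders:

```ts
// Step numbers below refer to the list above. Assumptions: Express + multer
// for the upload, and the OpenAI Node SDK; gpt-4o stands in for whichever
// vision-capable model is used.
import express from "express";
import multer from "multer";
import OpenAI from "openai";

const app = express();
const upload = multer({ storage: multer.memoryStorage() });
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Step 1: the frontend POSTs the image (plus an optional question) here.
app.post("/vision", upload.single("image"), async (req, res) => {
  if (!req.file) {
    res.status(400).json({ error: "no image uploaded" });
    return;
  }

  // Steps 2-3: hand the image to a separate Vision call; here it is a
  // streamed Chat Completions request with the image inlined as a data URL.
  const dataUrl = `data:${req.file.mimetype};base64,${req.file.buffer.toString("base64")}`;
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    stream: true,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: (req.body?.question as string) ?? "Describe this image." },
          { type: "image_url", image_url: { url: dataUrl } },
        ],
      },
    ],
  });

  // Step 4: stream the analysis back to the frontend as plain text chunks.
  res.setHeader("Content-Type", "text/plain; charset=utf-8");
  for await (const chunk of stream) {
    res.write(chunk.choices[0]?.delta?.content ?? "");
  }
  res.end();
});

app.listen(3000);
```

The frontend would POST a multipart form with an `image` file (and optionally a `question` field) and read the streamed text as it arrives.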

I’d appreciate it if you could please read the code. It should be very readable and clear.

I have a little more time now. Apologies if I was a bit blunt, but I would have appreciated it if your question had demonstrated that you had read my code. I wasn’t convinced! It would have saved me time if you had been more specific in your question.

No, that’s not the flow with my (Discourse) Chatbot. It works like this (there’s a rough sketch after the list):

  1. At some point in the conversation, an image is uploaded to the forum (in this case, Discourse).
  2. The User asks a question about the image.
  3. Send the list of functions (which includes the local Vision function) to the Chat Completions model, along with the query the User just made.
  4. I believe the LLM works out that it should call the local Vision function because it identifies a strong semantic relationship between the function definition and the User’s query (this all happens on OpenAI’s side).
  5. The LLM responds with a function call, optionally including a short phrase that represents what the User asked about the image.
  6. My local code handles the response and calls the Vision function.
  7. The Vision function unpacks the query parameter, finds the image from the current conversation on the forum, gets its URL from the Uploads Rails model (a table of uploads), and then sends the query and the image URL to the GPT-4 Vision model (or whatever is set in settings for Vision). NB: the image must be public!
  8. The response is packaged up and sent back to the Chat Completions model as the answer to the function call.
  9. The LLM then responds to the User with its repackaging of the answer.
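
Purely to illustrate (my plugin is Ruby on Rails, so this is not my actual code), those steps map roughly onto something like this in TypeScript with the OpenAI Node SDK; the `vision` tool definition, the `lookupLatestImageUrl` helper and the model names are placeholders:

```ts
// A rough TypeScript re-sketch of the nine steps above. The real plugin is
// Ruby on Rails; the "vision" tool, lookupLatestImageUrl helper and model
// names below are illustrative placeholders, not the actual implementation.
import OpenAI from "openai";

const openai = new OpenAI();

// Step 3: the local Vision function is advertised to the Chat Completions
// model as a tool, alongside the conversation so far.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "vision",
      description:
        "Answer a question about the most recent image the user uploaded in this conversation.",
      parameters: {
        type: "object",
        properties: {
          query: {
            type: "string",
            description: "Short phrase describing what the user asked about the image.",
          },
        },
        required: ["query"],
      },
    },
  },
];

// Placeholder for step 7's lookup: the real plugin reads the forum's Uploads
// table; substitute whatever storage your conversation images live in.
async function lookupLatestImageUrl(conversationId: string): Promise<string> {
  return `https://example.com/uploads/${conversationId}/latest.png`; // must be public
}

export async function answer(
  conversationId: string,
  history: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
) {
  // Steps 3-5: the model decides on its own whether to call the vision tool.
  const first = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: history,
    tools,
  });
  const message = first.choices[0].message;
  const call = message.tool_calls?.[0];
  if (!call || call.type !== "function" || call.function.name !== "vision") {
    return message.content;
  }

  // Steps 6-7: unpack the query, find the image URL, and send both to a
  // vision-capable model.
  const { query } = JSON.parse(call.function.arguments) as { query: string };
  const imageUrl = await lookupLatestImageUrl(conversationId);
  const visionAnswer = await openai.chat.completions.create({
    model: "gpt-4o", // or whatever vision model is configured in settings
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: query },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  });

  // Steps 8-9: feed the vision answer back as the tool result so the model
  // can rephrase it for the user.
  const second = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...history,
      message,
      {
        role: "tool",
        tool_call_id: call.id,
        content: visionAnswer.choices[0].message.content ?? "",
      },
    ],
    tools,
  });
  return second.choices[0].message.content;
}
```

The key point is step 4: the model itself decides to call the Vision function based on the tool description, so the local code never has to guess whether the User is asking about an image.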

Hi @merefield,

Appreciate the message! I was just confused by your GitHub repo, so I was curious to hear your approach. Anyway, thanks for the reply :slight_smile:


Yeah, the whole repo would be overwhelming; that’s why I shared the PR. If you look at “Files Changed” on that PR, it may be easier to digest.