I'm currently developing a simple chatbot UI using Next.js and the OpenAI library for JavaScript, and I've run into the following problem:
I currently have two endpoints: one for normal chat, where I pass the model as a parameter (in this case "gpt-4"), and another where I pass gpt-4-vision. So I have two separate endpoints to handle images and text.
Is there any way to handle both in a single chat session (like ChatGPT does right now)? The documentation is not clear and gives no examples of how to integrate both functionalities in one chat. Should we upload the file separately and then send it as a message inside the context (as an image URL or some other reference)?
For example:
{
  "role": "user",
  "content": "Message: ${message}? ImageUrl: {image URL after uploading to the OpenAI server}"
}
Can anyone help? Has anyone run into the same problem before?
Any ideas are welcome.
You can have a conversation thread in vision just like with the chat model:
data = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "assistant",
            "content": "Hello! How can I help you today?"
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ]
}
Is that what you mean?
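If the image comes from a user upload rather than a public URL, one option is to embed it in the same message as a base64 data URL, so text and image travel together in one user turn. The sketch below is only an illustration: `build_user_message` and its parameters are hypothetical names, and it assumes the image bytes are already available on your server.

```python
import base64


def build_user_message(text, image_bytes=None, mime_type="image/jpeg"):
    """Build one user message; attach the image as a base64 data URL if given.

    Hypothetical helper: the name and parameters are assumptions, not part
    of the OpenAI library.
    """
    if image_bytes is None:
        # Text-only turn: plain string content works with any chat model.
        return {"role": "user", "content": text}

    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }
```

The returned dict can be appended to the same "messages" list used in the thread above, so a single conversation history can mix text-only and image turns.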
If the user uploads an image, or if there are images anywhere in the thread, you can just switch to vision; otherwise you can stay with turbo to save on cost or on your requests-per-day (RPD) limit.
I'm thinking about the pricing here. Yes, I know I could chat normally with the vision model, but that would be costly compared to "normal gpt-4", right? I don't know how OpenAI handles this. Maybe I can iterate over the messages, check whether any of them contains an image part, and switch to the normal model if none does?
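The check described here (scan the context for image parts and pick the model accordingly) could be sketched roughly as follows. The helper names `has_image` and `pick_model` are made up for illustration, and the model names are simply the ones mentioned in this thread; swap in whatever models you actually use.

```python
TEXT_MODEL = "gpt-4"                   # cheaper text-only model from this thread
VISION_MODEL = "gpt-4-vision-preview"  # vision model from this thread


def has_image(message):
    """Return True if a message's content list contains an image_url part."""
    content = message.get("content")
    if not isinstance(content, list):
        # Plain string (or missing) content cannot carry an image.
        return False
    return any(
        isinstance(part, dict) and part.get("type") == "image_url"
        for part in content
    )


def pick_model(messages):
    """Use the vision model only when some message in the context has an image."""
    return VISION_MODEL if any(has_image(m) for m in messages) else TEXT_MODEL
```

With this, a single endpoint can call `pick_model(messages)` before each request instead of routing text and images to separate endpoints.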
If the rate limit is a showstopper, you could add a function/tool to plain old GPT-4: if there are images in your thread, you just mask them. If the user is trying to reference an image, or if an image needs to be referenced for an answer, GPT-4 calls the function, and on that call you send the whole unmasked thread to vision.
That means some vision calls might be almost twice as expensive in terms of context, since the same thread is sent first to GPT-4 and then again to vision, but you might be able to optimize that with some word filtering or other heuristics.
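A rough sketch of the masking idea, under several assumptions: the tool name `look_at_images`, the placeholder text, and the helper names are all invented for illustration, and the assistant reply is assumed to be available as a plain dict with an optional "tool_calls" list. No API call is made here; the code only prepares the masked thread and checks whether a fallback to vision was requested.

```python
IMAGE_PLACEHOLDER = "[image omitted; call look_at_images to view it]"

# Hypothetical tool definition to pass alongside the masked thread.
LOOK_AT_IMAGES_TOOL = {
    "type": "function",
    "function": {
        "name": "look_at_images",
        "description": "Call this when answering requires seeing an image in the thread.",
        "parameters": {"type": "object", "properties": {}},
    },
}


def mask_images(messages):
    """Return a copy of the thread with every image part replaced by a text placeholder."""
    masked = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            parts = []
            for part in content:
                if part.get("type") == "image_url":
                    parts.append(IMAGE_PLACEHOLDER)
                elif part.get("type") == "text":
                    parts.append(part["text"])
            # Flatten to a plain string so the text-only model accepts it.
            masked.append({**message, "content": "\n".join(parts)})
        else:
            masked.append(message)
    return masked


def wants_vision(assistant_message):
    """Return True if the text model asked to see the images via the tool."""
    tool_calls = assistant_message.get("tool_calls") or []
    return any(call["function"]["name"] == "look_at_images" for call in tool_calls)
```

When `wants_vision` returns True, the original, unmasked thread would be resent to the vision model; otherwise the text model's answer can be used directly.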
Hi @benjamin.bascary, were you able to figure out the pricing question? For a follow-up question in the same chat session, is the image from the first message converted to tokens again?