GPT-4 API and image input

I asked GPT-4 to guess what the API call would look like, and it said:

As of my last training data in January 2022, OpenAI hadn’t publicly released a specific “ChatGPT-4” API. My training data includes information up to ChatGPT-3, and there was an API for that.

However, let’s assume the API would be somewhat similar to ChatGPT-3’s. In the case of ChatGPT-3, the API call would be something like:

import requests

def call_chatgpt4_to_transcribe(image_path):
    url = "https://api.openai.com/v1/engines/chatgpt-4.0-turbo/completions"  # Assuming hypothetical endpoint

    headers = {
        'Authorization': 'Bearer YOUR_OPENAI_API_KEY',
        'Content-Type': 'application/json',
    }

    data = {
        "prompt": "Transcribe the following image:",  # This is just an example; the actual API might require direct image input
        "image": image_path  # Hypothetical, since ChatGPT-3 doesn't support image input directly
    }

    response = requests.post(url, headers=headers, json=data)
    result = response.json()

    # Assuming the transcribed text is available under the 'choices' key
    return result['choices'][0]['text']

No, the AI can’t answer in any meaningful way.

You can expect that when the API is turned on, the role message “content” schema will also accept a list (array) type instead of just a string.

Array elements can then be either a normal prompt string or a dictionary (JSON object) with an “image” key whose value is byte-stream-encoded image data, as in the sketch below.
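A minimal sketch of what such a payload might look like, assuming the speculated schema above holds (the {"image": ...} element and its encoding are guesses, not a confirmed API):

import base64

# Speculative payload: "content" is a list mixing a plain prompt string
# with a dictionary carrying encoded image data. Nothing here is official.
with open("photo.png", "rb") as f:  # "photo.png" is a placeholder path
    image_data = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            "Transcribe the text in this image:",  # ordinary prompt string
            {"image": image_data},                 # hypothetical image element
        ],
    },
]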

Okay, let’s consider other APIs with file input; it will not be just a path. Based on descriptions by people with access to ChatGPT with image analysis enabled, it seems you can upload several images. From another post, it seems that under the Chat Completions API…

const completion = await openai.chat.completions.create({
    messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Here are two pictures of vegetables I took from the supermarket, which one will be good for curry?" }
    ],
    files: [
        fs.createReadStream("IMG0001.png"),
        fs.createReadStream("IMG0002.png")
    ],
    model: "gpt-4-vision",
});

Of course, these are just wild guesses, for science! Can’t wait until DevDay when they unveil everything.

Thank you, supershaneski and fred4! We will try!

How did this site become Reddit, where you simply make your nonsense look plausible?
You will: try supplying a “prompt” parameter to a nonexistent chat model through an “engines” endpoint and fail… :smile:

If you like code:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
image = b"..."  # binary image stream
completion = openai.ChatCompletion.create(
    model="gpt-4-0xxx",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant and can describe images.",
        },
        {
            "role": "user",
            "content": ["What's in this screenshot?", {"image": image}],
        },
    ],
)
print(completion["choices"][0]["message"]["content"])

… as demonstrated seven months ago

{
    "role": "user",
    "content": ["What's in this screenshot?", {"image": image}],
},

This makes sense, as I cannot imagine how to put the image in context with a separate file input parameter. It definitely needs to be attached in the messages, as in the sketch below!
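For instance, a minimal sketch under the list-style “content” assumption above, with two images riding along inside the same user message (the {"image": ...} elements remain speculative):

# Speculative: the images are attached inline in the message content,
# not passed through a separate top-level "files" parameter.
with open("IMG0001.png", "rb") as f1, open("IMG0002.png", "rb") as f2:
    image_1, image_2 = f1.read(), f2.read()

messages = [
    {
        "role": "user",
        "content": [
            "Here are two pictures of vegetables, which one will be good for curry?",
            {"image": image_1},  # hypothetical inline image element
            {"image": image_2},
        ],
    },
]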

What if we have GPT-4-32K (Enterprise)? Could we use the API for a vision task?

The only “enterprise” offering by OpenAI that would mention 32k is ChatGPT Enterprise. It is not “the API”; it is a business-oriented version of ChatGPT, with centralized billing and user management for the web-interface chatbot.

Hi,
could you share the code? I am using the GPT-4 API key and it works on text, but I want to know how it will work on images.

See https://gist.github.com/abrichr/259b83b72fe0055d4f27ca9b3d387bc7 for a script that queries the OpenAI API with images and text.

Reproduced here:

"""Example usage of GPT4-V API.

Usage:

    OPENAI_API_KEY=<your_api_key> python3 gpt4v.py \
        [<path/to/image1.png>] [<path/to/image2.jpg>] [...] "text prompt"

Example:

    OPENAI_API_KEY=xxx python3 gpt4v.py photo.png "What's in this photo?"
"""

from pprint import pprint
import base64
import json
import mimetypes
import os
import requests
import sys


api_key = os.getenv("OPENAI_API_KEY")


def encode_image(image_path: str):
    """Encodes an image to base64 and determines the correct MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {image_path}")

    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
        return f"data:{mime_type};base64,{encoded_string}"


def create_payload(images: list[str], prompt: str, model="gpt-4-vision-preview", max_tokens=100, detail="high"):
    """Creates the payload for the API request."""
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
            ],
        },
    ]

    for image in images:
        base64_image = encode_image(image)
        messages[0]["content"].append({
            "type": "image_url",
            "image_url": {
                "url": base64_image,
                "detail": detail,
            }
        })

    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens
    }


def query_openai(payload):
    """Sends a request to the OpenAI API and prints the response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    return response.json()


def main():
    # Require at least a text prompt; any preceding arguments are image paths
    # (the docstring marks images as optional, so only the prompt is mandatory)
    if len(sys.argv) < 2:
        print('Usage: python3 gpt4v.py [image1.jpg] [image2.png] ... "Text Prompt"')
        sys.exit(1)

    # All arguments except the last one are image paths
    image_paths = sys.argv[1:-1]

    # The last argument is the text prompt
    prompt = sys.argv[-1]

    payload = create_payload(image_paths, prompt)
    response = query_openai(payload)
    pprint(response)


if __name__ == "__main__":
    main()
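If you want just the model’s reply rather than the raw JSON, it can be pulled from the standard Chat Completions response structure, e.g. in place of the pprint call in main():

# Extract only the assistant's reply text from the response.
response = query_openai(payload)
print(response["choices"][0]["message"]["content"])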