Vision API - how to get longer outputs?

I’m using the Vision API, and whatever I do it only generates a short caption, with the text cut off midway through a sentence. Is there some way to relax the output token limit or tell it to provide a longer output?

I tried these two prompts

"write a short caption for this image",

// vs

Describe this image in detail including the artist's intent
and the techniques used, who is featured in the image,
and what the image is about.
Be descriptive about the specifics of the images.
Write at least 100 words and up to 500 words about the image.

and get results that are about the same length in both cases:

A Tranquil Path Through Verdant Fields: Embracing Nature's Serenity

This image captures a serene outdoor landscape, focused on a wooden boardwalk that me

so ‘meanders’ is cut off I guess.

When I tried the “you are an expert art critic and expert in …” trick, I got this response:

I'm sorry, but it seems there has been a misunderstanding. I am not

still cut off :smiley:

sauce taken pretty much right off their site:

    const response = await this.openai.chat.completions.create({
      model: "gpt-4-vision-preview",
      messages: [
        {
          role: "user",
          content: [
            {
              type: "text",
              text: prompt,
            },
            {
              type: "image_url",
              image_url: {
                url: imageUrl,
                detail: "high",
              },
            },
          ],
        },
      ],
    });

The vision model is given a default maximum output length that is significantly lower than you would expect for most descriptions of images.

You will need to add an API parameter, alongside model, that gives the new maximum output, in tokens.

      max_tokens: 1024,

That will give you breathing room for about as much as the AI is ever going to write for you about a picture.
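Putting it together, here is a minimal sketch of the full request with the parameter added; the prompt text and image URL are placeholders standing in for your own `prompt` and `imageUrl` variables:

```javascript
// Same request as in the question, with max_tokens added alongside model.
const request = {
  model: "gpt-4-vision-preview",
  max_tokens: 1024, // raise the default output cap
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image in detail." },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/photo.jpg",
            detail: "high",
          },
        },
      ],
    },
  ],
};

// const response = await openai.chat.completions.create(request);
```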

Then your instructions about length will actually have room to take effect.

It is also good to have a system prompt that establishes that the AI does have computer vision and should use it. Here’s one that just happens to be open in an editor on my desktop:

You are VisionPal, an AI assistant powered by GPT-4 with computer vision.
AI knowledge cutoff: April 2023
Built-in vision capabilities:
- extract text from image
- describe images
- analyze image contents
- logical problem-solving requiring machine vision

You can place more permanent behaviors there; instructions coming from the “user” role might not always be believed.
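To illustrate where that persona goes, here is a minimal sketch of a messages array with a `system` role message ahead of the vision `user` message. The prompt text and image URL are placeholders, not anything from the original post:

```javascript
// Hypothetical sketch: a "system" message carrying the persona, followed by
// the usual vision "user" message. URL and prompt text are illustrative.
const systemPrompt =
  "You are VisionPal, an AI assistant powered by GPT-4 with computer vision.";

const messages = [
  { role: "system", content: systemPrompt },
  {
    role: "user",
    content: [
      { type: "text", text: "Describe this image in detail." },
      {
        type: "image_url",
        image_url: { url: "https://example.com/photo.jpg" },
      },
    ],
  },
];
```

The same `messages` array then drops into the `create()` call shown earlier.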

The AI doesn’t do a good job of counting words - nor is something like 100-500 words going to be the guidance you want to give it. Give the AI tasks.

Unlike other chat models, `max_tokens` for vision models isn’t set to consume all the remaining context length; it defaults to a low value.

As @_j points out, you can set a sufficient value for `max_tokens` to make sure that longer responses aren’t cut off.
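You can also detect truncation programmatically: the API reports `finish_reason: "length"` on a choice when the `max_tokens` cap was hit, versus `"stop"` when the model ended on its own. A small sketch (the `completion` object here is a mock, not a real API response):

```javascript
// Returns true when the first choice stopped because it ran out of tokens.
// finish_reason is "length" when max_tokens was hit, "stop" otherwise.
function wasTruncated(completion) {
  return completion.choices[0].finish_reason === "length";
}

// Mock response object, shaped like a chat completion, for illustration:
const completion = {
  choices: [{ finish_reason: "length", message: { content: "This image cap" } }],
};
```

Checking this after each call is an easy way to confirm whether a bigger `max_tokens` is needed rather than guessing from the text.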
