GPT-4o-mini high vision cost

But it’s the same resolution, the same neural net, and arguably more features you’re trying to extract?

1 Like

Got the same issue here.
I think it’s not cheap for multimodal reasoning.

It’s not a bug. It’s a feature. :upside_down_face:

1 Like

This doesn’t seem right, despite what OpenAI’s head of stuff said. Anthropic’s Claude 3 models have consistent token usage across models for image processing, meaning for consumers, it is an order of magnitude better to be using Claude 3 Haiku over gpt-4o mini right now for low-cost vision applications.

1 Like

Perhaps. But, if this is true:

There must be some reason for the extreme difference in pricing for the same functionality. Maybe, as has been suggested earlier, somebody hasn’t thought it through.

Agreed.

Or, Gemini Flash/Pro:

What a disappointment I just experienced upon discovering that analyzing images with gpt-4o-mini costs the same as with gpt-4o… I hope they change this soon, because it doesn’t make much sense for the price to be the same. In the meantime, I’ll continue using Claude 3 Haiku and Gemini Flash.

2 Likes

Except… if the work and its costs for OpenAI are the same, and the result for users is the same, then the same pricing is very logical.

We pay less for text because the lite version returns less, ergo it is cheaper there.
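
For concreteness, here’s the arithmetic as I understand it. This is a back-of-the-envelope sketch: the per-image token constants (85 base + 170 per tile for gpt-4o, 2833 + 5667 for gpt-4o-mini) and the prices are my reading of the published pricing pages at the time, so treat them as assumptions:

# Rough cost comparison for one 1024x1024 "high detail" image.
# Token constants and prices are assumptions from published pricing pages.

tiles = 4  # 1024x1024 is downscaled to 768x768, i.e. a 2x2 grid of 512px tiles

gpt4o_tokens = 85 + tiles * 170    # 765 tokens
mini_tokens = 2833 + tiles * 5667  # 25,501 tokens

gpt4o_cost = gpt4o_tokens * 5.00 / 1_000_000  # $5.00 per 1M input tokens
mini_cost = mini_tokens * 0.15 / 1_000_000    # $0.15 per 1M input tokens
print(f"gpt-4o:      {gpt4o_tokens} tokens -> ${gpt4o_cost:.6f}")  # ~$0.003825
print(f"gpt-4o-mini: {mini_tokens} tokens -> ${mini_cost:.6f}")    # ~$0.003825

# Anthropic's docs estimate image tokens as width * height / 750:
haiku_tokens = 1024 * 1024 // 750             # ~1398 tokens
haiku_cost = haiku_tokens * 0.25 / 1_000_000  # $0.25 per 1M input tokens
print(f"claude-3-haiku: {haiku_tokens} tokens -> ${haiku_cost:.6f}")  # ~$0.00035

The mini token count is ~33x gpt-4o’s, which exactly offsets the ~33x price gap, so the dollar cost per image comes out identical. Haiku lands roughly an order of magnitude cheaper, which lines up with the earlier comment.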

I was also let down by the (in my opinion) ambiguous pricing of gpt-4o mini in the news release.

I’ve been extensively testing gpt4o-mini with OCR and found the best results (and cheapest) to be:

  • Use Gemini Flash 1.5 to OCR the document and transform all content to markdown (I like this step because I classify the documents as well)
  • Use GPT-4o Mini to structure the markdown & infer information

GPT-4o Mini just isn’t up to the task of converting to markdown - it seems to leave out important details, and also for whatever reason screws up characters enough times to be a serious issue.

Also, although mini’s input/output text tokens are much cheaper, it still works out more expensive than Flash once images are involved.

This approach, I’ve found, takes advantage of both models’ strengths and is still much cheaper (rough sketch below). :muscle:

Would love to hear other people’s opinions and testing results on this.
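
If anyone wants to reproduce it, here’s a minimal sketch of the two-step flow, assuming the google-generativeai and openai Python packages; the model names are real, but the prompts and the invoice.png path are placeholders of mine:

import google.generativeai as genai
from openai import OpenAI
from PIL import Image

# Step 1: OCR the document to markdown with Gemini Flash.
genai.configure(api_key="GEMINI_API_KEY")  # placeholder key
flash = genai.GenerativeModel("gemini-1.5-flash")
page = Image.open("invoice.png")  # placeholder path
ocr = flash.generate_content(
    ["Transcribe this document to markdown. Preserve tables and headings.", page]
)
markdown = ocr.text

# Step 2: structure the markdown with gpt-4o-mini. This stage is text-only,
# so mini's inflated image pricing never applies.
client = OpenAI()
structured = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the key fields from the given markdown as JSON."},
        {"role": "user", "content": markdown},
    ],
    temperature=0.0,
)
print(structured.choices[0].message.content)

The point of the split is that the image tokens get billed at Flash’s rates, and mini only ever sees text.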

I’d also like to point out that the model can’t know whether it’s a Mona Lisa or an invoice without parsing it first.

I’m wondering if you could, though, by measuring the entropy of the pixels.


I actually wrote a quick function in Rust to do this. It looks promising. Efficient tho? Naw LOL

use std::collections::HashMap;

use image::{GenericImageView, Pixel};
use rand::seq::SliceRandom;

/// Calculates the Shannon entropy of an image's pixel colors.
/// ### Parameters
/// - `img_path` - The path to the image.
/// - `slice_percentage` - The percentage of the image's rows to sample.
///   - A sample of 50.0 runs in roughly half the time but may be less accurate.
///   - `None` samples every row.
fn calculate_image_entropy(img_path: &str, slice_percentage: Option<f64>) -> f64 {
    let img = image::open(img_path).unwrap();
    let (width, height) = img.dimensions();

    // Pick a random subset of rows if a sample percentage was given.
    let lines_to_sample: Vec<u32> = if let Some(percentage) = slice_percentage {
        let num_lines = (height as f64 * percentage / 100.0).round() as u32;
        let mut rng = rand::thread_rng();
        let mut line_indices: Vec<u32> = (0..height).collect();
        line_indices.shuffle(&mut rng);
        line_indices.into_iter().take(num_lines as usize).collect()
    } else {
        (0..height).collect()
    };

    // Build a histogram of RGB values over the sampled pixels.
    let mut color_counts = HashMap::new();
    let mut total_pixels = 0u64;

    for y in lines_to_sample {
        for x in 0..width {
            let rgb = img.get_pixel(x, y).to_rgb();
            *color_counts.entry(rgb).or_insert(0u64) += 1;
            total_pixels += 1;
        }
    }

    // Shannon entropy: H = -sum(p * log2(p)) over the color distribution.
    color_counts.values().fold(0.0, |entropy, &count| {
        let p = count as f64 / total_pixels as f64;
        entropy - p * p.log2()
    })
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_calculate_image_entropy() {
        // Point this at a folder of test images before running.
        let testing_folder_loc: &str = panic!("SET ME BRUH");
        let image_extensions = vec!["jpg", "jpeg", "png", "bmp", "gif", "tiff", "webp"];

        // Gather image-related file paths
        let mut image_paths = vec![];
        for extension in image_extensions {
            let pattern = format!("{}/**/*.{}", testing_folder_loc, extension);
            for entry in glob::glob(&pattern).unwrap() {
                match entry {
                    Ok(path) => {
                        image_paths.push(path.to_str().unwrap().to_string());
                    }
                    Err(e) => eprintln!("Error: {}", e),
                }
            }
        }

        let limit = 10;

        for img_path in image_paths.iter().take(limit) {
            // Sample 50% of each image's rows, matching the results below.
            let entropy = calculate_image_entropy(img_path, Some(50.0));
            println!("Path: {}, Entropy: {}", img_path, entropy);
        }
    }
}

I ran a quick test and it looks nice. Any file prefixed with an underscore is an invoice:

Path: /home/—/Pictures/testing/WhatsApp Image 2024----.jpeg, Entropy: 10.737889390988476
Path: /home/—/Pictures/testing/__19—6af4(1).jpeg, Entropy: 2.4047782665299677
Path: /home/—/Pictures/testing/__190d—af4.jpeg, Entropy: 2.4299250255562
Path: /home/—/Pictures/testing/__190d—927.jpeg, Entropy: 1.3859442057582072
Path: /home/—/Pictures/testing/__190—8546.jpeg, Entropy: 1.5946823189806352
Path: /home/—/Pictures/testing/Eddie.png, Entropy: 7.431068607525445
Path: /home/—/Pictures/testing/__190c68f----74.png, Entropy: 1.886469926341877
Path: /home/—/Pictures/testing/DALL·E 2024-07-10 11.57.29 - —.webp, Entropy: 10.95400323794709

This was run using a 50% sample rate for each image.

4 Likes

I concur. However, I’ve been successful using Gemini Pro 1.5 for both tasks as described here: Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop.

Unfortunately, Gemini 1.5 Pro and Flash struggle mightily when it comes to strikethrough text:

This PDF: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf

Extracts to:
https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt

Pretty darned good. However, when the strikethroughs are in the titles:

This PDF: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_10.pdf

Extracts to:
https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_10.txt

Gemini somehow can’t see that:

Should extract as: “ARTICLE 9. Sick Leave”.

gpt-4o mini and gpt-4o do see it, albeit:

a. I have to upload PDF pages as individual images.
b. The cost for processing the images, in my opinion, is excessive.

And, in my use case, efficiently removing strikethrough text is critical.

UPDATE: Got code working with Claude Sonnet 3.5 that eliminates all strikethrough text (so far) in images without issue.

2 Likes

I wonder if specifically informing the model to use markdown to indicate strikethrough text could work?
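
Something like this, maybe? An untested sketch; the prompt wording, the placeholder URL, and the regex post-filter are just my guesses:

import re
from openai import OpenAI

client = OpenAI()

# Ask the model to *mark* struck-out text instead of silently removing it,
# then strip the marked spans deterministically afterwards.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": (
                "Extract the text from this image as Markdown. Wrap any text "
                "that has a horizontal line drawn through it in ~~double tildes~~."
            )},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/page_1.jpg"}  # placeholder URL
            }
        ]}
    ],
    temperature=0.0,
)

text = response.choices[0].message.content
clean = re.sub(r"~~.*?~~", "", text, flags=re.DOTALL)  # drop marked strikethrough
print(clean)

The hunch is that tagging is an easier task than omission, since leaving text out fights the model’s transcription tendencies.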

:open_mouth: do share!

Claude is great when it comes to code. Love it.

2 Likes

This is the gpt-4o-mini code I am using:

from openai import OpenAI 
import os

## Set the API key and model name
# MODEL="gpt-4o"
MODEL="gpt-4o-mini"
# gpt-4o-mini does not remove strikethrough text

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "{api-key}"))

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown."},
        {"role": "user", "content": [
            {"type": "text", "text": "Extract the text from this image. Strikethrough text are letters with horizontal lines through them. Exclude all strikethrough text."},
            {"type": "image_url", "image_url": {
                "url": "https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_10/page_1.jpg"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

This is the output:

  1. Sick Leave

Modify Article 9 of the Local #161 Motion Picture Theatrical and TV Series Production Agreement (and make conforming changes to Article 41 of the Local #161 Supplemental Digital Agreement) as follows:

“ARTICLE 9. WAIVER OF NEW YORK CITY EARNED SICK TIME ACT AND SIMILAR LAWS SICK LEAVE

etc…

Strikethrough text still included. Should say “ARTICLE 9 SICK LEAVE”. If you can think of a prompt that will get gpt-4o-mini to do it, I’m all ears!

As for the working Claude code, I am calling it through the Google Vertex AI API. Here is what it does:

  1. Converts the PDF to images.
  2. Uploads the images to an AWS S3 bucket.
  3. Calls sonnet-3.5 to extract the text (minus strikeouts).

This code uses the PyMuPDF, AWS SDK (boto3), and AnthropicVertex Python libraries.

import fitz  # PyMuPDF
import os
import sys
import boto3
import base64
import httpx
from anthropic import AnthropicVertex

def pdf_to_jpeg(pdf_path, output_folder):
    """Converts a PDF to JPEG images.
    Args:
        pdf_path: The path to the PDF file.
        output_folder: The folder where the JPEG images will be saved.
    """
    os.makedirs(output_folder, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_num in range(doc.page_count):
        page = doc[page_num]
        pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))  # Adjust DPI as needed
        pix.save(os.path.join(output_folder, f"page_{page_num + 1}.jpg"))
    doc.close()

def upload_to_s3(local_folder, s3_bucket, s3_output_key):
    """Uploads the contents of a local folder to an S3 bucket.
    Args:
        local_folder: The local folder to upload.
        s3_bucket: The S3 bucket to upload to.
        s3_output_key: The S3 key (directory) to upload to.
    """
    s3_resource = boto3.resource('s3')
    
    for root, dirs, files in os.walk(local_folder):
        for file in files:
            local_file_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_file_path, local_folder)
            s3_file_key = f"{s3_output_key}/{relative_path}".replace('//', '/')
            s3_resource.Bucket(s3_bucket).upload_file(local_file_path, s3_file_key)
    print(f'Output folder {local_folder} uploaded to s3://{s3_bucket}/{s3_output_key}')

def encode_image(url, media_type):
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(httpx.get(url).content).decode("utf-8"),
        },
    }

def process_images_with_anthropic(s3_bucket, s3_output_key, project_id, max_tokens=4096):
    LOCATION = "europe-west1"  # or "us-east5"
    client = AnthropicVertex(region=LOCATION, project_id=project_id)

    prompt = """
    You are a very professional image to text document extractor.
    Please extract the text from these images, treating them as pages of a PDF document. 
    A strikethrough is a horizontal line drawn through text, used to indicate the deletion of an error or the removal of text.  Ensure that all strikethrough text is excluded from the output. 
    Try to format any tables found in the images. 
    Do not include page numbers, page headers, or page footers.
    Please double-check to make sure that any words in all capitalized letters with strikethrough letters are excluded.
    Return only the extracted text.  No commentary.
    **Exclude Strikethrough:** Do not include any strikethrough words in the output. Even if the strikethrough words are in a title.
    **Include Tables:** Tables should be preserved in the extracted text.
    **Exclude Page Headers, Page Footers, and Page Numbers:** Eliminate these elements which are typically not part of the main content.
    """

    s3_client = boto3.client('s3')
    response = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=s3_output_key)

    # Sort keys by page number: S3 lists lexicographically, so page_10.jpg
    # would otherwise come before page_2.jpg.
    jpg_keys = [obj['Key'] for obj in response.get('Contents', []) if obj['Key'].endswith('.jpg')]
    jpg_keys.sort(key=lambda k: int(k.rsplit('page_', 1)[-1].split('.')[0]))

    content = []
    for key in jpg_keys:
        url = f"https://s3.us-west-2.amazonaws.com/{s3_bucket}/{key}"
        content.append(encode_image(url, "image/jpeg"))

    content.append({
        "type": "text",
        "text": prompt
    })

    message = client.messages.create(
        max_tokens=max_tokens,
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        model="claude-3-5-sonnet@20240620",
    )

    return message.content[0].text, message.usage

if __name__ == "__main__":
    if len(sys.argv) != 6:
        print("Usage: python script.py <pdf_path> <output_folder> <s3_bucket> <s3_output_key> <project_id>")
        sys.exit(1)

    pdf_path = sys.argv[1]
    output_folder = sys.argv[2]
    s3_bucket = sys.argv[3]
    s3_output_key = sys.argv[4]
    project_id = sys.argv[5]

    # Convert PDF to JPEG
    pdf_to_jpeg(pdf_path, output_folder)

    # Construct the full s3_output_key including the folder name
    folder_name = os.path.basename(output_folder)
    full_s3_output_key = f"{s3_output_key.rstrip('/')}/{folder_name}"

    # Upload to S3
    upload_to_s3(output_folder, s3_bucket, full_s3_output_key)

    # Process images with AnthropicVertex
    response, usage = process_images_with_anthropic(s3_bucket, full_s3_output_key, project_id)

    # Print the response
    print("Extracted text from all pages:")
    print(response)

    # Print the usage tokens
    print("\nUsage tokens:")
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Output tokens: {usage.output_tokens}")

This works perfectly, except that it is limited to 4096 output tokens, as is gpt-4o, which also works.
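
If the cap becomes a problem, one workaround would be to send one page per request and stitch the results together, so no single response has to carry the whole document. A rough, untested sketch reusing the client and the encode_image helper from above (the function itself is my own):

def extract_pages_individually(client, jpg_urls, prompt, max_tokens=4096):
    """Hypothetical helper: one request per page keeps each response
    comfortably under the 4096 output-token cap."""
    pages = []
    for url in jpg_urls:
        message = client.messages.create(
            max_tokens=max_tokens,
            messages=[{
                "role": "user",
                "content": [
                    encode_image(url, "image/jpeg"),
                    {"type": "text", "text": prompt},
                ],
            }],
            model="claude-3-5-sonnet@20240620",
        )
        pages.append(message.content[0].text)
    return "\n\n".join(pages)

The trade-off is that the model loses cross-page context, which may matter for tables or clauses that span pages.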

2 Likes

I am facing the same issue. Hopefully it’s only a bug; if not, mini is useless.

With the price adjustment to Gemini Flash on August 12th, the difference will be even bigger. Flash will be 275x cheaper for vision using my example figures, and it performs great for my use cases (extracting information from documents).

Text will also be 50% cheaper.

Even with a lot of in-context examples it will be incredibly cheap.

2 Likes