Optimizing GPT-4o's Vision Performance?

I’m developing an application that leverages the vision capabilities of the GPT-4o API, following the techniques outlined in the OpenAI Cookbook. My approach is to sample frames at regular intervals, convert them to base64, and provide them as image context for chat completions.
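Roughly, the pipeline looks like this (a minimal sketch assuming OpenCV for frame extraction and the v1 `openai` Python client; the helper name `sample_frames` and the file path are just placeholders):

```python
import base64

import cv2  # opencv-python, used here only for frame extraction
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def sample_frames(video_path, every_n_seconds=2, max_frames=10):
    """Grab one frame every N seconds and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(fps * every_n_seconds))
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # 10% JPEG quality, matching the compression described below
            ok, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), 10])
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
        idx += 1
    cap.release()
    return frames


frames = sample_frames("input.mp4")  # placeholder path
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens across these frames."},
            *[
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames
            ],
        ],
    }],
)
print(response.choices[0].message.content)
```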

While GPT-4o’s understanding of the provided images is impressive, I’m hitting a latency bottleneck. In my current implementation, it takes over 10 seconds before GPT-4o begins streaming a completion when given just 10 JPEGs compressed to 10% quality. This latency is proving to be a hurdle for my use case, which requires quick responses.
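For reference, the 10+ seconds above is the time to the first streamed token, measured roughly like this (sketch only; it reuses `client` and the `frames` list from the snippet above):

```python
import time

start = time.monotonic()
stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in these frames?"},
            # note: the image_url object also accepts "detail": "low",
            # which reduces the number of image tokens per frame
            *[
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames
            ],
        ],
    }],
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # This is where the >10 s gap shows up for me.
        print(f"first token after {time.monotonic() - start:.1f}s")
        break
```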

In contrast, OpenAI’s recent demos of GPT-4o make it seem that it can analyze video streams and answer questions about them almost instantaneously (the video with Sal Khan and his son demonstrates this). This made me wonder:

  1. How did OpenAI achieve such remarkable speed in their demos? Are they employing any proprietary techniques or optimizations that are not available to external developers?

  2. Are there any advanced tricks or enhancements that I can implement to improve the response time of GPT-4o’s vision capabilities, beyond what is provided in the cookbook?

  3. Has anyone in the community successfully optimized GPT-4o’s vision speed to match or come close to the performance demonstrated in OpenAI’s demos? If so, I would be incredibly grateful for any insights or resources you could share.


They were most definitely hooked up directly to their servers to begin with. Huge latency improvement.

Secondly, it’s not known exactly how the video works. Keep in mind that it was a tech demo. It’s unproven magic until it lands in our hands, so trying to match it is setting yourself up for disappointment.

Even the audio capabilities available in the current application take a (small) amount of time, and they aren’t like their demo.

If I had to guess, it is taking a picture whenever requested; it’s not actually a video. They just know exactly how to use it, how to speak to it, and how it’s running in the backend.
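In other words, something like a single on-demand snapshot per question rather than a continuous video feed. A rough sketch of the idea (camera index and prompt are just illustrative):

```python
import base64

import cv2
from openai import OpenAI

client = OpenAI()


def snapshot_b64(camera_index=0):
    """Capture one frame from the camera at the moment the user asks something."""
    cam = cv2.VideoCapture(camera_index)
    ok, frame = cam.read()
    cam.release()
    if not ok:
        raise RuntimeError("could not read a frame from the camera")
    ok, buf = cv2.imencode(".jpg", frame)
    return base64.b64encode(buf.tobytes()).decode("utf-8")


response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see right now?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{snapshot_b64()}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```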

Honestly, I would just wait. It doesn’t make sense to try to build in parallel with them. Eventually it’ll come out, maybe with a new framework that facilitates it all.

Believe me, you do not want to be one of the thousands that have been bulldozed by OpenAI trying to build alongside them.

Have you considered trying this as a GPT?


I’m not sure what you specifically mean by “building in parallel,” but we are definitely not just replicating what is shown in the demos. Furthermore, I’m developing a domain-specific product that is integrated into our users’ workflows, so a GPT would not suit my organization’s needs. I agree, though, that there’s more going on behind the scenes in the demos than we are seeing. I’m hoping we can uncover how it was all done.

What I mean to say is that you are building something that attempts to imitate the tech demonstrated in the GPT-4o demo.

It could be (as has happened before) that this will become available to everyone, possibly alongside the GPT-4o app for phones.

Right, the Assistants API doesn’t even have vision capabilities yet, but it’s almost inevitable that it will.
