Optimizing GPT-4o's Vision Performance?

I’m developing an application that leverages the vision capabilities of the GPT-4o API, following the techniques outlined in the OpenAI Cookbook. My approach is to sample frames at regular intervals, convert them to base64, and provide them as image context for chat completions.
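Roughly, the pipeline looks like this (a minimal sketch assuming OpenCV for frame extraction and the v1 `openai` Python client; the helper name `sample_frames` and the file path are just placeholders):

```python
import base64

import cv2  # opencv-python, used here only for frame extraction
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def sample_frames(video_path, every_n_seconds=2, max_frames=10):
    """Grab one frame every N seconds and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(fps * every_n_seconds))
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # 10% JPEG quality, matching the compression described below
            ok, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), 10])
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
        idx += 1
    cap.release()
    return frames


frames = sample_frames("input.mp4")  # placeholder path
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens across these frames."},
            *[
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames
            ],
        ],
    }],
)
print(response.choices[0].message.content)
```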

While GPT-4o’s understanding of the provided images is impressive, I’m hitting a latency bottleneck. In my current implementation, it takes over 10 seconds before GPT-4o begins streaming a completion when given just 10 JPEGs compressed to 10% quality. This latency is proving to be a hurdle for my use case, which requires quick responses.
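For reference, the 10+ seconds above is the time to the first streamed token, measured roughly like this (sketch only; it reuses `client` and the `frames` list from the snippet above):

```python
import time

start = time.monotonic()
stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in these frames?"},
            # note: the image_url object also accepts "detail": "low",
            # which reduces the number of image tokens per frame
            *[
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames
            ],
        ],
    }],
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # This is where the >10 s gap shows up for me.
        print(f"first token after {time.monotonic() - start:.1f}s")
        break
```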

In contrast, OpenAI’s recent demos of GPT-4o make it seem that it can analyze video streams and answer questions about them almost instantaneously (the video with Sal Khan and his son demonstrates this). This made me wonder:

  1. How did OpenAI achieve such remarkable speed in their demos? Are they employing any proprietary techniques or optimizations that are not available to external developers?

  2. Are there any advanced tricks or enhancements that I can implement to improve the response time of GPT-4o’s vision capabilities, beyond what is provided in the cookbook?

  3. Has anyone in the community successfully optimized GPT-4o’s vision speed to match or come close to the performance demonstrated in OpenAI’s demos? If so, I would be incredibly grateful for any insights or resources you could share.


They were most definitely hooked up directly to their servers to begin with. Huge latency improvement.

Secondly, it’s not known exactly how the video works. Keep in mind that it was a tech demo. It’s unproven magic until it lands in our hands, so trying to match it is setting yourself up for disappointment.

Even the audio capabilities available in the current application take a (small) amount of time, and they aren’t like their demo.

If I had to guess, it is taking a picture whenever requested; it’s not actually a video. They just know exactly how to use it, how to speak to it, and how it’s running in the backend.
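In other words, something like a single on-demand snapshot per question rather than a continuous video feed. A rough sketch of the idea (camera index and prompt are just illustrative):

```python
import base64

import cv2
from openai import OpenAI

client = OpenAI()


def snapshot_b64(camera_index=0):
    """Capture one frame from the camera at the moment the user asks something."""
    cam = cv2.VideoCapture(camera_index)
    ok, frame = cam.read()
    cam.release()
    if not ok:
        raise RuntimeError("could not read a frame from the camera")
    ok, buf = cv2.imencode(".jpg", frame)
    return base64.b64encode(buf.tobytes()).decode("utf-8")


response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see right now?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{snapshot_b64()}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```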

Honestly, I would just wait. It doesn’t make sense to try to build in parallel with them. Eventually it’ll come out, maybe with a new framework that facilitates it all.

Believe me, you do not want to be one of the thousands that have been bulldozed by OpenAI trying to build alongside them.

Have you considered trying this as a GPT?


I’m not sure what you specifically mean by “building in parallel,” but we are definitely not just replicating what is shown in the demos. Furthermore, I’m developing a domain-specific product that is integrated into our users’ workflows, so a GPT would not suit my organization’s needs. I agree, though, that there’s more going on behind the scenes in the demos than we are seeing. I’m hoping we can uncover how it was all done.

What I mean to say is that you are building something that attempts to imitate the tech demonstrated in the GPT-4o demo.

It could be (as has happened before) that this will become available to everyone, possibly alongside the GPT-4o app for phones.

Right, the Assistants API doesn’t even have vision capabilities yet, but it’s almost inevitable that it will.
