How can we estimate cost when streaming with gpt-4-1106-vision-preview and 'auto' detail?

We’re using gpt-4-1106-vision-preview and streaming responses, and would like to be able to track cost in our analytics.

Streaming responses don’t seem to include token counts, so our only option is to estimate. For the response text we can estimate by tokenizing it ourselves, but for the image input there isn’t an easy way to do this.

We’re using the ‘auto’ detail option, which will pick either low or high resolution based on image size, but the docs don’t really explain how that decision is made.

The pricing calculator is very confusing: it shows that images seem to be resized to reduce the number of 512x512 tiles, but it’s not clear what the algorithm for this is.

Here are some examples:
10240x10240 → Resizes to 712x712 → 4 tiles
1024x1024 → Resizes to 712x712 → 4 tiles (why is this even resizing if it doesn’t reduce tiles???)
1024x10240 → Resizes to 205x2048 → 4 tiles
1111x513 → No resize → 6 tiles
111x51 → No resize → 1 tile
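For what it’s worth, the high-detail algorithm described in OpenAI’s vision guide (downscale to fit within a 2048x2048 square, then cap the shortest side at 768px, then count 512x512 tiles at 170 tokens each plus an 85-token base) seems to reproduce all of the tile counts above, even though the exact resized dimensions differ slightly from what the calculator shows (768 vs. 712). A rough sketch, not an official implementation:

```python
import math

def resize_for_high_detail(width, height):
    # Step 1: downscale (never upscale) to fit within a 2048x2048 square
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Step 2: downscale so the shortest side is at most 768px
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    return width, height

def vision_tokens(width, height, detail="high"):
    """Estimate image input tokens: low detail is a flat 85 tokens;
    high detail is 85 base + 170 per 512x512 tile after resizing."""
    if detail == "low":
        return 85
    w, h = resize_for_high_detail(width, height)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

Under this model, 10240x10240 and 1024x1024 both end up at 768x768 (4 tiles), 1024x10240 becomes roughly 205x2048 (4 tiles), and 1111x513 and 111x51 aren’t resized (6 and 1 tiles). It would also explain why 1024x1024 gets resized even though the tile count doesn’t change: the shortest side is capped at 768 regardless.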

Does anyone know how this is calculated, and how it decides whether or not to resize?


@jason.banich, were you able to figure out cost estimation for your vision streaming use case? We’re trying to do something similar now and looking for a good solution. Any updates would be helpful.