I tried the OpenAI o3 API on some VQA tasks, and the number of reasoning tokens is always below 1k. The performance is also much worse than on the ChatGPT web page.
Does the API model really think with images the way it does on the ChatGPT website?