How does image OCR actually work in GPT-4?

I asked GPT-4 to process my receipt image and return the result in JSON format, and it works pretty accurately. It also provides a "View analysis" section containing Python code that uses pytesseract to generate the result. However, when I ran the provided code locally, the accuracy was much worse.

This brings me to my question: how does ChatGPT actually work here? My assumption was that the model reads the prompt and then generates Python code, which is then run to produce the answer. This also seems to match what's described in this official doc.

However, based on my local run, this assumption doesn't seem to hold, which really confuses me. Can anyone confirm how this works?

When you use the vision model, it is not actually running the Python code it shows you. It's running proprietary machine learning models that we cannot access without using OpenAI's services.

Its explanation of the code is a hallucination. It does not actually run that code unless you explicitly ask it to perform the OCR in Code Interpreter.
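In other words, to reproduce what ChatGPT does, you send the image itself to the vision model rather than running pytesseract. A minimal sketch of such a request is below, using the Chat Completions message format with a base64-encoded image; the model name `gpt-4o` and the exact payload shape are assumptions based on current API docs and may change, so check the official API reference before relying on it:

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str,
                         model: str = "gpt-4o") -> dict:
    """Build a Chat Completions request body that sends an image to a
    vision-capable model directly (what ChatGPT actually does), instead
    of running local OCR code. Model name is an assumption; substitute
    whatever vision-capable model you have access to."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # The text part carries your instruction...
                    {"type": "text", "text": prompt},
                    # ...and the image part carries the receipt itself,
                    # inlined as a data URL.
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }

if __name__ == "__main__":
    # Hypothetical placeholder bytes; replace with your actual receipt file,
    # e.g. open("receipt.png", "rb").read(), then POST the body to
    # the chat completions endpoint with your API key.
    body = build_vision_request(b"\x89PNG...", "Extract this receipt as JSON.")
    print(json.dumps(body)[:120])
```

The accuracy difference you saw is expected: the vision model reads the image end to end, while pytesseract is a separate open-source OCR engine with very different behavior on noisy receipt photos.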