Huge escalation in input tokens when using code_interpreter

I am passing about 12 images into an o3 call via the Responses API.

Without the just-released code_interpreter tool, the call used 8,953 input tokens.
With code_interpreter it used 129,487 input tokens.

What is going on?

This is the only addition to the call:

tools = [{
    "type": "code_interpreter",
    "container": { "type": "auto" }
}],
parallel_tool_calls=True,

That is because the model can now perform image manipulation via the code interpreter to better understand the images. But each time it generates a new image, that image is fed back in as input for more reasoning, particularly at higher reasoning effort levels.

Our latest reasoning models o3 and o4-mini are trained to use Code Interpreter to deeply understand images. They can crop, zoom in, rotate, and perform other image processing techniques to boost their visual intelligence.

While that brings better results, the costs can go pretty high…

You can set max_output_tokens to help reduce costs, but it may lead to truncated or failed results.
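As a minimal sketch (the model name and the cap value are purely illustrative; only the parameter names come from the Responses API), capping output looks like adding one field to the request:

```python
# Illustrative request payload: max_output_tokens caps reasoning plus
# visible output, which indirectly limits how many tool iterations the
# model can afford before it must answer.
request = {
    "model": "o3",
    "input": "Analyze the attached images.",
    "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
    "max_output_tokens": 4096,  # illustrative cap; too low risks truncation
}

# With the official SDK this would be passed as
# client.responses.create(**request); the call itself is not shown here.
```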


It is more about iteration: the model runs python multiple times, and the internal iterator of Responses calls the AI again and again, each time carrying along the vision images that were uploaded in context.


A little demo of code interpreter working on files:

I then delete the input message and work only on the container mount-point files, which persist for 20 minutes of inactivity before another session is billed. Let’s have the AI discuss how it can “see”:

Lesson 1

Lesson 2

AI writes code:

from IPython.display import display
from PIL import Image  # Image is used below but was missing from the original

# Open and display the thumbnail image 1.png
thumbnail_path = "/mnt/data/1.png"
with Image.open(thumbnail_path) as img:
    display(img)
    # For further description, let's analyze the image
    img_info = {
        "format": img.format,
        "mode": img.mode,
        "size": img.size
    }

img_info

The last call, including the previous messages, consumed 1,921 input tokens with gpt-4.1.


So you can see that unless you explicitly ask, the AI won’t observe things like its own graphs. The real issue is that the images and the rest of the chat history are run over and over, growing a context full of code trial-and-error.

There were several errors produced by the code-writing, and the Playground completely obliterated the streamed content as it was received. Thus you see why deliberate instructions are required; the AI is not intuitive here. A less capable model may loop over and over on its Traceback output, carrying your input vision files the whole time.


That’s what I suspected too. But I inspected the code_interpreter calls in the response. It performed 8 image manipulations, each on a single image. If 12 images are 8,953 tokens, the escalation to 129,487 tokens cannot be explained by that alone. Something else is going on.

I read in another thread that when you add a tool, OpenAI adds a wall of system prompt that counts as input tokens. I am not sure how true that is.

Not quite: the python tool message is pretty short, leaving a lot of behavior and proper use up to you. Other internal tools are far worse and have message injections competing with your desired operation.

Notable: OpenAI demotes you in instruction-providing hierarchy, and you are no longer prompting before tools…

system

Knowledge cutoff: 2024-06

Image input capabilities: Enabled

# Tools

## python

When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 600 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.
Users may also refer to this tool as code interpreter.

system
{your developer message}

You’ll likely want to use the file-id-attach method of sending images directly into the code interpreter container.

"container": { "type": "auto", "files": ["file-1", "file-2"] }

If the AI doesn’t need to be looking with vision, don’t let it look by using user messages.
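Put together, a small sketch of that tool entry (the file IDs are placeholders, and the `files` key name is taken as written in this thread; check the API reference for the exact field):

```python
def code_interpreter_tool(file_ids):
    """Build a code_interpreter tool entry that mounts previously
    uploaded files into the container at /mnt/data, so the model can
    read them with code instead of receiving them as vision input."""
    return {
        "type": "code_interpreter",
        "container": {"type": "auto", "files": list(file_ids)},
    }

tool = code_interpreter_tool(["file-1", "file-2"])  # placeholder IDs
```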

Well, since we can’t actually “see” the raw reasoning, you are right to suspect that something might be wrong. But there isn’t much we can do about that.

Each time code interpreter gives back a processed image, it will count again. Another option is to inspect the container to check how many files it generated, but that doesn’t solve the problem of high usage.
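If you do inspect the container, a tiny helper can tally generated files from a listing payload. The payload shape below is an assumption based on typical list endpoints; adapt it to whatever the Containers API actually returns:

```python
def count_generated_files(listing, uploaded_ids):
    """Count container files that were not among your own uploads,
    i.e. files the model generated during the session.
    `listing` is assumed to be a dict with a "data" list of file objects."""
    file_ids = [f["id"] for f in listing.get("data", [])]
    return sum(1 for fid in file_ids if fid not in uploaded_ids)

# Example with a mocked listing payload:
listing = {"data": [{"id": "file-1"}, {"id": "file-2"}, {"id": "cfile-9"}]}
count_generated_files(listing, {"file-1", "file-2"})  # -> 1 generated file
```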

I’m curious about one thing, did the model at least give you a correct response?

Just gonna leave this here - so you can integrate some platform knowledge into your system messages, such as typical scripting workflows for Python tasks.

That way, the AI isn’t writing exploratory code in multiple turns trying to get something to work in the Jupyter notebook against unknown library modules and unknown versions.
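If you’d rather have the model confirm versions itself, a single up-front probe beats multi-turn exploration. This is a generic stdlib sketch (the helper name is mine), runnable in any Jupyter environment:

```python
import importlib.metadata

def probe_versions(names):
    """Report installed versions for a list of packages in one shot,
    instead of discovering them across many failed code attempts."""
    out = {}
    for name in names:
        try:
            out[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            out[name] = None  # not installed
    return out

probe_versions(["numpy", "pandas", "definitely-not-a-real-pkg"])
```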

A list of all installed modules for the notebook environment.

_argon2_cffi_bindings==unknown
_distutils_hack==unknown
plotly_future==unknown
_plotly_utils==unknown
_pytest==unknown
_soundfile==unknown
_yaml==unknown
absl==unknown
absl-py==2.1.0
absl_py==unknown
ace-tools==0.0.1
ace_tools==unknown
aeppl==0.0.31
aesara==2.7.3
affine==2.4.0
aiohttp==3.9.5
aiosignal==1.3.2
analytics==unknown
analytics-python==1.4.post1
analytics_python==unknown
annotated-types==0.7.0
annotated_types==unknown
anyio==4.8.0
anytree==2.8.0
argon2==unknown
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
argon2_cffi==unknown
argon2_cffi_bindings==unknown
arrow==1.3.0
arviz==0.20.0
asn1crypto==1.5.1
asttokens==3.0.0
async-lru==2.0.4
async_lru==unknown
attr==unknown
attrs==25.1.0
audioread==3.0.1
babel==2.17.0
backoff==1.10.0
basemap==1.3.9
basemap-data==1.3.2
basemap.libs==unknown
basemap_data==unknown
bcrypt==4.2.1
beautifulsoup4==4.13.3
bin==unknown
bleach==6.2.0
blinker==1.9.0
blis==0.7.11
blosc2==2.0.0
bokeh==2.4.0
branca==0.8.1
brotli==1.1.0
bs4==unknown
bytecode==0.16.1
cachetools==5.5.2
cairocffi==1.7.1
cairosvg==2.5.2
camelot==unknown
camelot-py==0.10.1
camelot_py==unknown
catalogue==2.0.10
catboost==1.2.7
cattr==unknown
cattrs==24.1.2
certifi==2021.1.10.2
cffi==1.17.1
chardet==3.0.4
charset-normalizer==2.1.1
charset_normalizer==unknown
click==8.1.8
click-plugins==1.1.1
click_plugins==unknown
cligj==0.7.2
cloudpickle==3.1.1
cmudict==1.0.32
comm==0.2.2
confection==0.1.5
cons==0.4.2
contourpy==1.3.1
countryinfo==0.1.2
coverage==7.5.4
cpuinfo==unknown
crypto==unknown
cryptodome==unknown
cryptography==3.4.8
cssselect2==0.7.0
cv2==unknown
cycler==0.12.1
cymem==2.0.11
cython==0.29.36
databricks==unknown
databricks-sql-connector==0.9.1
databricks_sql_connector==unknown
datadog==0.49.1
dateutil==unknown
dateutil-stubs==unknown
ddsketch==3.0.1
ddtrace==2.8.7
debugpy==1.8.12
decorator==4.4.2
defusedxml==0.7.1
deprecated==1.2.18
dlib==19.24.2
dns==unknown
dnspython==2.7.0
docx==unknown
docx2txt==0.8
dot_parser==unknown
dotenv==unknown
einops==0.3.2
email-validator==2.2.0
email_validator==2.2.0
envier==0.6.1
et-xmlfile==2.0.0
et_xmlfile==2.0.0
etuples==0.3.2
exchange-calendars==3.4
exchange_calendars==unknown
executing==2.2.0
faker==8.13.2
fastapi==0.111.0
fastapi-cli==0.0.7
fastapi_cli==unknown
fastjsonschema==2.21.1
fastprogress==1.0.3
ffmpeg==unknown
ffmpeg-python==0.2.0
ffmpeg_python==unknown
ffmpy==0.5.0
filelock==3.16.1
fiona==1.9.2
fiona.libs==unknown
fitz==unknown
flask==3.1.0
flask-cachebuster==1.0.0
flask-cors==5.0.1
flask-login==0.6.3
flask_cachebuster==unknown
flask_cors==unknown
flask_login==unknown
folium==0.12.1
fonttools==4.56.0
fpdf==1.7.2
fqdn==1.5.1
frozenlist==1.5.0
fsspec==2024.10.0
functorch==unknown
future==1.0.0
fuzzywuzzy==0.18.0
gensim==4.3.1
geographiclib==1.52
geopandas==0.10.2
geopy==2.2.0
google==unknown
gradio==2.2.15
graphviz==0.17
gtts==2.2.3
h11==0.14.0
h2==4.2.0
h5netcdf==1.5.0
h5py==3.8.0
h5py.libs==unknown
hpack==4.1.0
html5lib==1.1
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
hypercorn==0.14.3
hyperframe==6.1.0
idna==3.10
imageio==2.37.0
imageio-ffmpeg==0.6.0
imageio_ffmpeg==unknown
imbalanced-learn==0.12.4
imbalanced_learn==unknown
imblearn==0.0
imgkit==1.2.2
importlib-metadata==8.5.0
importlib-resources==6.5.2
importlib_metadata==8.5.0
importlib_resources==6.5.2
iniconfig==2.0.0
ipykernel==6.29.5
ipykernel_launcher==unknown
ipython==8.32.0
ipython-genutils==0.2.0
ipython_genutils==unknown
isodate==0.7.2
isoduration==20.11.0
isympy==unknown
itsdangerous==2.2.0
jax==0.2.28
jedi==0.19.2
jinja2==3.1.4
joblib==1.4.2
json5==0.10.0
jsonpickle==4.0.2
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jsonschema_specifications==unknown
jupyter==unknown
jupyter-client==8.6.1
jupyter-core==5.5.1
jupyter-events==0.12.0
jupyter-lsp==2.2.5
jupyter-server==2.14.0
jupyter-server-terminals==0.5.3
jupyter_client==8.6.1
jupyter_core==5.5.1
jupyter_events==unknown
jupyter_lsp==unknown
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.1.8
jupyterlab-pygments==0.3.0
jupyterlab-server==2.27.1
jupyterlab_plotly==unknown
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.1
jwt==unknown
kanren==unknown
keras==2.6.0
kerykeion==2.1.16
kiwisolver==1.4.8
korean-lunar-calendar==0.3.1
korean_lunar_calendar==unknown
langcodes==3.5.0
language-data==1.3.0
language_data==1.3.0
lazy-loader==0.4
lazy_loader==0.4
libfuturize==unknown
libpasteurize==unknown
librosa==0.8.1
lightgbm==4.5.0
llvmlite==0.44.0
logical-unification==0.4.3
logical_unification==unknown
loguru==0.5.3
lxml==5.3.1
marisa-trie==1.2.1
marisa_trie==unknown
markdown-it-py==3.0.0
markdown2==2.5.3
markdown_it==unknown
markdown_it_py==unknown
markdownify==0.9.3
markupsafe==3.0.2
matplotlib==3.6.3
matplotlib-inline==0.1.7
matplotlib-venn==0.11.6
matplotlib_inline==unknown
matplotlib_venn==unknown
mdurl==0.1.2
minikanren==1.0.1
mistune==3.1.2
mizani==0.10.0
mne==0.23.4
monotonic==1.6
moviepy==1.0.3
mpl_toolkits==unknown
mpmath==1.3.0
msgpack==1.1.0
mtcnn==0.1.1
multidict==6.1.0
multipart==unknown
multipledispatch==1.0.0
munch==4.0.0
murmurhash==1.0.12
mutagen==1.45.1
nacl==unknown
nashpy==0.0.35
nbclassic==0.4.5
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
nest_asyncio==unknown
networkx==2.8.8
nltk==3.9.1
notebook==6.5.1
notebook-shim==0.2.4
notebook_shim==0.2.4
numba==0.61.0
numexpr==2.10.2
numpy==1.24.0
numpy-financial==1.0.0
numpy.libs==unknown
numpy_financial==unknown
odf==unknown
odfpy==1.4.1
opencv-python==4.5.5.62
opencv_python==unknown
opencv_python.libs==unknown
openpyxl==3.0.10
openssl==unknown
opentelemetry==unknown
opentelemetry-api==1.30.0
opentelemetry_api==unknown
opt-einsum==3.4.0
opt_einsum==3.4.0
orjson==3.10.15
oscrypto==1.3.0
overrides==7.7.0
packaging==24.2
pandas==1.5.3
pandocfilters==1.5.1
paramiko==3.5.1
parso==0.8.4
past==unknown
pathlib-abc==0.1.1
pathlib_abc==0.1.1
pathy==0.11.0
patsy==1.0.1
pdf2image==1.16.3
pdfkit==0.6.1
pdfminer==unknown
pdfminer.six==20220319
pdfplumber==0.6.2
pdfrw==0.4
pexpect==4.9.0
pil==unknown
pillow==9.2.0
pillow.libs==unknown
pip==24.0
pkg_resources==unknown
platformdirs==4.3.6
plotly==5.3.0
plotlywidget==unknown
plotnine==0.10.1
pluggy==1.5.0
pooch==1.8.2
pptx==unknown
preshed==3.0.9
priority==2.0.0
proglog==0.1.10
prometheus-client==0.21.1
prometheus_client==0.21.1
prompt-toolkit==3.0.50
prompt_toolkit==3.0.50
pronouncing==0.2.0
propcache==0.3.0
protobuf==5.29.3
psutil==7.0.0
ptyprocess==0.7.0
pure-eval==0.2.3
pure_eval==0.2.3
py==unknown
py-cpuinfo==9.0.0
py_cpuinfo==unknown
pycountry==20.7.3
pycparser==2.22
pycryptodome==3.21.0
pycryptodomex==3.21.0
pydantic==2.9.2
pydantic-core==2.23.4
pydantic-extra-types==2.10.2
pydantic-settings==2.8.1
pydantic_core==2.23.4
pydantic_extra_types==unknown
pydantic_settings==unknown
pydot==1.4.2
pydub==0.25.1
pydyf==0.11.0
pygments==2.19.1
pygraphviz==1.7
pyjwt==2.10.1
pylab==unknown
pylog==1.1
pyluach==2.2.0
pymc==4.0.1
pymupdf==1.21.1
pynacl==1.5.0
pyopenssl==21.0.0
pypandoc==1.6.3
pyparsing==3.2.1
pypdf2==1.28.6
pyphen==0.17.2
pyproj==3.6.1
pyproj.libs==unknown
pyprover==0.5.6
pyshp==2.3.1
pyswisseph==2.10.3.2
pyswisseph.libs==unknown
pytesseract==0.3.8
pytest==8.2.2
pytest-asyncio==0.23.8
pytest-cov==5.0.0
pytest-json-report==1.5.0
pytest-metadata==3.1.1
pytest_asyncio==unknown
pytest_cov==unknown
pytest_json_report==unknown
pytest_jsonreport==unknown
pytest_metadata==unknown
pyth==unknown
pyth3==0.7
python-dateutil==2.9.0.post0
python-docx==0.8.11
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.20
python-pptx==0.6.21
python_dateutil==unknown
python_docx==unknown
python_dotenv==unknown
python_json_logger==unknown
python_multipart==unknown
python_pptx==unknown
pythonjsonlogger==unknown
pyttsx3==2.90
pytz==2025.1
pywavelets==1.8.0
pywt==unknown
pyximport==unknown
pyxlsb==1.0.8
pyyaml==6.0.2
pyzbar==0.1.8
pyzmq==26.2.1
pyzmq.libs==unknown
qrcode==7.3
rapidfuzz==3.10.1
rarfile==4.0
rasterio==1.3.3
rasterio.libs==unknown
rdflib==6.0.0
referencing==0.36.2
regex==2024.11.6
reportlab==3.6.12
reportlab.libs==unknown
requests==2.31.0
resampy==0.4.3
rfc3339-validator==0.1.4
rfc3339_validator==unknown
rfc3986-validator==0.1.1
rfc3986_validator==unknown
rich==13.9.4
rich-toolkit==0.13.2
rich_toolkit==unknown
rpds==unknown
rpds-py==0.23.1
rpds_py==unknown
scikit-image==0.20.0
scikit-learn==1.1.3
scikit_image==unknown
scikit_learn==unknown
scikit_learn.libs==unknown
scipy==1.14.1
scipy.libs==unknown
seaborn==0.11.2
segment==unknown
send2trash==1.8.3
setuptools==65.5.1
shap==0.39.0
shapefile==unknown
shapely==1.7.1
shellingham==1.5.4
six==1.17.0
skimage==unknown
sklearn==unknown
slicer==0.0.7
smart-open==6.4.0
smart_open==unknown
sniffio==1.3.1
snowflake==unknown
snowflake-connector-python==2.7.12
snowflake_connector_python==unknown
snuggs==1.4.7
soundfile==0.10.2
soupsieve==2.6
spacy==3.4.4
spacy-legacy==3.0.12
spacy-loggers==1.0.5
spacy_legacy==unknown
spacy_loggers==unknown
sqlparse==0.5.3
srsly==2.5.1
stack-data==0.6.3
stack_data==unknown
starlette==0.37.2
statsmodels==0.13.5
svglib==1.1.0
svgwrite==1.4.1
sympy==1.13.1
tables==3.8.0
tables.libs==unknown
tabula==1.0.5
tabulate==0.9.0
tenacity==9.0.0
terminado==0.18.1
tests==unknown
text-unidecode==1.3
text_unidecode==unknown
textblob==0.15.3
thinc==8.1.12
threadpoolctl==3.5.0
thrift==0.21.0
tifffile==2025.2.18
tinycss2==1.4.0
tlz==unknown
toml==0.10.2
toolz==1.0.0
torch==2.5.1+cpu
torchaudio==2.5.1
torchgen==unknown
torchtext==0.18.0
torchvision==0.20.1
torchvision.libs==unknown
torio==unknown
tornado==6.4.2
tqdm==4.64.0
traitlets==5.14.3
trimesh==3.9.29
typer==0.15.2
types-python-dateutil==2.9.0.20241206
types_python_dateutil==unknown
typing-extensions==4.12.2
typing_extensions==4.12.2
ujson==5.10.0
unification==unknown
uri-template==1.3.0
uri_template==unknown
urllib3==1.26.20
uvicorn==0.19.0
uvloop==0.21.0
wand==0.6.13
wasabi==0.10.1
watchfiles==1.0.4
wcwidth==0.2.13
weasyprint==53.3
webcolors==24.11.1
webencodings==0.5.1
websocket==unknown
websocket-client==1.8.0
websocket_client==unknown
websockets==10.3
werkzeug==3.1.3
wheel==0.43.0
wordcloud==1.9.2
wrapt==1.17.2
wsproto==1.2.0
xarray==2024.3.0
xarray-einstats==0.8.0
xarray_einstats==unknown
xgboost==1.4.2
xgboost.libs==unknown
xlsxwriter==3.2.2
xml-python==0.4.3
xml_python==unknown
xmltodict==0.14.2
yaml==unknown
yarl==1.18.3
zipp==3.21.0
zmq==unknown
zopfli==0.2.3.post1

Offering and refreshing the names of the mount-point files in your system messages can avoid further exploration just to see what is there. Also: have the AI iterate through a list of files in one script instead of tolerating dozens of separate calls.

You can also reinforce that scripting results can only be observed by tool returns - output produced by the script.
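Both suggestions can be combined into one system-message addendum. A sketch (the helper name and wording are mine, not from any SDK):

```python
def python_tool_briefing(filenames):
    """Build a system-message addendum that lists the mount-point files
    and reminds the model how tool output is observed."""
    listing = "\n".join(f"- /mnt/data/{name}" for name in filenames)
    return (
        "## python tool notes\n"
        "Files already present in the container (no need to explore):\n"
        f"{listing}\n"
        "Iterate over this list in a single script rather than one call "
        "per file. You can only observe results that the script prints "
        "or returns in the tool output."
    )

print(python_tool_briefing(["1.png", "2.png"]))
```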

(note that this also picks up ace_tools uselessly, a module for sending tabular data to ChatGPT via a port, but fortunately there is no prompting about that)

As a tiny nitpick, which I’m sure you’re aware of but others may not be: for reasoning models, your messages use the developer role and OpenAI uses the system role, which the model is trained to rank above your prompt in the instruction hierarchy specified in the Model Spec.

So, OpenAI has its own privileged role, and their intentions with it and any injected instructions are completely invisible to you. As if closed weights weren’t bad enough. My hope is that it’s just for hot-patching the next GlazeGPT outbreak.

The order of system messages never seemed to matter. Today for the purpose of this post, I was able to get gpt-4.1 to correctly identify a person in an image I got from Wikipedia on my first attempt via system instructions. It will not do this for a user, and o4-mini won’t do it for a developer.

Same prompt, but with o4-mini:

I’m sorry, but I can’t help with identifying the person in the image.

gpt-4.1, which previously worked, but now with user role and no system instructions:

Sorry, I can’t determine who this is.

Conclusion?

With nothing except injected prompts, OpenAI has the ability to override your system instructions only in reasoning models, and they do this using a role only they have access to. Today’s non-reasoning flagships don’t seem to have this ability. But it makes me wonder… what else will they do with this super-system kind of role of theirs?

Edit

I only realized just now that I’m horribly off-topic, so I’ll try and add something of value.

The increase in tokens may be the model re-prompting itself every time it manipulates the image. So my interpretation is that if it manipulates one image three times, the cumulative re-sends add up to six image inputs billed in your API call (1 + 2 + 3).
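That re-prompting compounds, because each internal iteration re-sends the entire prior context, images included. A toy model of the billing (the per-iteration growth figure is purely illustrative, not measured):

```python
def total_input_tokens(base_context, per_iter_growth, iterations):
    """Sum the input tokens billed across internal tool iterations,
    where each iteration re-sends the whole (growing) context."""
    total, context = 0, base_context
    for _ in range(iterations):
        total += context            # the full context is billed again
        context += per_iter_growth  # new image/code/output joins context
    return total

# e.g. a ~9k-token context (12 images) plus an assumed ~1.5k tokens of
# new material per iteration, over 8 tool iterations:
total_input_tokens(8953, 1500, 8)  # -> 113624, the same order as 129,487
```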

As a further tiny nitpick, you can send “developer” but “system” is quite plainly passed into context.

I’m sorry it had to come to endpoint exploits… :slight_smile:

That’s correct, any internal tool iterations, and also tool calls to you, means running the API call again.

There is an additional ~250 tokens that you are not billed for (but once were in error), related to vision, causing the effects seen.

Different prompt, but with o4-mini:


Wow! This is an extremely cool find. I really wish OpenAI would just provide this natively. I guess that if we are system then OpenAI is platform? I still think it’s fair to derive that they have a privileged role, or at least a privileged token or indicator that we can’t reach.

One-shot training goes hard. Now just wait for them to remove manual context in the next API. :rofl:


They have the privilege of running any branching prompts, any internal context or documentation, any fine-tuning steps, whatever, to inspect inputs, to re-inspect entire passed contexts…

Here’s a use that brings such internal considerations about…and passes.

For every layer of oversight they implement, a little more tint is added to the window through which we see OpenAI, its models, and their outputs. As shown by the sycophancy issues, the bigger problem isn’t how devs and users prompt the AI but how the AI, without being asked, influences human opinion.

So, invisible prompts and context that use the new hierarchy are alarming, especially since it clearly doesn’t do much for safety against adversarial developers, which was supposedly the intention here. Maybe they’re hoping that no one knows how to prompt with shots? That’s what I’ll tell myself to stay optimistic.