Hey everyone! I’ve been doing some testing on a prototype I’m building with gpt-4o, and I’m a little puzzled. In some of the model’s responses, it’s doing some pretty impressive number crunching (for an LLM). E.g., it correctly calculated e^(-1/3) to five decimal places. I’d love an explanation for how it’s doing this. Here are my best hypotheses right now:
1. The model has somehow memorized a whole bunch of common expressions like the one I’ve shared,
2. more advanced arithmetic capabilities have somehow been “baked” into the latest version of gpt-4o (unclear how), or
3. the model is making a secret tool call, e.g. to an environment that can execute Python, to perform the calculation.
Does anyone have insight into which of these is the most likely—and more generally, how the model is doing this?
I’m asking partly because, if (3) is true, is it even necessary for developers to implement their own execute-Python tool anymore (the usual workaround for LLMs’ generally poor arithmetic abilities)?
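For context on what I mean by an execute-Python tool: a minimal sketch of the workaround, assuming you register a function via the `tools` parameter and evaluate the expression yourself on the client side. The tool name `evaluate_expression` and its schema are just illustrative, not anything official; and rather than a raw `eval()`, this uses a restricted AST walk so the model can only request arithmetic:

```python
import ast
import math
import operator

# Hypothetical tool schema a developer might pass in the "tools" array
# of a chat completions request (names here are illustrative).
CALCULATOR_TOOL = {
    "type": "function",
    "function": {
        "name": "evaluate_expression",
        "description": "Evaluate an arithmetic expression and return the result.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "e.g. 'exp(-1/3)'"},
            },
            "required": ["expression"],
        },
    },
}

# Restricted evaluator: only these operators and math functions are allowed,
# so a tool call can't execute arbitrary code.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}
_FUNCS = {"exp": math.exp, "log": math.log, "sqrt": math.sqrt}

def evaluate_expression(expression: str) -> float:
    """Safely evaluate a whitelisted arithmetic expression string."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS):
            return _FUNCS[node.func.id](*[_eval(a) for a in node.args])
        raise ValueError(f"Disallowed syntax: {ast.dump(node)}")
    return _eval(ast.parse(expression, mode="eval"))

# The expression from my original post, formatted to five decimal places:
print(f"{evaluate_expression('exp(-1/3)'):.5f}")
```

When the model emits a `tool_calls` entry for this function, you’d run `evaluate_expression` on the argument and send the result back as a tool message. If (3) were true, all of this plumbing would be redundant.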
I recall when GPT-3 was first launched. Back then, I reported a few incorrect answers—mostly calculation errors—to OpenAI, and they responded impressively fast, within about 10 minutes, and fixed the issues. This makes me suspect there’s an efficient method to quickly update or hardcode corrections into the system’s knowledge base, possibly through something like a vector store.
The system also appears capable of leveraging a Python interpreter to handle calculations, which likely helps ensure accuracy in mathematical operations.
I’m definitely aware of this in the ChatGPT product! The Python interpreter appears to be just one of several tools it has, along with web search, image generation, etc. What I’m wondering is whether the Python interpreter has somehow also been slipped into the API (a quick note: I’m referring to the raw API here, not the Assistants API). That would be very surprising to me, because as an API consumer, I like to think I have full control over the tools parameter, the system message, etc., so it would be weird if OpenAI appended a new element to the tools array without telling me (and without charging for those tokens). And yet, the math abilities are hard to explain otherwise. Definitely curious for any other insights here!
Thinking some more, I feel like the answer has to be (1). The latest release of gpt-4o is probably just trained/fine-tuned on a healthy-sized dataset of math questions.
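One way to test hypothesis (1) vs. (2) is that memorized constants like e^(-1/3) shouldn’t generalize to arbitrary inputs. A quick probe, as a sketch: generate random expressions, compute reference answers locally, and compare them against the model’s replies by hand. No API call is made here; the prompts and reference values are just printed. The prompt wording and the 1.5–9.5 range are arbitrary choices of mine:

```python
import math
import random

random.seed(0)  # reproducible probe set

def make_probe() -> tuple[str, str]:
    """Return a (prompt, reference_answer) pair for a random expression."""
    a = round(random.uniform(1.5, 9.5), 3)
    b = round(random.uniform(1.5, 9.5), 3)
    prompt = f"Compute {a} * ln({b}) to five decimal places."
    reference = f"{a * math.log(b):.5f}"  # local ground truth via math.log
    return prompt, reference

# Print a few probes; paste the prompts into the model and compare its
# answers against the reference values.
for _ in range(3):
    prompt, reference = make_probe()
    print(prompt, "->", reference)
```

If the model nails random expressions like these to five decimal places, memorization alone seems unlikely and (2) or (3) looks more plausible; if it only gets “famous” constants right, that supports (1).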