Has the mathematics ability of ChatGPT recently been upgraded?

I’m wondering if there has been a recent upgrade in ChatGPT’s math ability and, if so, what was done. More specifically, I’m wondering if there are any details about how it does math? Maybe a blog post or some public documentation?

I keep seeing claims that it is terrible at math, and it definitely used to be, but something seems different now: it appears capable of most textbook undergraduate-level math, even proofs. Of course it still makes mistakes, but it is behaving very differently than only a few months ago.

I’m wondering if it isn’t just an LLM but also employs other types of algorithms. E.g., surely it has access to a basic numerical calculator, yes? Or does it actually add and subtract numbers purely using a text-prediction algorithm?

Any details or links are appreciated.


Welcome to the forum!

No.

LLMs can be asked to create software/code that does the calculations, but on their own they only simulate doing the math. Years ago, before Chain of Thought, math answers were more likely to be correct if the model happened to be trained on simple expressions such as adding two two-digit numbers, but beyond that it became increasingly wrong.

When more users started using Chain of Thought, the answers seemed to improve, yet again there were limitations: just going up to six-digit numbers might produce errors.
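To make the difference concrete: the arithmetic that trips up pure next-token prediction is trivial once the model delegates it to generated code. A toy illustration (my own, not actual model output):

```python
# The kind of one-liner an LLM can emit instead of "predicting" the digits;
# Python evaluates the sum exactly, with no token-by-token guessing.
print(348_217 + 659_884)  # 1008101
```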

Lately, some models decide on their own to switch to a reasoning mode when needed, which could account for why you are seeing better results.

One of the best places to find such information is in the system card for an LLM, e.g.

GPT-4o System Card

Each model will have a different system card, but they may not all be published. You will have to search for them, as there is no one place to find them all.

Also look for prompting guidelines if they exist, e.g.

GPT-4.1 Prompting Guide

So you are saying that it definitely does not have access to computing with Python at all, nor to Python’s symbolic engine? It says it does.

It just did a very complicated integral that Grok could not. The latter was like GPT-3 nonsense. WolframAlpha could only do a numerical approximation. ChatGPT even gave a perfect step-by-step explanation using identities, etc.

Does the ChatBot (ChatGPT) say that, or is that from documentation given by OpenAI?

If the ChatBot is saying that then welcome to the world of hallucinations.

Odds are that is chain of thought or reasoning in action.

Care to share the conversation so we can see the details?

In your ChatGPT settings, you can enable the option “Always show code when using data analyst” to make it clearer when it is using the Python code interpreter. Otherwise, it may be subtly hidden under some tiny “click to see…” link, either in the conversation or in the thinking process.

You can also refer to it as the “python tool” to explicitly tell it to use the code interpreter to solve your question.

Like: “please use the python tool to calculate the possible combinations for the numbers 1,2,3 in a 5 digit sequence”.
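For what it’s worth, here is a sketch of the kind of code the python tool might run for that prompt (my own illustration, reading “combinations” as length-5 sequences over {1, 2, 3}; it is not necessarily what ChatGPT actually executes):

```python
# Count the length-5 sequences drawn from the digits 1, 2, 3.
# With repetition allowed, there are 3**5 = 243 of them.
from itertools import product

sequences = list(product([1, 2, 3], repeat=5))
print(len(sequences))  # 243
print(sequences[:3])   # [(1, 1, 1, 1, 1), (1, 1, 1, 1, 2), (1, 1, 1, 1, 3)]
```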

If you used a reasoning model, you need to click on the “Thought for 24 seconds…” to see the reasoning.

The reason for asking for a share of the conversation is to see what you are seeing.

Without that I can only guess at what it is doing, which model is used, which options are set, etc.

I did check the system cards that I could find; they note that the models have improved in math and reasoning, but without more specifics, such as whether there is an integrated calculator.

If you have Codex, which I do not, then as noted:

each task runs in its own cloud sandbox environment

I’m not allowed to post links, it seems. Here is a stitched-together screenshot of the chat:

I pasted it together in photo software, so that’s why it might look a bit chopped up.

It wasn’t a truly hard integral, as standard tricks worked, but it was highly nontrivial. I’m just really curious about this. I don’t really see how text prediction alone could be sufficient. However, I know little about AI algorithms. I am a probability theory researcher, though, so I can understand the ideas to some degree.

One of the reasons I’m asking about this is so I can figure out how best to advise my students on the use of AI for learning math. It’s apparently good enough to help me learn more advanced graduate/research-level material. Of course it still makes mistakes, but that’s OK; I don’t advise anyone to use it blindly. I don’t even advise blindly trusting calculators or computers.


New users are restricted from posting links. Just put ` around it, and then a moderator can fix it, e.g.

With `

https://chatgpt.com/share/68388f15-0b6c-8013-b787-f92cefa56ab8

Without ` (the link renders as an embedded preview, not reproduced here)

I tried to recreate your prompt. Since the prompt was cut off, I pasted the image of the expression; it looks like ChatGPT converted the image correctly but gave a different final answer.


This is not a typical question on this forum.

While this is technically not the type of question that should be asked on the Lean forum, many math professors are on that forum. I am not saying you should ask there, but then again I can’t stop you.

For me at least, if I don’t know how to verify an answer an LLM gives me, then I cannot trust it. For example, I like learning about quantum mechanics, but I cannot do it myself, so I cannot verify it and thus cannot trust anything ChatGPT tells me about it. For programming, however, I know the subject well and can quickly tell whether what it says is real, and even check it.

I can’t really say I agree with that.


Here is another reason not to trust LLM replies.


One last point worth noting. Many programmers are using LLMs to generate code and getting excellent results; however, some who have been doing it for a few months now note online that their ability to think through solving some problems has gotten weaker because they are relying on the AI too much.


Almost forgot

https://openai.com/index/teaching-with-ai/

HTH

This is fascinating, as 1/2 is the same answer given by Grok. ChatGPT is also getting it wrong for me right now too. I’m fairly certain that the computation I shared from ChatGPT is the correct one (with the answer being 1/4), as it aligns with WolframAlpha and some higher-level theory about the geometric mean of the function and its relationship to determinants of matrices built from the Fourier coefficients, plus some other probability theory.

Claude 4 Sonnet just reproduced the same initial solution given by ChatGPT. As far as I can tell, that computation resulting in 1/4 is correct. Grok just produced another nonsensical answer to it.

I guess it’s just still very unpredictable.

I didn’t mean for the forum to give me advice on teaching; the reason I’m investigating the AI’s math ability is so that I can give my students advice on its use.

I totally get the warning about relying on it too much for thinking. I was using it to generate some complicated code, and it was kind of boring as I missed the exhilaration of thinking through it more carefully. I also didn’t feel as motivated to carefully check code that I didn’t create myself.

Thanks for the help here and for providing some background and thoughts. I’ll continue to try and find more detailed information about how ChatGPT does math.


No need to follow up here. I’m just posting more information for any interested readers that happen to see this.

I asked ChatGPT to solve the same integral multiple times. Sometimes it got it right and sometimes wrong. Here is an example prompt with the precise integral:

Compute approximately and exactly:

$$\exp\left(\int_0^1 \log\left(\frac12+\frac12\cos(2\pi x)\right) dx\right)$$

I was perplexed by this. I pressed it for more details and sources for the identities it was using.

The correct value of this exponentiated integral is exactly 1/4.
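That value is easy to sanity-check numerically. A minimal sketch, assuming SciPy is installed (the integrand has an integrable log-singularity at x = 1/2, which we flag to the quadrature routine):

```python
# Numerically verify exp( integral_0^1 log(1/2 + 1/2*cos(2*pi*x)) dx ) = 1/4.
import numpy as np
from scipy.integrate import quad

f = lambda x: np.log(0.5 + 0.5 * np.cos(2 * np.pi * x))

# The log blows up (integrably) at x = 0.5; 'points' tells quad about it.
val, err = quad(f, 0, 1, points=[0.5])
print(np.exp(val))  # ~0.25, matching the exact value 1/4
```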

I did the hard work of computing it by hand, using some results I found on Math Stack Exchange. It was tedious, but I verified the result. I then asked ChatGPT to provide a reference for the integral identity it used, and it actually provided two references. I verified one of them but not the other.
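For interested readers, here is a sketch of one standard route to that value (the Math Stack Exchange argument may differ). A half-angle identity collapses the integrand:

$$\frac12+\frac12\cos(2\pi x)=\cos^2(\pi x),$$

so the integral equals $2\int_0^1 \log\lvert\cos(\pi x)\rvert\,dx = -2\log 2$ by the classical result $\int_0^1 \log\lvert\cos(\pi x)\rvert\,dx = -\log 2$, and therefore $\exp(-2\log 2) = \frac14$.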

It’s still quite amazing that it is able to put together these kinds of tricky mathematical things. I just wish I understood more about why it gets them right sometimes and wrong other times.

It uses a higher temperature by default, which makes responses vary and be more creative. Perhaps some variations are correct and others are not, in this case.
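For readers unfamiliar with the term: temperature rescales the model’s next-token probabilities before sampling. A toy sketch of the idea (assuming access to raw logits, which the ChatGPT interface does not expose):

```python
# Toy temperature sampling: low T sharpens the distribution toward argmax
# (deterministic), high T flattens it (more varied, "creative" output).
import numpy as np

def sample_token(logits, temperature, rng):
    if temperature == 0:
        return int(np.argmax(logits))      # greedy: always the top token
    z = np.asarray(logits) / temperature   # rescale the logits
    p = np.exp(z - z.max())
    p /= p.sum()                           # softmax
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5]
print([sample_token(logits, 1.0, rng) for _ in range(5)])  # varied picks
print([sample_token(logits, 0.0, rng) for _ in range(5)])  # always token 0
```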

I got varying results: 0, 0.5, and 0.25.

Setting the temperature to 0.0 then seems to yield the expected value of 0.25 (I’m not a mathematician).

Set temperature to 0.0 for this prompt.

Compute approximately and exactly:

$$\exp\left(\int_0^1 \log\left(\frac12+\frac12\cos(2\pi x)\right) dx\right)$$

While it seems to improve the results, it didn’t use the code interpreter. It would require a larger number of samples and more testing to evaluate whether the natural reasoning is actually correct or whether it was learned from training data (overfitting).
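For anyone who wants to run that larger test, here is a minimal sketch, assuming the official openai Python SDK and an API key in the environment (the model name and prompt wording are just this thread’s example, and the final instruction line is my own addition):

```python
# Ask the same integral N times at a fixed temperature and tally the
# distinct final answers, to see how stable the result actually is.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = (
    "Compute approximately and exactly: "
    r"$$\exp\left(\int_0^1 \log\left(\frac12+\frac12\cos(2\pi x)\right) dx\right)$$"
    " Reply with the exact value only."
)

answers = Counter()
for _ in range(20):  # more samples give a more reliable picture
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers[resp.choices[0].message.content.strip()] += 1

print(answers)  # how often each distinct answer appeared
```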

One important thing when you share results like this is to state which model you used. In this case I used ChatGPT 4o, the non-reasoning model. Reasoning models like o4-mini and o3 may provide better, more consistent results.


Thanks for that tip. I didn’t know about setting the temperature (though I had heard of it before). I also don’t understand the differences between the models, so I’ll spend some time learning about the different settings, options, etc.


This topic was automatically closed after 19 hours. New replies are no longer allowed.

Ran across this paper (06/02/2025) related to this topic.

“ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark” by Michael Shalyt, Rotem Elimelech and Ido Kaminer (PDF)


Thanks for this! I’m glad to at least have some references now that test various math abilities. My takeaway from that paper is that ChatGPT should be quite reliable on “simple” problems (e.g., any standard homework problem from a calculus textbook) that use standard formulas and techniques, even somewhat obscure ones, so long as there is a reference for them in the training data. But if you give it a perturbed problem (with many numbers, parameters, and symbols), then it is more likely to fail, even though it might produce output it confidently claims is correct. Again, two years ago this was not the case; it failed miserably even at such basic math.
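To illustrate what I mean by a perturbed problem, here is a toy version of the idea, assuming SymPy (the paper’s perturbations are far more systematic): take a seed problem the model handles reliably, then inject extra symbols and constants and see whether it still holds up.

```python
# Seed vs. perturbed problem: same structure, more symbols and constants.
import sympy as sp

x, a, b = sp.symbols("x a b")
seed      = sp.sin(x) * sp.exp(x)               # standard textbook integrand
perturbed = sp.sin(a * x + b) * sp.exp(x) / 7   # perturbed variant

for f in (seed, perturbed):
    print(f, "->", sp.integrate(f, x))
```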

I still wish I had some documentation clarifying the degree to which ChatGPT performs actual mathematical operations (arithmetic, symbolic manipulation) as opposed to just generating text by pattern-matching. That being said, I’m not even sure of the correct way to formulate the question, or even how to conceptualize it. As to what degree symbolic manipulation is or isn’t itself just pattern-matching, I am more confused now than ever! At what point do I say such a machine is doing actual symbolic manipulation according to the accepted rules? I am not sure…

I’ve continued doing some testing and am very impressed with the results (even though it still makes errors). I’ve spent time testing various other online AIs as well (Grok, Thetawise, Gemini, Claude); they all perform similarly and all seem to use actual computational tools like Python (though I cannot be sure what is going on under the hood, so to speak).


The part that caught my attention is in Section 4:

Until recently, the hybrid LLM+CAS approach appeared to be the most promising path forward. However, the surprising finding that frontier models no longer benefit from CAS use for symbolic math triggers deeper and more fascinating possibilities.

For me, using LLM + CAS was never in question; now it seems that LLMs alone might be enough, though I still want hard evidence and repeated success before making the change.
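For reference, the hybrid pattern the paper is questioning looks roughly like this (a minimal sketch with SymPy as the CAS; the LLM step is a stand-in stub, not a real API call):

```python
# LLM+CAS loop: the LLM proposes an antiderivative as text,
# and the CAS verifies the proposal by differentiating it back.
import sympy as sp

x = sp.symbols("x")

def llm_propose(integrand_text: str) -> str:
    # Hypothetical stub; a real system would query a language model here.
    return "x*sin(x) + cos(x)"  # candidate antiderivative of x*cos(x)

integrand = sp.sympify("x*cos(x)")
candidate = sp.sympify(llm_propose("x*cos(x)"))

ok = sp.simplify(sp.diff(candidate, x) - integrand) == 0
print("verified" if ok else "rejected")  # -> verified
```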

Yes, I was particularly blown away when ChatGPT easily solved an integral that WolframAlpha couldn’t even understand. I envision future systems having various tools all integrated together (LLM, CAS, etc.).

This breeds excitement (and a little fear) about the future of this technology!


Another paper related to this topic was recently posted (06/10/2025)

“CALT: A Library for Computer Algebra with Transformer” by Hiroshi Kera, Shun Arakawa and Yuta Sato (PDF)

Given a sufficient number of examples of symbolic expressions before and after the target computation, Transformer models—highly effective learners of sequence-to-sequence functions—can be trained to emulate the computation.
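To make the quoted idea concrete: the training data for such a model is just pairs of serialized expressions, before and after the computation. A toy sketch of generating such pairs, assuming SymPy and differentiation as the target computation (CALT’s actual pipeline is more elaborate):

```python
# Generate (input, target) text pairs: an expression and its derivative.
import random
import sympy as sp

x = sp.symbols("x")
atoms = [sp.sin(x), sp.cos(x), sp.exp(x), x**2, x**3]

def random_expr(n_terms=3):
    return sum(random.choice([1, 2, 3]) * random.choice(atoms)
               for _ in range(n_terms))

random.seed(0)
pairs = [(str(e), str(sp.diff(e, x))) for e in (random_expr() for _ in range(3))]
for src, tgt in pairs:
    print(src, "->", tgt)
```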


FYI

The way to find these papers soon after they are posted is to read this arXiv filter daily.