Dear OpenAI Staff:
~~ ~~ ~~ ~~
This post discusses GPT-3’s ability to solve math questions. It takes a detailed look at a previously conducted study: how the study was carried out, my own insight into it, and discussion with others to gather varying opinions and views.
This post contains 4 (pretty cool) screenshots of GPT-3 (Davinci) doing basic algebra, attributed to me, and 1 screenshot by Qasim Munye. I reviewed what’s allowed regarding posting screenshots on social media, but I believe this case is different since it is within the OpenAI Community, where this topic is freely discussed. Simply let me know if anything in the post needs to be changed!
-Nick
~~ ~~ ~~ ~~
Prologue: I’m fairly new here!
~~ ~~ ~~ ~~
Hello world!! I hit the ground running with an unavoidable smile after being completely taken aback by everyone’s AWESOME energy on here and all the MINDBLOWING projects being showcased and demoed. Seems like every project has a creativity level over 9000. Although I may not get as creative as some of y’all, I seriously do hope to be a good contributor here. I have lots to learn and much to teach, with the ultimate goal of helping as many people as I can.
I totally forgot to introduce myself to the OpenAI Community, but I can get that done here really quick. My name is Nicholas Hickam (DutytoDevelop). My career and hobby started at 11, when I saw people botting on Runescape: a program plays Runescape for you with total autonomy, saving you from exerting energy on repetitive tasks. Where most users saw just another script to help them advance in a game, I saw the building blocks for real-world systems that automate and control anything within reach. In addition to programming, I like using mathematics to describe how systems work, from how energy flows throughout the Universe to predicting the behavior of any system.
I’ve since acquired a Bachelor’s degree in Computer Science with a minor in Mathematics, and like any new guy on the block, I’m constantly searching for people and places working to solve huge problems and improve humanity. I like to believe that somewhere out there is an amazing team that helps solve problems well beyond the scope of humankind, like superheroes do!
~~ ~~ ~~ ~~
Overview:
~~ ~~ ~~ ~~
Alright, if any of the content below is inaccurate, biased, or just slightly off base, let me know so I can fix any misconceptions. I am not an A.I. expert, but after getting acquainted with GPT-3 and all the amazing things it is capable of, I figured I’d make a catchy, informative post that looks into a study that took a swing at GPT-3 without much constructive criticism. I’m no expert, but either way I see a learning experience in looking into this study.
After some digging around on the Internet, I started coming across articles saying that GPT-3 isn’t good at math, which was quite surprising considering it was able to correctly identify that a 10-year-old boy has asthma given a brief list of symptoms; normally, only a medical professional could make that diagnosis. Here’s the screenshot of GPT-3’s very accurate medical diagnosis:
Image courtesy of Qasim Munye, Twitter (Source). [Language model settings not provided.]
With the conflicting reports regarding GPT-3’s capabilities, I want to address the confusion I have and make this post insightful for others who are also in the midst of learning about GPT-3. I came across a study that concluded GPT-3 isn’t good at math. Before we jump to conclusions about what GPT-3 can and can’t do, I’d point out that GPT-3 picks up quite a lot from very little training data, so my personal stance is that GPT-3 wasn’t taught correctly during the study, not that it has poor math skills. I apologize if this post seems rather spontaneous, but I believe the study largely jumped to conclusions, and if what I’ve seen holds up, there’s certainly room for debate here.
I don’t have nearly as much training data as the study in question, but the quality of the data I’ve looked over can certainly support the arguments I’ll present. This post aims not only to understand GPT-3 better, but also to help disperse misconceptions found on the Internet by sharing our own findings. That way, we can see what comes out of this post and the value it holds going forward!
Action Time:
~~ ~~ ~~ ~~
Alright, the study conducted by researchers used a dataset consisting of 12,500 problems from high school math competitions. The problems vary in difficulty, and each falls into one of seven general areas of mathematics: prealgebra, algebra, intermediate algebra, number theory, counting and probability, geometry, and precalculus.
The researchers allowed GPT-3 “scratch space” so that it could show its work before arriving at the final answer, or it could simply provide the answer directly; in both scenarios, training time was limited at first, with more allowed on later runs. The study concluded that GPT-3 was only about 25% accurate at best (when no “scratch space” was used to show work and more training time was allotted on the dataset).
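To make the two setups concrete, here’s a rough illustration of what I understand the two prompt styles to look like. These example problems are my own, not taken from the study’s dataset:

```python
# Hypothetical illustration (not from the study itself) of the two prompt
# styles described above: one that asks for the final answer only, and one
# that gives the model "scratch space" to show its work first.

final_answer_prompt = (
    "Problem: If 3x + 2 = 11, what is x?\n"
    "Final Answer: 3\n"
    "\n"
    "Problem: If 5x - 4 = 21, what is x?\n"
    "Final Answer:"
)

scratch_space_prompt = (
    "Problem: If 3x + 2 = 11, what is x?\n"
    "Work: Subtract 2 from both sides to get 3x = 9, then divide by 3.\n"
    "Final Answer: 3\n"
    "\n"
    "Problem: If 5x - 4 = 21, what is x?\n"
    "Work:"
)
```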
Now, 25% sounds quite low to me, so I wanted to get an idea of how GPT-3 performs on algebra questions I gave it, and which methods help it learn. The questions are expressed not only mathematically, with symbols and digits, but also through natural language, where the math problem reads more like a word problem.
Here’s GPT-3 demonstrating substitution of variables to answer algebraic questions:
Given 3 algebraic sample problems, we see that GPT-3 was able to correctly answer the 4th problem with ease.
This second run consisted of 13 questions, 5 containing only 1 algebraic variable and 8 containing 2 algebraic variables.
Given enough sample problems, GPT-3 demonstrates the ability to assign a numerical value to its corresponding symbol and then solve the given algebraic equation through substitution. We simply gave GPT-3 enough sample problems for it to identify what the question is asking and what the symbols mean within an equation. After reviewing the token probabilities, I can see that it has learned to assign values to x and y and solve the equation.
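For anyone who wants to try reproducing this, here’s a minimal sketch of the few-shot pattern I used, written against the openai Python library as it existed at the time. The exact prompt text is in the screenshots, so treat this snippet as illustrative only (the API key is a placeholder):

```python
import openai  # pip install openai (pre-1.0 Completion API, GPT-3 era)

openai.api_key = "YOUR_API_KEY"  # placeholder

# A few-shot prompt: three worked substitution examples, then a fourth
# problem for the model to complete on its own.
prompt = (
    "If x = 2 and y = 3, then x + y = 5\n"
    "If x = 4 and y = 1, then x + y = 5\n"
    "If x = 7 and y = 2, then x + y = 9\n"
    "If x = 5 and y = 6, then x + y ="
)

response = openai.Completion.create(
    engine="davinci",  # the base Davinci model shown in the screenshots
    prompt=prompt,
    max_tokens=3,      # we only need the numeric answer
    temperature=0,     # deterministic: always pick the most likely token
    logprobs=5,        # lets you inspect token probabilities, as described above
)

print(response["choices"][0]["text"].strip())  # expected: 11
```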
Now, although I did succeed in teaching GPT-3 basic algebra, there are a few caveats keeping GPT-3 from working alongside gifted math prodigy Terence Tao anytime soon. If you look at the screenshots above, you’ll see that the numbers I used were low, rarely going beyond double digits. This is because GPT-3 saw any number with two or more digits as something other than just the numerical representation of that number. It’s not a good start for proving my argument, but that’ll change.
During the development of this very post, I decided to try expressing the problems differently, giving the model the same data in a different form to see how it performs. When the algebra problems were phrased in natural language, GPT-3 not only handled large numbers with ease, it also seemed to pick everything up faster while being less error-prone. Looking back, we may find the symbolic algebra problems easier to read, but we can’t ignore the best way to interact with GPT-3. By the very definition of the term, GPT-3 is a language model, so communicating with it seems to work best when the data is expressed in natural language. This can be done by simply rephrasing any question as a word problem. Nor is this applicable only to mathematics: applications where GPT-3 could help with chemistry or biology problems may have better chances right off the bat if communication stays as close to natural language as possible, improving GPT-3’s odds of understanding and applying the information given to it:
When the same questions are formatted differently, matching the natural-language data GPT-3 was trained on, we see a deeper understanding of algebra from GPT-3:
Expressing the algebra problems in word form instead of digits allowed GPT-3 to better understand what was being asked of it, and it did very well at adding large numbers despite their being written out.
Second screenshot showcasing GPT-3 answering algebra questions, expressed in natural language, that it couldn’t answer before.
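To show what I mean by rephrasing, here’s the same (made-up) problem in both styles. The second is the kind of phrasing that handled large numbers well in my runs:

```python
# The same problem phrased two ways. In my runs, prompts in the second,
# natural-language style handled much larger numbers reliably. This pairing
# is illustrative, not the exact text from the screenshots above.

symbolic = "x = 1200, y = 3450, x + y = ?"

natural_language = (
    "If x equals one thousand two hundred and y equals "
    "three thousand four hundred fifty, what is x plus y?"
)
```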
It seems we’ve done a few things to help GPT-3 not only understand algebra problems but also learn to solve them confidently. Given enough well-defined sample problems, it is most certainly possible to train GPT-3 to produce the output you desire. The biggest factor I see here is gathering enough high-quality training data that builds from idea to idea, slowly developing an understanding of math at various levels of complexity.
Let’s address the elephant in the room: using a language model such as GPT-3 to solve math problems won’t be as effective as models purpose-built for answering math problems, because, very simply, GPT-3 was designed to take text, like a phrase or a sentence, and return a text completion in natural language, as stated by OpenAI. Mathematics is conveyed through a symbolic language, not the natural language GPT-3 was trained on. What I want to determine in this analysis is whether GPT-3 can apply its training data to effectively demonstrate an understanding of algebra, and to show that the biggest reason the study got poor results, from my own understanding, is the formatting of the data the model was trained on.
What I see here is potential, lots of it in fact. With that, I’d like to return to the study that was conducted. A quick peek at the dataset the university provided to GPT-3 shows that the researchers ran into the same issue I did when training GPT-3, and although it isn’t devastatingly bad, the model ends up with far less confidence and accuracy when it doesn’t learn correctly:
I’ve highlighted the areas that are problematic for GPT-3 when it goes over these questions. You can see that numerical values and symbols are used quite often, and as I’ve said before, training with data that includes mathematical symbols and digits will work in some cases. But by rephrasing the question in word form and replacing the equals symbol with the word “equivalent,” you’ll notice improvement, though only if the model has also been properly trained on probability and the other topics covered in its Khan Academy pre-training dataset.
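As a quick sketch of how you might rephrase a dataset like this automatically, here’s a small helper. It leans on the third-party num2words package, and it’s only a rough illustration of the idea, not anything the researchers did:

```python
import re
from num2words import num2words  # third-party: pip install num2words

def verbalize(question: str) -> str:
    """Rewrite a symbolic math question closer to natural language:
    spell out integers and swap common symbols for words."""
    question = question.replace("=", " is equivalent to ")
    question = question.replace("+", " plus ")
    # Spell out each run of digits, e.g. "42" -> "forty-two".
    question = re.sub(r"\d+", lambda m: num2words(int(m.group())), question)
    return re.sub(r"\s+", " ", question).strip()

print(verbalize("x = 42, y = 17, x + y = ?"))
# -> x is equivalent to forty-two, y is equivalent to seventeen,
#    x plus y is equivalent to ?
```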
Conclusion:
~~ ~~ ~~ ~~
Am I on the right track in concluding that the study did not train GPT-3 correctly to properly answer the questions? What have y’all done to teach GPT-3 new concepts and ideas? I didn’t even need to fine-tune GPT-3 for it to quickly learn basic algebra, but would fine-tuning really give me that much more control over the training process? I’m sure I could teach it other mathematical concepts with the right data and time.
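For reference, if I were to try fine-tuning, my understanding is the data would go in the prompt/completion JSONL format that the GPT-3 fine-tuning endpoint expected at the time. The examples and file name below are hypothetical:

```python
import json

# A minimal sketch of fine-tuning data for this task, in the
# prompt/completion JSONL format used by GPT-3 fine-tuning at the time.
# Examples and file name are hypothetical.
examples = [
    {"prompt": "If x equals two and y equals three, what is x plus y? ->",
     "completion": " five"},
    {"prompt": "If x equals four and y equals nine, what is x plus y? ->",
     "completion": " thirteen"},
]

with open("algebra_examples.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then, with the era's CLI:
#   openai api fine_tunes.create -t algebra_examples.jsonl -m davinci
```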
What is your view on this study and how I approached it? Which observations were spot-on, on the right track, or mostly wrong, so I can improve? I realize there wasn’t much training data on my end, but the fact is that we did teach GPT-3 algebra, and I can only assume that if GPT-3 is properly trained, it can learn other mathematical concepts as well.
~~ ~~ ~~ ~~
Summary:
- Since this is my first time really contributing here, I would appreciate constructive criticism!
- When fine-tuning a model, is too much training data a potential risk even if the data is well-defined?
- Thank you for taking the time to read this post. I put a lot of effort into it and would like to improve it where possible.
- P.S. For those who were already notified of this post several hours ago, I sincerely apologize. It was nowhere near complete, and there didn’t seem to be a way to delete it entirely, so I’m reposting it now.
- P.P.S. Cool links y’all may like (Actually free programming books & a comprehensive Python cheatsheet)
Don’t blame GPT-3, blame the teacher (not the creators, just to be clear)!