I’ve finally grounded the API on some new data using embeddings, while it can still draw on the older data it has. Then I asked it to do a simple evaluation and it didn’t know the difference between a decrease and an increase.
Here is the discussion:
User: how did zip 77449 perform over the 2020-2022 years
Model: The Zhvi value for zip code 77449 decreased from 217331.8979 on 12/31/2020 to 278548.8484 on 12/31/2022, a decrease of 28.7%.
User: that isn’t a decrease
Model: That is correct, the Zhvi value for zip code 77449 increased from 217331.8979 on 12/31/2020 to 278548.8484 on 12/31/2022, an increase of 28.7%.
The values are the correct ones from the embedding, but it makes such a fundamental mistake here that I have to ask: how can I use this? Is there something I am missing?
I just want it to be able to have a conversation about it. Like I asked, how did it perform. It’s not math so much as a discussion about math. I guess I’m confused as to when this thing is accurate. I use it for discussion posts (I’m a college professor) and it does great. Is there no situation where the temp is 0 and it will be consistently correct?
As an example, let’s say you ask it what the best game to play in the casino over a one-hour period is, and I’ve given it the relevant data over time. Why won’t it look at the data and do a one-hour computation about which is best? If it can’t do something as simple as look at a trend, when can it be trusted or useful? People are claiming it made $2k on the market (not something I trust given the market is going up), but the idea is that you give it numbers and it should be able to have a discussion based on them.
I’ve been fiddling with this for months now and just seem to have ended up at the conclusion that they are entertaining, but can they really be trusted to do things like answer help calls? I asked it about my carburetor earlier and it said it was a 500 CFM; when I said no it isn’t, it corrected itself and said I was right, it is a 750, just like the problem above. That isn’t even math, this is just inaccuracy. I think as long as I use embeddings I can get rid of this type of problem, but if it can’t answer a simple question without me having to evaluate the result, I’m just not sure where the value is and how much of this is “AI” (this is the professor in me now).
Finally, it seems to have gotten more inaccurate over these past few months. Am I the only one noticing this?
I guess again, even in a situation like being the help desk for a Holley carburetor, it messes up on a fact about it. I didn’t use embeddings for that, but then I look at how it called an increase a decrease in my embedded data and still wonder about it. Understanding an increase versus a decrease isn’t math, it’s language. That’s like saying I’m happy and it thinking I’m sad…
I’m not complaining, I’m trying to understand the limitations and applicability. I need to teach my students about this and I don’t think you can teach without doing, so I’m trying to push it past a simple conversation and see what is possible compared to what the news reports which is mostly bull…
I’ve been having my intro Python class create QA bots as a course project for over a decade, telling them that if we just kept training it, it would get “smarter and smarter.” Really, though, it just came down to identifying the previously given answer (stored in a file) that most closely matched the question asked. That isn’t AI, that’s statistics. Is this just a statistical model of existing answers? So if I want it to know that 210000 to 280000 is an increase, do I have to have that answer in there?
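For what it’s worth, the “closest stored answer” bot I have the students build boils down to a few lines. A toy sketch (bag-of-words cosine similarity standing in for real embeddings; the stored pairs are made up):

```python
# Toy nearest-match QA bot: return the stored answer whose stored
# question is most similar to the incoming question.
from collections import Counter
import math

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token-count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def answer(question, qa_pairs):
    """Pick the stored (question, answer) pair that best matches."""
    tokens = question.lower().split()
    best = max(qa_pairs, key=lambda qa: cosine(tokens, qa[0].lower().split()))
    return best[1]

pairs = [
    ("what is the capital of france", "Paris"),
    ("did home values go up in 77449", "Yes, they increased about 28% from 2020 to 2022"),
]
print(answer("did prices go up in 77449?", pairs))
```

That really is just retrieval statistics, which is exactly the point: nothing in it “understands” an increase, it only matches questions to canned answers.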
For most knowledge retrieval you would want to use a database and THEN use GPT to convert the returned data into a natural language response.
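A minimal sketch of that split, assuming you’ve already pulled the two ZHVI values from your database: code decides increase vs. decrease and computes the percentage, and only the finished sentence would be handed to GPT to rephrase (the API call itself is omitted).

```python
# Do the arithmetic deterministically; the model only verbalizes the result.

def describe_change(start, end):
    """Return pre-computed facts the model can safely rephrase."""
    pct = (end - start) / start * 100
    direction = ("increased" if end > start
                 else "decreased" if end < start
                 else "stayed flat")
    return f"ZHVI {direction} from {start:,.0f} to {end:,.0f}, a change of {pct:+.1f}%."

facts = describe_change(217331.8979, 278548.8484)
print(facts)
# These facts would then go into the prompt, e.g.:
# "Rephrase this for the user: " + facts
```

With this pattern the increase/decrease call can never be wrong, because the model never makes it.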
To be fair, you can probably fix this issue by allowing it to “reason” out its answer first. This is a fantastic read:
(An excerpt that matches this situation well)
If you were asked to multiply 13 by 17, would the answer pop immediately into your mind? For most of us, probably not. Yet, that doesn’t mean humans are incapable of two-digit multiplication. With a few seconds, and some pen and paper, it’s not too taxing to work out that 13 x 17 = 130 + 70 + 21 = 221.
Similarly, if you give GPT-3 a task that’s too complex to do in the time it takes to calculate its next token, it may confabulate an incorrect guess. Yet, akin to humans, that doesn’t necessarily mean the model is incapable of the task. With some time and space to reason things out, the model still may be able to answer reliably.
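Applied to your ZHVI question, that advice just means restructuring the prompt so the comparison step happens before the verdict. A hypothetical version:

```python
# Prompt that forces the comparison before the increase/decrease verdict,
# instead of asking for the verdict directly.
prompt = """Data: zip 77449 ZHVI was 217331.90 on 2020-12-31 and 278548.85 on 2022-12-31.

First, state which value is larger and subtract the earlier value from the later one.
Then, and only then, say whether this is an increase or a decrease and by what percent."""
print(prompt)
```

No guarantee, but in my experience making the model show the subtraction first makes the increase/decrease flip far less likely.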
Ugh. Yes. I can’t count the number of “I make $1,000/day using ChatGPT” claims. Maybe so, but much like ChatGPT itself, they make a bunch of claims with no substance. The thoughts of the passionate not-for-profit users here are, in my opinion, infinitely more valuable than major news sources.
Yes, but no. The fact that it can hallucinate means that it can generate unique content and “reason” things out. Again, check out the article above and it should help give you an idea on how to steer it.
If you are a programmer then I’d say that following the principles of “Separation of Concerns” and “Single Responsibility” will help a lot.
I get what you are saying, I just saw it as an “all in one solution” and coming to the realization it isn’t. You still have to do all the hard work in the background. I saw an article that said it gets 75% of code questions wrong or something like that and, yeah… you have to debug the code it gives you but still better than having to write it all out initially. It has increased my productivity greatly for what I do. But the idea it will write a video game (what all students want) is just not anywhere in there that I can see. I think it does a great job one small module at a time.
I do not see this as the breakthrough I initially thought it was. It has great potential in some areas but really no value in others. You don’t have time to work through problems, which I’ve done with this over hours and hours of conversations, when selling a product to someone. If someone asks the LLM what housing market has seen the greatest increase in value over the last 10 years they expect an answer, and if I have the data already in there, why on earth would I ever bother using a model that may give a wrong answer? I guess it’s fun to have these conversations with it, but is this just a fad that will settle into its niche as an API for the things it can do? I saw it as a replacement for Google but I’m starting to move away from that thought now. I don’t use Bing as the version there is pretty limited and I have to use Edge (no thanks). I think if they would allow other browsers I would use it more.
The fact that it can understand what you are saying to it is an incredible breakthrough. As a programmer it has always been a dream to literally “ask” or “tell” something in natural language and receive a response.
OpenAI is trying to bridge these gaps. I wouldn’t listen to any articles that try to throw statistics in. As noted the results highly depend on the initial prompt. I’d say in most cases you can properly steer the model towards the correct answer.
These weaknesses you are discovering have … kind-of solutions: plugins for information retrieval and processing; Web Browsing, which was around for a while but has been down for some time; and Code Interpreter, to perform analytics on data.
Beforehand you would need to somehow parse the hundreds of different ways to say this, and also respect the nuances in the language to accurately respond.
What you could do is ask for the LLM to gather these statistics online or through a plugin, and then process the results. It sure beats doing all the leg-work yourself.
Think of it like this. Now I can ask it to retrieve the housing market in my area. Then ask it to create and run code to compare all of it and return the highest value. Then I can create some time series graphs with lots of fancy numbers and dollar signs. All within 10 minutes and without knowing how to code.
It’s a massive breakthrough and stepping stone towards an AGI (maybe)
GPT-4 is far better at some things than 3.5. Here’s a more detailed study: GPT-4
People seem to expect 99% accuracy or something and then blame it when it isn’t an oracle. GPT-4 performs better than 75% of humans on a lot of exams. GPT-3.5 performs worse than 75% of humans on a lot of exams. Both suck at math though - see the left side where 3.5 scores at the bottom percentile on calculus, but GPT-4 kinda scores in the lower half.
Many of the amazing AI-will-take-your-job things are done with GPT-4. However, it’s not strictly better. It has improved reasoning and answers exam-style questions very well, but it has a smaller attention scope and can be worse at some things. The version numbers are confusing; they should probably be named like the engines (ada, babbage, curie, davinci), since some are better at certain tasks than others.
all in one solution
ChatGPT, the one hosted on chat.openai.com, appears to be the product handling this. It’s better at all-in-one solutions, but worse in some cases - sort of like training wheels on a bike.
It has code interpreter, which is much better for data like this. It has plugins, which people are excited about because it solves niche problems.
Many math problems are handled by having it write code that solves the problem for you. It’s excellent at writing code and building calculators, poor at math, and so it realizes that the solution is to build a calculator. I’m not sure whether this is how ChatGPT does it, but it’s how some autonomous AI agents and third-party tools are approaching it.
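The extract-and-run loop those tools use can be sketched roughly like this (the model reply is hard-coded for illustration, and a real tool would sandbox the execution):

```python
# Sketch of the "let it build a calculator" pattern: pull the first fenced
# code block out of a model reply, run it, and capture what it prints.
import contextlib
import io
import re

def run_model_code(reply):
    """Extract the first fenced Python block from a reply and execute it."""
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    if not match:
        return None
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(match.group(1), {})  # NOTE: no sandboxing here; never do this with untrusted code
    return buf.getvalue().strip()

fake_reply = ("Let me compute that:\n"
              "```python\n"
              "print((278548.85 - 217331.90) / 217331.90 * 100)\n"
              "```")
print(run_model_code(fake_reply))
```

The arithmetic is done by the interpreter, not the model, which is why this sidesteps the increase/decrease kind of mistake.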
But from the confines of an API call, it can’t do this. So the API appears to have much lower quality for some problems.
Sorry - I don’t agree with this statement. If someone says “The temperature changed from 72 to 76” that is NOT language. That is a mathematical calculation that needs to be run. 76 - 72 > 0
Correct … if you’ve checked out Code Interpreter, that is much more apt for the math-based tasks you are referring to.
It also does a great job of quickly speccing out code … so instead of trying to tell the developer something in plain English, you can actually have a conversation with ChatGPT, apply CoT, modify the flow, and give the developer something that is probably 90% ready to run … this greatly cuts down development time.
Code Interpreter would do great with questions like this (we just need something like CI in API form and then all these problems you are referring to will be solved)