After text-davinci-003 was deprecated, the new model doesn't work

Since text-davinci-003 was deprecated, I’ve been trying to use gpt-3.5-turbo-instruct. We use these models to summarize long transcriptions, and the old model worked very well: we asked it to stay under 2,000 characters and to keep the response in a first-person, conversational voice.

Now, with gpt-3.5-turbo-instruct, it almost never keeps the response under 2,000 characters. The sentence structure is terrible, with openings like “oh, I” or “Well”. We never had these issues before, and everything I’ve tried has been terrible. This model ignores many of my instructions and is very flaky.

We also used the old model to check whether there was anything offensive in the summary, which worked great as well. Now it’s like Gemini: it finds every little aspect offensive, the kind of thing that would be marked as ultra-woke (and I hate using that term), but this is what everyone is coming back with.

Sorry for the long text here, but I’ve tried contacting support and OpenAI has completely ignored our messages. If anyone can help, I would greatly appreciate it.

Here’s what I’ve found about gpt-3.5-turbo-instruct:

  • The “turbo” part of the name means it is a smaller model than the 175B-parameter davinci was. The power of gpt-3.5-turbo to infer semi-sensible replies came from all the chat training, but that has been backed out here.
  • We aren’t told what kind of separator or format is ideal, so as far as “instruct” goes, we’re somewhat on our own: we have to invent a format that makes the AI produce a response where we want it to. That is the ultimate meaning behind “prompt”: prompting the AI to start writing (a sketch of one such format follows below).
  • The AI still has very strong completion tendencies, indicating that the instruction tuning was given a light touch. You get a lot less “following instructions about what a summary should look like” and a lot more “completing the text”: you fool the AI into continuing a document that looks like it should contain a summary.

(Besides, the AI can’t count or see characters; whatever counting ability was emergent in davinci has been sparsified away in turbo.)
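
Here is a minimal sketch of that “invent your own format” approach, using the pre-v1 openai Python SDK. The ### separators, the word-count target, and the final character trim are my own inventions, not anything documented for this model; since the AI can’t count characters, asking for a rough word target and trimming in code is an assumption that tends to work better in practice:

```python
import openai  # pre-v1 SDK (openai==0.28.x); set openai.api_key first

transcript = "..."  # your long transcription goes here

# Frame the task as a document the model will want to complete:
# instructions up front, an invented separator, and a cue marking
# exactly where the summary should begin.
prompt = (
    "Summarize the transcript below in the first person, as if the "
    "speaker were recapping their own conversation. Keep it under "
    "250 words.\n\n"
    "### TRANSCRIPT ###\n"
    f"{transcript}\n\n"
    "### SUMMARY ###\n"
)

response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=400,
    temperature=0.4,
    stop=["###"],  # cut the completion off if it starts a new section
)

summary = response["choices"][0]["text"].strip()
summary = summary[:2000]  # hard-enforce the 2k-character limit in code
print(summary)
```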

I would migrate over to gpt-3.5-turbo-0613 (or its 16k cousin), where you can use a system prompt to clearly define a specialized role that the AI will serve, the type of input it will be receiving, and the style of response it can give. The data to be processed can be a user message with just a small introduction of the exact task to be performed.
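
Here is a minimal sketch of that chat-based setup, again with the pre-v1 SDK; the wording of the system prompt is an illustration to adapt, not a tested recipe:

```python
import openai  # pre-v1 SDK (openai==0.28.x); set openai.api_key first

transcript = "..."  # raw transcript text

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",  # or gpt-3.5-turbo-0613 for shorter inputs
    messages=[
        {
            "role": "system",
            "content": (
                "You are a summarization assistant. You receive raw text "
                "from audio transcripts and write a first-person summary "
                "of the conversation, with a target length of 250 words."
            ),
        },
        {
            "role": "user",
            "content": f"Summarize this transcript:\n\n{transcript}",
        },
    ],
    temperature=0.4,
)

print(response["choices"][0]["message"]["content"])
```

Because the roles are kept separate, a user message full of transcript text is much less likely to be mistaken for instructions, which helps with the instruction-ignoring behavior described above.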


Demonstration: Summarize an audio transcript from the GPT-4 developer livestream, March 2023.

Model: gpt-3.5-turbo-16k

System message:

You are a helpful writing assistant AI. You operate and process raw text from an audio transcript, and create a summary of the topic and the conversation flow with a target length of 500 words, in five 100-word paragraphs.

Input message: 4432 tokens
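
(If you want to check how many tokens your own transcript occupies before sending it, tiktoken can count them; a small sketch, where transcript_text stands in for whatever string you plan to send:)

```python
import tiktoken

# gpt-3.5-turbo-family models use the cl100k_base encoding
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(enc.encode(transcript_text)))  # 4432 for the input used here
```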

Summary

The foreign GPD4 developer demo live stream was conducted recently. Honestly, it’s kind of hard for me to believe that this day is here. OpenAI has been building this technology really since we started the company, but for the past two years, we’ve been really focused on delivering GPT4.

That started with rebuilding our entire training stack, actually training the model, and then seeing what it was capable of. We tried to figure out its capabilities, its risks, and worked with partners in order to test it in real-world scenarios. We really focused on tuning its behavior, optimizing the model, and getting it available so that you can use it.

So today, our goal is to show you a little bit of how to make GPT4 shine, how to really get the most out of it. We want to show you where its weaknesses are, where we’re still working on it, and just how to really use it as a good tool, a good partner.

If you’re interested in participating in the stream, you can go to our Discord. There, you’ll find comments and we’ll take a couple of audience suggestions.

The first thing I want to show you is the first task that GPD4 could do that we never really got 3.5 to do. Throughout training, you’re constantly doing all this work. It’s 2 A.M, the pager goes off, you fix the model, and you’re always wondering if it’s going to work. Is all this effort actually going to pan out?

We all had a pet task that we really liked and that we would all individually be trying to see if the model was capable of it. I’m going to show you the first one that we had a success for with GPD4, but never really got there for 3.5.

I’m just going to copy the top of our blog post from today and paste it into our Playground. This is our new chat completions playground that came out two weeks ago. I’m going to show you first with GPT 3.5. GPT4 has the same API to it, the same playground.

The way that it works is you have a system message where you explain to the model what it’s supposed to do. We’ve made these models very steerable so you can provide it with really any instruction you want, whatever you dream up, and the model will adhere to it pretty well. In the future, it will get increasingly powerful at steering the model very reliably.

You can then paste whatever you want as a user. The model will return messages as an assistant. We’re moving away from just raw text in, raw text out, where you can’t tell where different parts of the conversation come from, but towards this much more structured format that gives the model the opportunity to know well, this is the user asking me to do something that the developer didn’t attend. I should listen to the developer here.

Now, it’s time to actually show you the task that I’m referring to. Everyone’s familiar with “summarize this article into a sentence”. But let’s get a little more specific. Let’s say, summarize this article into a sentence where every word begins with G.

This is 3.5, let’s see what it does. Yeah, it kind of didn’t even try. It just gave up on the task. This is pretty typical for 3.5 trying to do this particular kind of task. If it’s a very stilted article or something like that, maybe it can succeed, but for the most part, 3.5 just gives up.

But let’s try the exact same prompt, the exact same system message, in GPT4. It’s kind of borderline whether you want to count AI or not, but let’s say AI doesn’t count, that’s cheating. So fair enough, the model happily accepts my feedback.

So now, to make sure it’s not just good for G’s, I’d like to turn this over to the audience. I’ll take a suggestion on what letter to try next. In the meanwhile, while I’m waiting for our moderators to pick the lucky letter, I will give a try with A. But in this case, I’ll say GPD4 is fine. Why not?

Also, it’s a pretty good summary. So I’ll hop over to our Discord. Wow, people are being a little ambitious here. They’re really trying to put the model through the paces. We’re going to try Q, which if you think about this for a moment, I want the audience to really think about how would you do a summary of this article that all starts with Q. It’s not easy.

It’s pretty good, that’s pretty good. All right, so I’ve shown you summarizing an existing article. I want to show you how you can flexibly combine ideas between different articles.

So I’m going to take this article that was on Hacker News yesterday, copy-paste it into the same conversation so it has all the context of what we’re just doing. I’m going to say, find one common theme between this article and the GPD4 blog.

This is an article about Pinecone, which is a Python web app development framework, and it’s making the technology more accessible, user-friendly. If you don’t think that was insightful enough, you can always give some feedback and say that was not insightful enough.

Please, no, I’ll just even just leave it there, leave it up to the model to decide. So, “Bridging the Gap between powerful technology and practical applications” seems not bad. And of course, you can ask for any other kind of task you want.

Using its flexible language understanding and synthesis, you can ask for something like, “Now turn the GT4 blog post into a rhyming poem.” It picked up on “OpenAI evalues open source for all, helping to guide, answering the call.”

By the way, if you’d like to contribute to this model, please give us evals. We have an open-source evaluation framework that will help us guide and all of our users understand what the model is capable of and to take it to the next level.

So there we go, this is consuming existing content using GPT4 with a little bit of creativity on top. But next, I want to show you how to build with GPT4, what it’s like to create with it as a partner.

And so the thing we’re going to do is we’re going to actually build a Discord bot. I’ll build it live and show you the process, show you debugging, show you what the model can do, where its limitations are, and how to work with them in order to achieve new heights.

The first thing I’ll do is tell the model that this time it’s supposed to be an AI programming assistant. Its job is to write things out in pseudocode first and then actually write the code. This approach is very helpful so that the model breaks down the problem into smaller pieces.

That way, you’re not asking it to just come up with a super hard solution to a problem all in one go. It also makes it very interpretable because you can see exactly what the model was thinking and you can even provide corrections if you’d like.

So here is the prompt that we’re going to ask it. This is the kind of thing that 3.5 would totally choke on if you’ve tried anything like it. But so we’re going to ask for a Discord bot that uses the GPD4 API to read images and texts.

Now there’s one problem here, which is this model’s training cutoff is in 2021, which means it has not seen our new chat completions format. So I literally just went to the blog post from two weeks ago, copy-pasted from the blog post, including the response format. It has not seen the new image extension to that, and so I just kind of wrote that up and you know, just very minimal detail about how to include images.

So now the model can actually leverage the documentation that it did not have memorized, that it does not know. In general, these models are very good at using information that it’s been trained on in new ways and synthesizing new content.

You can see that right here that it actually wrote an entirely new bot. Now let’s actually see if this bot is going to work in practice. You should always look through the code to get a sense of what it does. Don’t run untrusted code from humans or from AIs.

One thing to note is that the Discord API has changed a lot over time, and particularly that there’s one feature that has changed a lot since this model was trained. Let’s give it a try. In fact, yes, we are missing the intents keyword. This is something that came out in 2020. So the model does know it exists, but it doesn’t know which version of the Discord API we’re using. So are we out of luck?

Well, not quite. We can just simply paste to the model exactly the error message. We’re not even going to say, “Hey, this is from running your code, could you please fix it?” We’ll just let it run. And the model says, “Oh yeah, whoops, the intents argument. Here’s the correct code.”

Now let’s give this a try once again, kind of making sure that we understand what the code is doing. Now a second issue that can come up is it doesn’t know what environment I’m running in. And if you notice, it says, “Hey, here’s this inscrutable error message,” which if you’ve not used Jupyter notebook a lot with async IO before, you probably have no idea what this means.

But fortunately, once again, you can just sort of say to the model, “Hey, I am using Jupyter and would like to make this work. Can you fix it?” And the specific problem is that there’s already an event loop running, so you need to use this Nest async IO library. You need to call Nest async IO dot apply. The model knows all of this, correctly instantiates all of these pieces into the bot. It even helps, hopefully, tells you, “Oh, you’re running in Jupyter, well you can do this bang pip install in order to install the package if you don’t already have it.” That was very helpful.

So now we’ll run and it looks like something happened. So the first thing I’ll do is go over to our Discord and I will paste in a screenshot of our Discord itself. So remember, GPT4 is not just a language model, it’s also a vision model. In fact, it can flexibly accept inputs that intersperse images and text arbitrarily, kind of like a document.

Now the image feature is in preview, so this is going to be a little sneak peek. It’s not yet publicly available, it’s something we’re working with one partner called Be My Eyes in order to really start to develop it and get it ready for prime time. But you can ask anything you like. For example, I can’t, you know, I’ll say GPT4, hello world, can you describe this image in painstaking detail?

All right, which first of all, think of how you would do this yourself. There’s a lot of different things you could latch onto, a lot of different pieces of the system you could describe. And we can go over to the actual code and we can see that yep, we, in fact, received the message, have formatted an appropriate request for our API, and now we wait.

Because one of the things we have to do is we have to make the system faster. That’s one of the things that we’re working on optimizing. In the meanwhile, I just want to say to the audience that’s watching, we’ll take an audience request next. So if you have an image and a task you’d like to accomplish, please submit that to the Discord. Our moderators will pick one that we’ll run.

So we can see that the Discord, oh, it looks like we have a response. Perfect. So it’s a screenshot of a Discord application interface. Pretty good, did not even describe it, it knows that it’s Discord. It’s probably Discord written there somewhere, or it just kind of knows this from prior experience.

It describes the server icon label GPD4, describes the interface in great detail, talks about all the people telling me that I’m supposed to do Q, a very kind audience, and describes a bunch of the notification messages and the users that are in the channel. So there you go, that’s some pretty good understanding.

Now this next one, if you notice, first of all, we got a post, but the model did not actually see the message. So is this a failure of the model or of the system around the model? Well, we can take a look. And if you notice here, content is an empty string. We received a blank message content.

The reason for this is a dirty trick that we played on the AI. So if you go to the Discord documentation and you scroll through it all the way down to the message content intent, you’ll see this was added as of September 2022 as a required field. So in order to receive a message that does not explicitly tag you, you now have to include this new intent in your code.

Remember I said intents have changed a lot over time. This is much newer than the model is possibly able to know. So maybe we’re out of luck, we have to debug this by hand. But once again, we can try to use GPD4’s language understanding capabilities to solve this.

Now keep in mind, this is a document of like I think this is like ten thousand, fifteen thousand words, something like that. It’s not formatted very well. This is literally a command, a copy-paste. Like this is what it’s supposed to parse through to find in the middle of that document that oh yeah, message contents, that’s required now. But let’s see if it can do it.

So we will ask for, “I am receiving blank message contents. Why could this be happening? How do I fix it?” So one thing that’s new about GPD4 is context length. 32,000 tokens is kind of the upper limit that we support right now and the model is able to flexibly use long documents. It’s something we’re still optimizing so we recommend trying it out but not necessarily really scaling it up just yet unless you have an application that really benefits from it.

So if you’re really interested in long context, please let us know. We want to see what kinds of applications it unlocks. But if you see, it says, “Oh yeah, message content intent was not enabled.” And so you can either ask the model to write some code for you, or you could, I actually just, you know, do it the old-fashioned way. Either way is fine.

I think this is an augmenting tool, makes you much more productive, but it’s still important that you are in the driver’s seat and are the manager and knows what’s going on. So now we’re connected once again, and Boris, would you like to rerun the message? Once again, we can see that we have received it even though the bot was not explicitly tagged.

It seems like a pretty good description. Interesting, this is an interesting image. It actually looks like it’s a Dali generated one. And let’s actually try this one as well. So what’s funny about this image? Oh, it’s already been submitted.

So once again, we can verify this, making the right API calls. Squirrels do typically eat nuts. We don’t expect them to use a camera or act like a human. So I think that’s a pretty good explanation of why that image is funny.

So I’m going to show you one more example of what you can do with this model. So I have here a nice hand-drawn mock-up of a joke website. Definitely worthy of being put up on my refrigerator. So I’m just going to take out my phone, literally take a photo of this mock-up, and then I’m going to send it to our Discord.

All right, going to send it to our Discord. And this is of course the rockiest part, making sure that we actually send it to the right channel. Which in fact, I think maybe I did not, sent it to the wrong channel. It’s funny, it’s always the non-AI parts of these demos that are the hardest part to do. And here we go, technology is now solved.

And now we wait. So the thing that’s amazing in my mind is that what’s going on here is we’re talking to a neural network. And this neural network was trained to predict what comes next. It played this game of being shown a partial document and then predicted what comes next across an unimaginably large amount of content. And from there, it learns all of these skills that you can apply in all these very flexible ways.

And so we can actually take now this output. So literally, we just said to output the HTML from that picture. And here we go, actual working JavaScript, filled in the jokes. For comparison, this was the original of our mock-up. And so there you go, going from hand-drawn, beautiful art, if I do say so myself, to a working website.

And this is all just potential. You can see lots of different applications. We ourselves are still figuring out new ways to use this. So we’re going to work with our partner, we’re going to scale up from there, but please be patient because it’s going to take us some time to really make this available for everyone.

So I have one last thing to show you. I’ve shown you reading existing content. I’ve shown you how to build with the system as a partner. The last thing I’m going to show is how to work with the system to accomplish a task that none of us like to do but we all have to.

So you may have guessed, the thing we’re going to do is taxes. Now note that GPT is not a certified tax professional, nor am I, so you should always check with your tax advisor. But it can be helpful to understand some dense content to just be able to empower yourself to solve problems and get a handle on what’s happening when you could not otherwise.

So once again, I’ll do a system message. In this case, I’m going to tell it that it’s Tax GPT, which is not a specific thing that we’ve trained into this model. You can be very creative if you want with the system message to really get the model in the mood of what is your job, what are you supposed to do.

So I pasted in the tax code. This is about 16 pages worth of tax code. And there’s this question about Allison and Bob. They got married at one point, and here are their incomes. They take a standard deduction, they’re filing jointly. So first question, what is their standard deduction for 2018?

So while the model is chugging, I’m going to solve this problem by hand to show you what’s involved. So the standard deduction is the basic standard deduction plus the additional. The basic one is 200 percent for a joint return of subparagraph C, which is here. Okay, additional doesn’t apply, the limitation doesn’t apply.

Um, okay, now these apply. Oh wait, special rules for taxable year 2018, which is the one we care about, through 2025. You have to substitute twelve thousand for three thousand. So two hundred percent of twelve thousand, twenty-four thousand is the final answer.

If you notice, the model got to the same conclusion. And you can actually read through its explanation. And to tell you the truth, the first time I tried to approach this problem myself, I could not figure it out. I spent half an hour reading through the tax code, trying to figure out this back reference and why there’s some program, just what’s even going on.

It was only by asking the model to spell out its reasoning and then I followed along that I was like, “Oh, I get it now. I understand how this works.” And so that, I think, is where the power of the system lies. It’s not perfect, but neither are you. And together, it’s this amplifying tool that lets you just reach new heights.

And you can go further. You can say, “Okay, now calculate their total liability.” And here we go, it’s doing the calculation. Honestly, I, every time it does it, it’s just, it’s amazing. This model is so good at mental math. It’s way, way better than I am at mental math. It’s not hooked up to a calculator. Like that’s another way.


AI: Summarize the presentation session above, by Greg Brockman of OpenAI. Note that correction of transcription to “GPT-4” will be needed in many places.

Output, 1908 characters:

OpenAI recently conducted a live stream demo of GPT-4, their latest language model. The company has been working on developing this technology for the past two years, focusing on training the model, testing its capabilities, and optimizing its behavior. The goal of the demo was to showcase how to effectively use GPT-4 and highlight its strengths and weaknesses. The audience was invited to participate by providing suggestions and questions through Discord. The first task demonstrated was summarizing an article into a sentence where every word begins with the letter “G.” GPT-4 successfully completed this task, unlike its predecessor, GPT-3.5. The model was also shown to be capable of combining ideas from different articles and generating rhyming poems.

The demo then shifted to building a Discord bot using GPT-4 as an AI programming assistant. The model was instructed to write pseudocode and then generate the actual code, breaking down the problem into smaller pieces. The process involved debugging and addressing issues such as changes in the Discord API. GPT-4 proved to be helpful in providing solutions and even suggested installing necessary packages. The bot successfully read images and texts, demonstrating the model’s vision capabilities. The audience was encouraged to submit image-related tasks for the model to accomplish.

Taxes were used as an example to showcase GPT-4’s ability to understand complex content. The model was given a system message as “Tax GPT” and provided with a section of the tax code. It accurately answered questions about deductions and liabilities, demonstrating its comprehension and problem-solving skills. The model’s explanations helped users understand the reasoning behind its answers. The session concluded by highlighting the power of GPT-4 as an amplifying tool that can assist users in reaching new heights and understanding complex topics.


Thanks for the helpful information. I will try these out and see if I get any better results. I appreciate it!!