I’ve been working on this project on and off (in my free time) for 2 months now and I’m starting to feel like its just been a waste and the project is a non-starter.
I spend ages crafting my prompts, tweaking them to get the exact desired behaviour, looking at examples in the documentation etc, I think I’ve got it right and then I’ll come back a few days or weeks later and its behaving in a completely different way.
My project is a murder mystery game where the AI plays all the witnesses and suspects and the player is the detective.
I put in a bunch of 2 shot prompts for each character along with their personality and backstory and the setting of the game. I was getting pretty good responses from the characters but I upgraded to gpt-3.5-turbo-0613 for the functions and now the characters are not adhering to the 2 shots as closely, their responses are longer and more formal/stiff and one character in particular seems to be confused about who she is. eg. She is Octavia and she gave a response like “The garden party that Octavia hosted was meant to be a delightful and enjoyable event. It was held in the beautiful gardens of the manor house, with guests from the village and beyond in attendance. The atmosphere was festive, with lively conversations, laughter, and the tinkling of glasses.”
Obviously that’s not correct. In an earlier conversation, she said that she made some herbal medicine and was an expert in poisons - but that’s not true, that’s her daughter.
I did manage to get it so that it was correctly requesting function parameters from the user but today its gone back to just making up the answers (while sometimes also asking the user but its already triggered the function call with its made up answers)
Other functions where I’ve requested a response in particular format, one day they’re working, I come back to it and it does give the format requested.
They also get details just factually wrong even though they’ve been given the correct information in the prompt.
I just don’t know how to overcome these problems and I don’t know how other people are seemingly making commercially viable products when the responses can’t be relied on.
There are multiple services offering AI characters for games etc but that was more expensive and less flexible than developing my own solution and I don’t see how they’re not having the same problems
When you upgrade to a new model you may need to expect changes.
Why did you even upgrade?
Hey there @hazel1! I feel your frustration. I’ve spent a lot of time on a similar use case.
First, try not using the 0613 revision of the
gpt-3.5-turbo model. Try using the following parameters for your request in the chat completions API.
"temperature" : 0.0,
"n" : 1,
"top_p" : 0.0,
"frequency_penalty" : 0.0,
"presence_penalty" : 0.0
Max tokens can also be a big issue if you’re expecting shorter, less verbose responses. Please provide more examples of your request/response/expected/actual in this thread and I’ll try to help you more.
As I said in my post, I upgraded for the functions. Those (when they work), solved a different problem that I was wrestling with .
But my issues are not entirely confined to the upgrade - I upgraded as soon as it was available. But the inconsistencies have been a problem all along. You think you have the prompt perfected and it seems to work and its doing what you tell it and then when you come back to it, it doesn’t.
I was using a max_tokens of 256 and that had been enough until this week and then it started providing longer responses and the request comes back with a finish_reason of “length” so I get an incomplete response.
You’re using a temperature of 0? I’ve been using 0.7, for the chat responses (I use 0 on completions which I’m using to evaluate statements and 0.5 for “search warrants” and “forensic tests”).
I don’t want it to be completely rigid and just follow my script to the letter. I want to to be free to alter the response to fit the actual question and the context. With the new model, even at 0.7, I’m finding the responses to not feel as fluid and chatty and use repetitive phrasing
You should try upgrading to
gpt-3.5-turbo-16k, which does support functions.
For functions, I think the 0 temperature is necessary for now. Otherwise, I would go with what feels right for your responses.
The character that has functions attached to him has a temperature of 0.5 I supposed I could try lowering it (I wont know if what the user is asking for will trigger a function though before I send the response), so what the user asks for might result in a function call or it might just be another chat
Changing back to gpt-3.5-turbo or to 16k would only help for he next 4 days at which point gpt-3.5-turbo becomes gpt-3.5-turbo-0613
I really don’t think the 0613 snapshot is ready for production. I wouldn’t rely on that deadline. They’ll make sure all the bugs are sorted.
By the way, if you do have any questions at all on this use case or if you need some prompt advice, I’ve got a huge collection of prompts for chat completions for this use case that I can help you with. I’ve spent a lot of time on this one!
Aside from adding extra items, the completions work exactly as I want (and they’re not on gpt anyway so not affected by the update I don’t think). Its just so frustrating when you think you’ve found the perfect phrasing to get it to do what you want and only what you want, you test it several times and its perfect every time, then the next day when you get back to work, it’s not working. Yesterday I sent 30 requests and only 2 came back with additional list items instead of the requested JSON. Today 24 requests and only 3 came back correct.
I was really excited when I read that the updated version would be paying stronger attention to the system message but so far its been disappointing
I agree with OP, it just isn’t usable and I’m starting to doubt it’ll ever be. GPT4 works for now, but it’s very expensive and I’m starting to believe that it will suffer the same fate as GPT3.5.
It doesn’t matter what settings or prompt you use, when you attempt to do anything remotely creative, it will NOT follow the prompt and output laughably verbose, empty and formal text instead. I’ve tried so many settings, so many prompts, it literally doesn’t matter, it’s like the model is hard-coded to follow the style that OAI has instructed it to follow, and it’s also much less intelligent, so it will not understand context, which is required when you’re using it to help with translations and writing like I am.
I actually felt that they had improved both the new turbo and the 16k models a couple of days ago and they worked better for a while, but today they’re back to exhibiting the exact same behavior they did on launch, that’s proof that the problem isn’t the prompts or the settings, it’s on OpenAI’s side.
I really want to love their service but it’s impossible for me now, we’re making steps back instead of forward.
I’m just doing a few tasks on GPT4 for now but the money adds up fast. I’m kind of gutted that we’re just supposed to accept an inferior model and that’s all there is.
I found that it takes more efford in prompting to get same results.
Maybe you need some external help.
That doesn’t explain why a prompt works perfectly one day and then not the next
Does it work right multiple times on one day and then no more on the next day?
Yes, hence my frustration. I tweak and test and think it’s perfect but when I come back to it some time later, it no longer functions as it did. Earlier in the week it was perfectly requesting suspect, means and motive for the “arrest warrant” function. Today, even though that prompt hasn’t been touched, its making up the function parameters instead of requesting them from the user. I’m even using the exact phrasing used in the cookbook examples
Also, as I’m running the game as a chat bot, once the AI makes a mistake, that enters the conversation context and it tends to repeat the mistake
and then you continue testing, or just stop?
If you want consistency you need to set temperature. Do you mind to show your params?
Also if you could give an example prompt we could work on that.
Have you tried using Langchain? Sounds like you’re just trying to do all this natively and that’s a big complicated project.
That problem can be solved with conversation management between the chatbot and the gpt api, if you have that option. alphawave-py / promptrix-py has a ‘repair’ mechanism and conversation management that removes bad replies from the conversation history. Might not apply to your case.