I think this function calling approach is better at one-shot prompts.
In my opinion it is beneficial to switch models back and forth depending on the task at hand, and to go back to good old GUI design.
I mean, why ask “What is the weather in San Francisco, CA for tomorrow?” when you could have a simple search field where you type the first 3-4 letters, choose from a list, and press a button to show the results in a modern graphical view: rain-radar, cloud, and wind maps, etc.?
Are we really back to the days where text matters so much? No. Even though NLP tasks have their advantages (multi-language support, accessibility for color-blind users, voice-to-text, etc.), we should focus on a balance between GUI and NLP, using prompt input fields where they are appropriate, and using buttons, sliders, tree views, labels, etc. in the traditional way that worked well before.
I believe we are at a point where ChatGPT is only a halfway solution on the road to something more interactive. Once systems are ready to fully leverage speech synthesis, universal translation, etc., plugins will become most interesting, especially when Microsoft launches Windows Copilot and the ability to have plugins as services: a modular Windows system tailored to the user, private or business.
I am also building a game. My multiplayer video game (Chasm Conquerors’ Challenge) is very large, so I am going to get the players of my game to build almost all of it. (You may end up doing the same thing. Perhaps we could help each other build our respective games. I am currently using GPT-4 to create four extremely unorthodox main game design documents.)
Numbers are ALIVE! Play is learning to learn. Language is a game that you win when you convey your meaning clearly and succinctly. Ergo, working with AI is a very fluid, dynamic and organic process. Moreover, GPT-4 is a baby AI, so it stands to reason it will have teething issues. I suspect GPT-5 through GPT-7 will also be baby AIs.
Try GOING with the FLOW! Try thinking outside the box. Try utilizing extremely unorthodox methodologies. Try adapting your game to allow for GPT-4’s weaknesses and to take advantage of GPT-4’s strengths. Imagine yourself in a canoe on a very young, raging river - that is GPT-4. All you can do is try to avoid the rocks, survive any waterfalls, and make your way to shore as swiftly as you can. Yes, the documentation is of exceedingly little use, as even its creators do not fully understand GPT-4. Ergo we are all heading into totally uncharted waters - GPT-4’s creators included. Hence, far and away the best advice is:
NEVER give in! NEVER give up! NEVER count the odds!
I used to be a game dev, and believe me, it is mostly thankless and lots of iterating is required. But in the end, it is something that you should do for yourself, not for other people. You should think of it as a hill that you intend on conquering. A hill is a hill no matter how you look at it; the only thing that matters is “do you want to get to the top of it or not?”
I’d say don’t give up hope on it. It seems really cool, and when you release it, it will set the game development community’s imagination on fire. There are some really good techniques out there that you can lean on to get some good results. You’ll just have to keep on iterating. I’d say add some rule checking into the system. Simple prompts like:
"Does the character reference themselves?"
"Does the character's line match their backstory?"
I’ve outlined how I have kept an experimental sales representative bot on track here:
I’ve since learned better techniques (like the two-shot examples you mentioned), so mine are a little outdated, but the bot was good enough to talk to customers and stay on topic. Additionally, if you are worried about quality control, you could take a more guarded approach: either modify the user’s questions to conform to something that gets the results you are looking for, or generate several questions from a player query like “how did you get here?”, where the generated questions are in the style and form you need for good results.
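One way to read the “guarded approach”: rewrite the raw player query into one of a few vetted prompt forms before it ever reaches the model. This is a hypothetical sketch; the patterns and the `rewrite_query` name are mine, not from any real library.

```python
import re

# Map recognizable player queries onto vetted prompt templates.
CANON_FORMS = {
    r"how .*(get|come|arrive).*here": "Describe, in character, how you arrived at this place.",
    r"who are you": "Introduce yourself in character, in two sentences.",
}

def rewrite_query(raw: str) -> str:
    """Map a free-form player query onto a vetted prompt template."""
    cleaned = raw.lower().strip(" ?!")
    for pattern, canonical in CANON_FORMS.items():
        if re.search(pattern, cleaned):
            return canonical
    # Fall back to a constrained wrapper around the original question.
    return f"Answer briefly and in character: {raw}"

print(rewrite_query("How did you get here?"))
```

In practice you would likely have the model itself do the rewriting; the point is that only the controlled form is sent downstream.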
In gaming it’s all about giving the player the illusion of being able to do more. They don’t necessarily need to interact with the AI in a flawlessly human way; it just needs to feel that way. So if you need to make template responses that get modified and Frankensteined together, or guide users in prompt crafting, then you can do that. You’re the first one making this type of game, so you get to set the standard and people just have to deal with it. Trust me, as long as it’s fun, nobody will notice or care. You think anyone really asks why we are limited to 52 cards, 4 suits, and 2 colors in card games? Nope, that’s just how it is.
It seems there is a misconception about what the perceived issue is: we are talking about doing the same thing as before and getting different results. It’s not about never having been able to get good results in the first place.
Put differently: if I use ChatGPT-4 and copy and paste my prompts, I am doing so because my prompts deliver the expected results. If this changes and the prompts no longer produce the expected results, then that’s what makes people question the service and not themselves. A healthy attitude, by the way, but also not super helpful at times.
My experience is that every time there is a new model in the background, a rewrite of the prompts is needed, which is somewhat expected from a beta version. It also requires adapting the way you communicate during a chat conversation.
But then again, this is not what OP is describing. OP is additionally talking about getting different results from one day to the next, and this can perhaps be explained by a small sample size.
So hang in there and if you made it this far you can conquer this hill as well!
Have you created a sanity check to help filter incorrect results using ChatGPT itself? For instance, have it set the basics of the game:
The Chef committed the Crime
The maid seems suspicious
The butler gave the chef a gun
Then use these as a litmus test: re-parse any answers and reject them if they don’t follow the laid-out paradigm, by interrogating the answers themselves before they are shown. It would help control hallucination. The other way to do it is to “somewhat” pre-can a master set of results, choose from those, but have ChatGPT rewrite them with variation.
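One way this litmus test could look in code, keeping the three facts above as fixed ground truth and rejecting any generated answer that contradicts the culprit. The check shown is deliberately crude; a real version would interrogate the answer with a second model call instead of string matching.

```python
# Ground-truth facts of the mystery, fixed up front.
GROUND_TRUTH = {
    "culprit": "chef",      # The chef committed the crime
    "suspicious": "maid",   # The maid seems suspicious
    "gun_giver": "butler",  # The butler gave the chef a gun
}

def contradicts_facts(answer: str) -> bool:
    """Very rough check: flag answers that name the wrong culprit."""
    lowered = answer.lower()
    wrong_culprits = {"maid", "butler"} - {GROUND_TRUTH["culprit"]}
    return any(f"the {name} committed" in lowered for name in wrong_culprits)

print(contradicts_facts("I believe the maid committed the crime!"))   # True: reject
print(contradicts_facts("The chef committed the crime, I'm sure."))   # False: allow
```

Rejected answers can be regenerated, or replaced from the pre-canned master set and rewritten with variation.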
I’ve also been working on text adventures using GPT, and gave up on 3.5 pretty fast. GPT-4 is much more rational and coherent. 3.5 might be cheaper, but I think this is a use case where you need GPT-4 if you want joy.
Possible levers you can pull to get the desired behaviour:
Give more examples as user/assistant pairs: system message, then user, assistant, user, assistant, ..., ending with the final user message (however much more work this is for you, it helps). The more the better.
Use gpt-4 (if it’s worth the cost; gpt-3.5 is not production ready).
Limit the number of output tokens (and state this limit in the system message as well).
Obtain more than one variation of the answer using the n parameter. Have a separate evaluator GPT independently check whether the response is of the required format and tone; if not, check the next variation.
Setting a low temperature does not help: if GPT was making mistakes, it will repeat those mistakes at temperature 0. With a higher temperature (0.7) you have a better chance that one of the responses is perfect.
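The last two levers could be sketched like this: request several candidates (the `n` parameter) at a moderate temperature, then have a separate evaluator pass pick the first one that meets your requirements. Here `short_and_single_sentence` is a toy stand-in for the evaluator GPT.

```python
def pick_best(candidates, evaluate):
    """Return the first candidate the evaluator accepts, else None."""
    for text in candidates:
        if evaluate(text):
            return text
    return None

# Toy evaluator: accept answers that are one sentence and under 20 words.
# A real version would be a second, independent GPT call checking format
# and tone, as described above.
def short_and_single_sentence(text: str) -> bool:
    return text.count(".") <= 1 and len(text.split()) < 20

# Imagine these came back from one request with n=2, temperature=0.7.
candidates = [
    "Well. Let me think. That is a long story full of twists.",
    "The treasure lies beneath the old mill.",
]
print(pick_best(candidates, short_and_single_sentence))
```

If no candidate passes, you can re-request rather than show a bad answer.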
I am in the same boat as you. Everything was working great; then I popped in functions, and things went BAD. I suspect it is the functions that are creating the problem; the model itself may be fine. I am thinking of reverting to the default completions endpoint instead of the chat completions endpoint, even though the chat completions endpoint is a tenth of the price.
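For context, a minimal function-calling request looked roughly like this at the time (the weather function is OpenAI’s stock example, not from this thread). Shown as a plain payload dict so the shape is clear without making a network call; in the pre-1.0 Python SDK it would be passed to `openai.ChatCompletion.create`.

```python
payload = {
    "model": "gpt-4-0613",
    "messages": [{"role": "user", "content": "What's the weather in Boston?"}],
    "functions": [
        {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    # "auto" lets the model decide whether to call the function at all;
    # forcing a specific function via {"name": ...} is one way to debug
    # erratic behaviour after adding functions.
    "function_call": "auto",
}
print(payload["functions"][0]["name"])
```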
I like your thinking. Separating concerns is a programming principle that I believe supports higher quality responses, allows for focused evals, and helps construct precise & accurate prompt/response pairs.
Allowing GPT to explicitly reason out its answer first also tends to lead to higher quality results, in my experience.
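One lightweight way to apply “reason first, answer second”: instruct the model to think step by step and then emit its conclusion after a fixed marker, and show the player only the part after the marker. The marker and prompt wording here are arbitrary choices, not an established convention.

```python
MARKER = "FINAL ANSWER:"

SYSTEM_PROMPT = (
    "Reason through the problem step by step. "
    f"Then give your conclusion on a new line starting with '{MARKER}'."
)

def extract_final(response: str) -> str:
    """Return only the text after the marker, hiding the reasoning."""
    _, _, final = response.partition(MARKER)
    return final.strip()

raw = "The butler had no motive.\nThe chef did.\nFINAL ANSWER: The chef."
print(extract_final(raw))
```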
I’m seeing GPT shift from a conversational agent into a single-responsibility, single-purpose reasoning engine. It may lose context quickly, but its answers are much more consistent and accurate.
I wonder, does it make more sense to separate these concerns into a graph-based Jupyter-like structure? Genuinely curious. Would love some thoughts from all walks
OP touches on a very important product issue that OpenAI hopefully has its eyes open to.
They’re called foundation models for a reason: developers don’t want the foundations shaken or deprecated. If you make your living by spending months tuning prompts, you can’t be forced to transition to different foundations, especially if the new foundations no longer allow your product to perform as well.
OpenAI needs to seriously evaluate its product strategy and its promise to whoever wants to build a living on their product. I understand there are plenty of reasons to ‘expire’ models, but consider the relationship with developers and the programming model (enriching vs. expiring) going forward.