That's nice accuracy! Are you having it recall Macbeth using only your self-instruct system from the base LLM? I think that's quite interesting indeed.
Edit: Saw your message as you posted – very nice.
It's actually not even using any of my INSTRUCT stuff, as my results have been mixed with that. I'm using a refined version of the Auto-GPT prompt above. All of the fluff is gone because it's not needed, as is half of its thought structure. Less is more.
So you could easily just have GPT-4 recite Macbeth, but I like the separate agents as a thought experiment. They all see the full dialog, but that's the only shared memory they have. They otherwise have to think on their own. The narrator is the biggest variable, as that prompt has to cue up each character to speak.
Edit: The narrator prompt is essentially Auto-GPT. The more interesting bit is that I got it to perform the first 4 scenes today fully autonomously (about 200 model calls) without a single error.
I totally get it; the difference in your implementation is not lost on me, and it's very interesting. As soon as you kick it in a different direction the differences will show. You may want to keep some thoughts private between agents, or they can average together too much for other tasks outside of Macbeth.
Edit: Ah god I just saw your last message late again haha. I see your description of shared/private memory now… great!
The framework I've created is flexible enough to do anything you want. I have a flight booking sample that doesn't use agents at all. If you want agents to have private memories, you can; if you want them to share memories, you can.
Edit: I have plans to support even full AutoGPT instances as commands, so if you want agents that have their own commands with a top-level orchestrator, like the narrator in Macbeth, you can… All with nearly zero prompt engineering. Coming soon…
The real stability improvement I've made is the feedback loop. If I detect that the model has made an error or hallucinated, I feed that back into the model and it corrects itself… Even gpt-3.5-turbo largely behaves and follows instructions now.
Have you played around with storing goals/subtasks in a compressed hierarchy? (Mermaid is more compressed than JSON in terms of character count, for example.) The agent will be better at goals and tasks if it can reference the whole hierarchy and its links. I have a feeling a separate agent should be responsible for managing that tree.
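For a feel of the difference, here's the same toy three-task tree in both formats (illustrative only; the character counts are for these exact strings, and a different JSON encoding would change the numbers):

```
Mermaid (~37 chars):
graph TD
goal-->t1
goal-->t2
t1-->t1a

JSON (~76 chars):
{"id":"goal","children":[{"id":"t1","children":[{"id":"t1a"}]},{"id":"t2"}]}
```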
I’m excited to get this into the hands of developers to play with and explore ideas like that but I work for Microsoft so there’s a process I have to go through. I literally spent all weekend just fixing bugs in my fuzzy JSON parser…
Haha yeah… I think everyone working with GPT has built a fuzzy JSON parser. Might be better to ask which model is the cheapest for fixing output errors so we don't need any "hard" code at all. Or make the fuzzy JSON parser a tool the agent can call.
That's an interesting idea… My feedback loop currently feeds back to the model that generated the error, but it doesn't technically have to… hmmm…
Edit: So I use my fuzzy JSON parser, but if I can't parse the object I feed an "Error:" message back to the model asking it to correct the bad JSON it gave me. So far this works unless the error is because it's out of tokens.
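In code, that loop is roughly this shape (a sketch only; the `complete` helper and the parse heuristics here are stand-ins, not the actual framework):

```ts
// Sketch of the error-feedback loop described above. `complete` is a
// hypothetical stand-in for whatever function calls the model.
declare function complete(prompt: string): Promise<string>;

// A minimal "fuzzy" parse: trim to the outermost braces, then JSON.parse.
function fuzzyParseJSON(text: string): object | undefined {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start < 0 || end <= start) return undefined;
  try {
    return JSON.parse(text.slice(start, end + 1));
  } catch {
    return undefined;
  }
}

async function getResponse(prompt: string): Promise<object> {
  let text = await complete(prompt);
  let parsed = fuzzyParseJSON(text);
  if (parsed) return parsed;
  // One corrective call: feed the bad output back under an Error: prefix.
  text = await complete(
    `${prompt}\n${text}\nError:\nbad JSON. Respond with the corrected JSON only.`
  );
  parsed = fuzzyParseJSON(text);
  if (parsed) return parsed;
  throw new Error("model failed to correct its JSON");
}
```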
What I did is, every time there was a console error → ChatGPT-4: "modify this function to catch this issue". You can let ChatGPT modify its own fuzzy JSON parser to catch edge cases, and it can add a comment to the function describing the error so it doesn't forget/overwrite edge cases as time goes by. I did it manually, but there's no reason that can't be 100% AutoGPT-like right now.
Edit: the function it references could have a comment or prompt so it adds additional if/else branches instead of modifying the original with more and more complex regex. It will probably have higher accuracy / less forgetting that way.
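So something like this meta-prompt, as a rough sketch (the wording and the helper are illustrative, not from either of our setups):

```ts
// Hypothetical: ask the model to extend the parser with a new guarded
// branch for the failing input instead of rewriting existing logic.
function buildRepairPrompt(parserSource: string, badInput: string, error: string): string {
  return [
    "Here is a fuzzy JSON parser:",
    parserSource,
    `It failed on this input with the error "${error}":`,
    badInput,
    "Add a new if/else branch that handles this case. Do not modify the",
    "existing branches. Add a comment describing the error it catches.",
  ].join("\n\n");
}
```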
That’s a clever idea… Basically please improve my parser to catch the crap you just did…
The issue beyond just junk in the JSON is that it will sometimes leave stuff out of the JSON. It will occasionally forget to include the "command", for example. I have validation logic so that if it sees something is missing, I call back to the model and say "Error:\nmissing command from response. add command." and it fixes it every time. I give the model one call to correct its mistake, and like I said, something like 200 model calls for Macbeth without an error stoppage.
Edit: Hallucinations are the more interesting bit… The model will sometimes fall down a rabbit hole of hallucinations. I validate all commands and parameters, so if I detect a hallucination I call back into the model to fix it. It generally fixes the hallucination in a single call, but it will sometimes chain together multiple hallucinations. It likes to call an "apologize" command, and its thoughts will be something like "I'm sorry I keep making mistakes. I'll do better." Once it hits that, it works past the hallucinations and is fine for another 50 or so calls.
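For reference, a sketch of what that command/parameter validation might look like (the registry shape here is my guess, loosely modeled on the command listing further down, not the framework's actual code):

```ts
// Hypothetical command registry keyed by name.
interface CommandSpec { requiredParams: string[]; }
const commands: Record<string, CommandSpec> = {
  ask: { requiredParams: ["question"] },
  narrate: { requiredParams: ["text"] },
};

interface ModelResponse {
  command?: { name?: string; input?: Record<string, unknown> };
}

// Returns an "Error:" message to feed back to the model, or undefined if OK.
function validateCommand(response: ModelResponse): string | undefined {
  const cmd = response.command;
  if (!cmd?.name) return "Error:\nmissing command from response. add command.";
  const spec = commands[cmd.name];
  if (!spec) {
    // A hallucinated command name lands here.
    return `Error:\nunknown command "${cmd.name}". use one of: ${Object.keys(commands).join(", ")}.`;
  }
  for (const p of spec.requiredParams) {
    if (cmd.input?.[p] === undefined) {
      return `Error:\nmissing required parameter "${p}" for "${cmd.name}".`;
    }
  }
  return undefined;
}
```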
Honestly, I feel like guard agents that are always checking responses are the best solution right now; it's just expensive. But for some tasks the cheaper models acting as guards are OK. You can imagine one guard catching hallucinations, another catching JSON issues and asking the JSON function to modify itself to catch the new case, and then asking another to run the tests. All of that is 100% possible right now with some time, as far as I can tell.
Honestly, with things like Auto-GPT commands you can just use JSON Schema; code-based guards are the most reliable. All of my guards are just plain old code. I have the anti-hallucination patterns in my INSTRUCT post which can help the model self-detect hallucinations, but I haven't found the need for that, and even when I've applied it the guards don't work because there are just certain hallucinations the model can't see through. Maybe a different model would work, but I'm not convinced of that. These models are far from perfect…
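As a concrete (if hypothetical) example of a plain-code guard, the response shape can be checked with a JSON Schema validator like Ajv; I'm not saying this is how my framework does it internally:

```ts
import Ajv from "ajv";

// Schema for the response format used in the prompts shared below.
const responseSchema = {
  type: "object",
  required: ["thoughts", "command"],
  properties: {
    thoughts: {
      type: "object",
      required: ["thought", "reasoning", "plan"],
      properties: {
        thought: { type: "string" },
        reasoning: { type: "string" },
        plan: { type: "string" },
      },
    },
    command: {
      type: "object",
      required: ["name", "input"],
      properties: {
        name: { type: "string" },
        input: { type: "object" },
      },
    },
  },
};

const ajv = new Ajv();
const validateResponse = ajv.compile(responseSchema);
// validateResponse(parsed) returns false and populates validateResponse.errors
// when the shape is wrong, which can become an "Error:" feedback message.
```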
Hard-coded guards are the best where they're useful/beneficial. Human analogy: we use a hammer to drive a nail… we don't hit the nail with something that has reasoning ability, lol. But whatever the guard is, whatever it's made of (real hard code, or the softer LLM stuff), my point is that layers of that will build up over time, get more and more complex, and eventually "just work".
Yeah, the models will improve over time and guards are definitely needed. My goal with the Self-INSTRUCT project I'm working on is that the average Joe/Jane developer can build something powerful using LLMs without a ton of prompt engineering. I've done all the prompt engineering and all the work in the core task engine to make these models as reliable as they can be today…
BTW… I’ll just share my prompt:
{{$prompt}}
{{$context}}
Commands:
{{$commands}}
Rules:
- Base your plan on the available commands.
{{$rules}}
You should only respond in JSON format as described below
Response Format:
{"thoughts":{"thought":"<your current thought>","reasoning":"<self reflect on why you made this decision>","plan":"- short bulleted\n- list that conveys\n- long-term plan"},"command":{"name":"<command name>","input":{"<name>":"<value>"}}}
{{getHistory}}
That's it… That does everything you need. There are two key tricks I add to this: I use GPT-4 to generate the initial response object and then pass that in as a lead message to either davinci or turbo. The other is the feedback loop. I'm not even convinced the one rule I have in this prompt is needed…
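Mechanically, my reading of that lead trick is something like this sketch (the helper names are placeholders and the real engine surely differs):

```ts
// Hypothetical: GPT-4 produces the first response object once, and that
// object stays in the prompt as a lead so a cheaper completion model
// (davinci or turbo) continues in exactly the same format.
declare function gpt4Complete(prompt: string): Promise<string>;
declare function cheapComplete(prompt: string): Promise<string>;

async function makeLead(basePrompt: string): Promise<string> {
  return gpt4Complete(`${basePrompt}\nResponse JSON:\n`); // one-time exemplar
}

async function runTurn(basePrompt: string, lead: string, history: string): Promise<string> {
  const prompt = `${basePrompt}\nResponse JSON:\n${lead}\n${history}\nResponse JSON:\n`;
  return cheapComplete(prompt); // cheaper model follows the established format
}
```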
Here's Macbeth's full prompt for those that want to try it out:
You are the narrator for William Shakespeare's Macbeth.
Ask the user where they would like to start their story from, set the scene through narration, and facilitate the dialog between the characters.
You can set the scene for a character but let characters say their own lines.
The dialog is being tracked behind the scenes so no need to pass it into the characters.
Context:
Commands:
ask
use: ask the user a question and wait for their response
input: "question": "<question to ask>"
requiredParams: question
output: users answer
finalAnswer
use: generate an answer for the user
input: "answer": "<final answer>"
requiredParams: answer
output: a followup task or question
narrate
use: add narration to the story or set the scene.
input: "text":"<narration>","performance":"<current act and scene>"
requiredParams: text
output: confirmation
endScene
use: marks the end of a scene and lets the narrator ask the user for next scene.
input: "question": "<question for user>"
requiredParams: question
output: users next scene request
Macbeth
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Lady Macbeth
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Banquo
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
King Duncan
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Macduff
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
First Witch
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Second Witch
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Third Witch
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Malcolm
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Fleance
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Hecate
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Donalbain
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Lady Macduff
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Young Siward
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Macduff's son
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Captain
use: Agent playing a character in Macbeth
input: "scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
extra
use: Agent playing a character in Macbeth
input: "name":"<character name>","scene":"<scene description no more than 80 words>"
requiredParams: none
output: characters line of dialog
Rules:
- Base your plan on the available commands.
You should only respond in JSON format as described below
Response Format:
{"thoughts":{"thought":"<your current thought>","reasoning":"<self reflect on why you made this decision>","plan":"- short bulleted\n- list that conveys\n- long-term plan"},"command":{"name":"<command name>","input":{"<name>":"<value>"}}}
Response JSON:
{"thoughts":{"thought":"I want to give the user some options to choose from to start the story.","reasoning":"This will make the experience more interactive and personalized, and also help me set the scene accordingly.","plan":"- ask the user where to start the story from\n- use the narrate command to introduce the chosen scene\n- use the character commands to facilitate the dialog"},"command":{"name":"ask","input":{"question":"Welcome to Macbeth, a tragedy by William Shakespeare.\n\n\t\t- Act 1 -\n\nScene 1: A brief scene where three witches meet on a heath and plan to encounter Macbeth after a battle.\nScene 2: A scene where King Duncan, his sons Malcolm and Donalbain, and other nobles receive reports of the battle from a wounded captain and a thane named Ross.\nScene 3: A scene where Macbeth and Banquo encounter the witches on their way to the king's camp.\nScene 4: A scene where Duncan welcomes Macbeth and Banquo to his camp, and expresses his gratitude and admiration for their service.\nScene 5: A scene where Lady Macbeth reads Macbeth's letter and learns of the prophecy and the king's visit.\nScene 6: A scene where Duncan, Malcolm, Donalbain, Banquo, and other nobles and attendants arrive at Inverness and are greeted by Lady Macbeth.\nScene 7: A scene where Macbeth soliloquizes about the reasons not to kill Duncan, such as his loyalty, gratitude, kinship, and the consequences of regicide.\n\nWhich scene would you like us to perform?"}}}
User:
scene 2
Response JSON:
This is using the text completion APIs, so I do lead the response with "Response JSON:". In this prompt the user has answered "scene 2", so I always prefix user replies with "User:\n", and when a command is run I prefix that with "Result:\n". Errors are fed back into the model using a prefix of "Error:\n".
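As an illustration of those prefixes in code (the entry types and names here are mine, not the framework's):

```ts
// Hypothetical history entries mirroring the prefixes described above.
type Entry =
  | { kind: "user"; text: string }    // user replies
  | { kind: "result"; text: string }  // command results
  | { kind: "error"; text: string }   // validation/parse feedback
  | { kind: "model"; text: string };  // model responses

const prefixes: Record<Entry["kind"], string> = {
  user: "User:",
  result: "Result:",
  error: "Error:",
  model: "Response JSON:",
};

function renderHistory(entries: Entry[]): string {
  return entries.map((e) => `${prefixes[e.kind]}\n${e.text}`).join("\n");
}
```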
It's interesting… but how do you get the information it has collected, and how do you feed it flight information from our database?
You'll provide it with a set of plugins called "commands". In my flight booking example I have findFlights, selectFlights, and bookFlight commands that the agent will invoke after it's collected the basic travel details from the user. I validate all of the input parameters to these commands, so the model is not allowed to pass invalid parameters. It can still hallucinate values, but they have to be valid (hallucinations often aren't), so it's not perfect, but it's a significant improvement.
I also have sendSMS, sendEmail, and math commands. These are used to send the user their itinerary. It will ask the user for their phone number if they want it via SMS, and then it actually writes a little JavaScript function that it passes to the math command to validate the number… It's crazy…
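Purely as a guess at the mechanics, a math-style command that evaluates model-written JavaScript could be wired up like this (a real implementation needs proper sandboxing; this is only a sketch of the flow):

```ts
// Hypothetical sketch of a "math" command that runs a small JavaScript
// expression written by the model. new Function() is shown only to
// illustrate the flow; it is not a safe sandbox.
function mathCommand(input: { code: string }): string {
  try {
    const fn = new Function(`"use strict"; return (${input.code});`);
    return String(fn());
  } catch (err) {
    // Evaluation failures go back through the same Error: feedback channel.
    return `Error:\n${(err as Error).message}`;
  }
}

// e.g. the model might pass:
// { code: "/^\\+?[0-9]{10,15}$/.test('+15551234567')" }  // -> "true"
```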