I will add that I’ve tried everything I can think of to eliminate these hallucinated values, without much success. There are just certain hallucinations the model can’t see past… As an example, without proper guidance the model will hallucinate that the user’s departure date is today. It picks that up from the fact that I pass the current date into the prompt. You can remove the date from the prompt and the hallucination stops, but then you break things like relative dates: “I’d like to leave next Monday and return that Friday.” I was able to correct for this by giving the model a rigid script to follow that forces it to ask the user for their departure and return dates, but this rigid script has its own tradeoffs…
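To give a flavor of what that rigid script looks like, here’s a stripped-down sketch (the wording is illustrative, not my production prompt):

```python
# Illustrative only; not an exact production prompt. The idea is to force the
# model through a fixed question sequence instead of letting it infer the dates.
# {today} gets filled in at request time.
RIGID_SCRIPT = """
You are a travel assistant. Today's date is {today}. Use it ONLY to resolve
relative dates the user gives you ("next Monday", "that Friday").

Follow this script exactly, one step at a time:
1. Ask the user for their departure date. Never assume it is today.
2. Ask the user for their return date.
3. Repeat both dates back and ask the user to confirm them.
4. Only after the user confirms both dates may you continue with the search.
Never fill in a date the user has not explicitly stated or confirmed.
"""
```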
Nothing about this is easy and this stuff barely works… I’d say it’s all very promising at this point.
Don’t know why it took me so long to stumble onto this thread. Fantastic!
Question: Do you think we could use this to have gpt create python code to efficiently replicate expensive special-purpose gpt uses?
For example, my web search plugin (llmsearch on github) uses gpt-3.5-turbo to do the final step of information extraction from a web page. Since it can grab 20-30 pages in response to a search request, that is expensive and slow, even with the gpt calls
So, idea - use your framework to iteratively construct and refine python code that matches the output of gpt-3.5 on a specific narrow task.
Starting on this right now; I’ll let you know how it turns out. I imagine it will need a starting hint (initial python code), we’ll see.
The way I’d structure that: one command that generates your initial code, a second that tests it, and a third that refines it. Your test can use a page you previously extracted, and you tell the model to keep looping over the test and refine commands until the test passes. If a test fails, be very explicit about what went wrong: the exception thrown or the wrong data returned. The model needs that feedback passed into the refine command.
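A minimal sketch of that loop, assuming a placeholder call_llm wrapper and a page you’ve already extracted by hand as the test fixture (the names here are made up, not from llmsearch):

```python
from typing import Optional

def call_llm(prompt: str) -> str:
    """Placeholder: wire this to whatever chat-completion call you're using."""
    raise NotImplementedError

GENERATE_PROMPT = (
    "Write a Python function extract(html: str) -> dict that pulls the key "
    "facts out of the page. Return only the code."
)

def run_test(code: str, fixture_html: str, expected: dict) -> Optional[str]:
    """Run the candidate extractor against a previously extracted page.
    Returns None on success, otherwise a description of what went wrong."""
    scope: dict = {}
    try:
        exec(code, scope)                      # load the candidate code
        result = scope["extract"](fixture_html)
    except Exception as e:                     # exception thrown -> feed it back
        return f"exception: {type(e).__name__}: {e}"
    if result != expected:                     # wrong data -> feed it back
        return f"wrong data returned: {result!r}, expected {expected!r}"
    return None

def generate_test_refine(fixture_html: str, expected: dict, max_rounds: int = 5) -> Optional[str]:
    code = call_llm(GENERATE_PROMPT)                       # command 1: generate
    for _ in range(max_rounds):
        failure = run_test(code, fixture_html, expected)   # command 2: test
        if failure is None:
            return code                                    # test passed
        code = call_llm(                                   # command 3: refine
            "The code below failed its test.\n"
            f"--- code ---\n{code}\n"
            f"--- what went wrong ---\n{failure}\n"
            "Return a corrected version of the full code and nothing else."
        )
    return None
```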
If your main prompt has any problems post it here and I’ll try to help. Good luck!
AutoGPT is the biggest hype machine I’ve ever seen; I am far more productive manually prompting GPT-4 than letting this thing flail around until it hits a death loop.
Yeah, their prompt doesn’t work very well. If it completes a task, it pretty much lucked its way there. My AutoGPT-style prompt works way better; in fact I’d say it’s the opposite with mine. It’s pretty reliable at completing whatever tasks you throw at it, but every once in a while it gets stuck and can’t complete its task. I’m working on that…
When I looked at AutoGPT, I knew it wasn’t right, although I didn’t have the skill to say why or do much about it. I knew that AI automation was BIG. Your approach is precisely what resonates with me. It makes sense, it is more understandable, and it delivers the promise of AGI automation in a practical way.
Kudos to Toran though… The real breakthrough he made is the chain-of-thought his prompt establishes across multiple turns. The inner monologue the prompt sets up works, but it’s only one piece of the puzzle. Proper command selection is important, how you describe the commands to the model is important, and the constructive feedback you give the model when it makes a mistake is probably the most important.
But the key to programming successfully with GPT-4 is not only that loop. Just as important, you need to break the current task down into the smallest possible subtask, so the model can make the smallest change possible and you can test it as soon as possible before moving on. Done this way, the accuracy is very high. There is some square law or something where, as interdependency complexity increases, accuracy falls off.
Sometimes you need to think creatively, and most of the work is in finding how to break a task down into a small enough chunk. In theory all of this already works; the only limit now is time / cost.
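As a sketch of the shape of that loop (the function names are placeholders; how you apply the model’s edits depends entirely on your setup):

```python
import json
import subprocess

def call_llm(prompt: str) -> str:
    """Placeholder for your chat-completion call."""
    raise NotImplementedError

def apply_change(change: str) -> None:
    """Placeholder: however you write the model's edit into your codebase."""
    raise NotImplementedError

def tests_pass() -> bool:
    """Test as soon as possible: any failing test blocks the next subtask."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def run_task(task: str, max_fix_attempts: int = 3) -> None:
    # Ask for the smallest possible, independently testable subtasks up front.
    subtasks = json.loads(call_llm(
        "Break this task into the smallest possible, independently testable "
        f"subtasks. Return a JSON array of strings.\nTask: {task}"
    ))
    for subtask in subtasks:
        apply_change(call_llm(f"Make the smallest code change that accomplishes: {subtask}"))
        attempts = 0
        while not tests_pass():   # don't move on until this subtask is green
            attempts += 1
            if attempts > max_fix_attempts:
                raise RuntimeError(f"stuck on subtask: {subtask}")
            apply_change(call_llm(f"The tests failed after: {subtask}. Fix the last change."))
```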
PS:
It is good to instruct the model to add simple input/output logging (print or console.log) for each loop iteration and remove it afterwards; feeding that output back into the debug portion is powerful. There are other tricks of the trade that will probably develop. One fun thing I played with is column selection (e.g. in VS Code, Selection > Column Selection Mode) as a compressed tradeoff that expresses structure over fidelity. Folding and unfolding code automatically will probably also be important, and so on and so forth.
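For the logging piece, something as simple as running the candidate code in a subprocess and capturing everything it prints gives you the text to paste back into the debug/refine prompt (a sketch, not tied to any particular framework):

```python
import subprocess
import sys
import tempfile

def run_and_capture(code: str, timeout: int = 30) -> str:
    """Run a candidate script and collect its temporary print()-style logging,
    so the output can be fed straight back into the refine/debug prompt."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=timeout,
    )
    return f"--- stdout ---\n{proc.stdout}\n--- stderr ---\n{proc.stderr}"
```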
Yeah, I’m a long-time AI guy, and I’m as skeptical of current “AGI” frameworks as you are. But for this application I have a decent starting point for the code, and it doesn’t need a huge improvement to be good enough to replace gpt-3.5.
btw @stevenic - Roadblock - I’m working in a linux/python environment, so I’m trying to decide how to use/adapt/rewrite self-instruct.
I’m also very skeptical of all these autonomous AI agents. OpenAI already tried something similar and found that GPT-4 was ineffective at various tasks needed to self-sustain.
For anyone interested, it’s section 2.9 of the technical report:
ARC found that the versions of GPT-4 it evaluated were ineffective at the autonomous replication task based on preliminary experiments they conducted. These experiments were conducted on a model without any additional task-specific fine-tuning, and fine-tuning for task-specific behavior could lead to a difference in performance. As a next step, ARC will need to conduct experiments that (a) involve the final version of the deployed model (b) involve ARC doing its own fine-tuning, before a reliable judgement of the risky emergent capabilities of GPT-4-launch can be made.
These AI agents do seem useful for extracting information from the model, but for breaking into things to do nefarious stuff (which involves higher-level interactions), maybe not.
Which section are you referring to? 2.9 talks about ARC, which found that the versions of GPT-4 it evaluated were ineffective at the autonomous replication task in preliminary experiments, but I don’t think that’s necessarily related.
Also, their experiments seemed fairly shallow as far as I could tell.
Yeah, exactly. I really didn’t see anything in the report to indicate that agents are unlikely.
That said, I believe the folks who are making these things are going about it the wrong way. They should solve some narrow and compelling use case first, and then slowly broaden it out / generalize to other use cases.
My use case for creating my own agent, “CurtGPT”, was to extract information from the model and use it (through embeddings) as the basis for future queries about that information. That’s compelling to me!
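Roughly the shape of that: embed whatever the agent extracts, then answer later questions by similarity. A sketch with a placeholder embed() call, nothing tied to a particular provider:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in your embedding model call."""
    raise NotImplementedError

class Memory:
    """Store what the agent extracted, query it later by cosine similarity."""
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(embed(text))

    def query(self, question: str, k: int = 3) -> list[str]:
        q = embed(question)
        scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                  for v in self.vectors]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.texts[i] for i in top]
```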
That sounds good. Did you narrow the agent controller logic specifically around that use case?
I actually tried creating a generic agent quite a long time ago and found it could far too easily go off the rails, so I stopped. I now believe it’s possible to do (at least for me), but you need to ensure your control-flow logic is written carefully for whatever it is doing.
There might be ways to generalize, but I haven’t yet figured out how best to do that and haven’t seen anything that can convincingly do so.
Yeah, I basically state an objective and an initial task. Then it iterates through the first round of tasks and tries to satisfy the objective. This is how it stays on task.
I have an initial set of code here, but have since upgraded the prompts to be more reliable, added multiple layers of task searching (2-deep instead of 1-deep), ditched the dumpTask nonsense, and done some general cleanup. Working on the embedding version with AWS soon, and finally full cloud deployment using Lambda functions so I can parallelize the hell out of it!
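Stripped way down, the loop is roughly this shape (a simplified sketch, not the actual code):

```python
from collections import deque

def call_llm(prompt: str) -> str:
    """Placeholder for your chat-completion call."""
    raise NotImplementedError

def run_agent(objective: str, initial_task: str, max_steps: int = 25) -> list[str]:
    """State an objective, seed one task, expand results into follow-up tasks."""
    tasks = deque([initial_task])
    results: list[str] = []
    steps = 0
    while tasks and steps < max_steps:
        task = tasks.popleft()
        result = call_llm(
            f"Objective: {objective}\nTask: {task}\n"
            "Complete the task in service of the objective."
        )
        results.append(result)
        follow_ups = call_llm(
            f"Objective: {objective}\nJust completed: {task}\nResult: {result}\n"
            "List any new tasks still needed to satisfy the objective, one per "
            "line, or reply DONE."
        )
        if follow_ups.strip() != "DONE":
            tasks.extend(t.strip() for t in follow_ups.splitlines() if t.strip())
        steps += 1
    return results
```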
I’ve been playing around with the prompts in CurtGPT as well: removed the “you are” part of everything, phrased it in the imperative, and used “delineate” instead of “expound” on the initial run of the main objective.
Found it slightly more reliable with the “you are” part removed. It also seems to pull in more of the basic knowledge needed for later expounding when using “delineate” on the initial objective.