First, welcome to the forum; we're glad to have you and appreciate your original contribution here.
I read through your paper and while it is certainly intriguing, it falls somewhat short of compelling for me due to the lack of quantifiable metrics.
What I would love to see with this is a proposed benchmark by which you could compare your approach to others: some headline numbers that demonstrate the value of your contribution.
In any event, after a quick read I think you’ve got something interesting here.
Thanks for the suggestion on metrics; I’m planning for that in the next iteration of the paper.
BTW, finding the right metrics to compare with other work is a bit less obvious than usual, as the task covered by DeepLLM is new, but any ideas on what benchmarks might be relevant are welcome.
In the meantime, trying it out at deepllm.streamlit.app on your favorite technical topic might help you find out quickly whether the generated model collects salient information about it.
For sure! That’s the beauty of being among the first movers, though: you get to choose/design the benchmark.
One thing you might consider is to identify what you think your absolute best use cases are, come up with maybe 30–50 exemplars, then compare your results against base GPT-4. Ideally you’d include some particularly tough tasks so that your approach doesn’t score 100% and you and others have somewhere to go with it.
One would expect base GPT-4 to perform poorly, but that’s the point, right? You want to demonstrate how your approach surfaces new abilities. So you might start there, then maybe look at things like AutoGPT and AgentGPT if you think they’re comparable, or possibly some other prompting techniques.
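To make that concrete, here’s a rough sketch of the kind of harness I have in mind; `baseline_answer`, `deepllm_answer`, and `is_correct` are just placeholders you’d wire up to your own pipeline and grader, not anything from your codebase.

```python
# Minimal sketch of a head-to-head benchmark over a set of exemplars.
# All three callables below are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Exemplar:
    prompt: str
    reference: str  # expected answer or grading rubric

def baseline_answer(prompt: str) -> str:
    """Placeholder: query base GPT-4 (or any baseline) with a plain prompt."""
    raise NotImplementedError

def deepllm_answer(prompt: str) -> str:
    """Placeholder: query the DeepLLM-guided pipeline on the same prompt."""
    raise NotImplementedError

def is_correct(answer: str, reference: str) -> bool:
    """Placeholder grader: exact match, rubric check, or LLM-as-judge."""
    raise NotImplementedError

def run_benchmark(exemplars: list[Exemplar]) -> None:
    base_hits = deep_hits = 0
    for ex in exemplars:
        base_hits += is_correct(baseline_answer(ex.prompt), ex.reference)
        deep_hits += is_correct(deepllm_answer(ex.prompt), ex.reference)
    n = len(exemplars)
    print(f"baseline accuracy: {base_hits / n:.1%}")
    print(f"DeepLLM accuracy:  {deep_hits / n:.1%}")
```

The same loop works whether the grader is exact match, a rubric, or a judge model, so you can start simple and tighten it later.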
You’re right when you say it’s not easy, but you presumably settled on this approach with an unsolved problem in mind, so I would start there.
If I’ve got any extra time this week I’ll try to give your paper a second read and let you know if anything springs to mind.
I just deployed an update to deepllm.streamlit.app that generates, for each dialog thread, a set of SVO relations connecting the generated concepts. The app also lets you visualize them as an interactive pyvis graph.
That allows exploring, at deeper recursion levels, scientific concepts, consequence predictions, causal explanations, or step-by-step guidance toward a goal.
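For anyone who wants to build this kind of graph themselves, here is a minimal sketch of the pyvis side; the SVO triples below are made-up placeholders, not actual DeepLLM output.

```python
# Render (subject, verb, object) relations as an interactive pyvis graph.

from pyvis.network import Network

# hypothetical example triples
svo_triples = [
    ("neural network", "learns", "representation"),
    ("representation", "supports", "generalization"),
]

net = Network(directed=True)
for subj, verb, obj in svo_triples:
    net.add_node(subj, label=subj)
    net.add_node(obj, label=obj)
    net.add_edge(subj, obj, label=verb)  # edge labeled with the verb

net.save_graph("svo_graph.html")  # open the HTML file in a browser
```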
Well, that’s at zero cost, assuming you have an RTX 3090 or 4090 GPU humming in your PC and helping to keep your office warm.
I’m always excited when I see you’ve shared something because I can be confident it’s thoughtful, well-implemented, useful, and—most importantly—interesting!
I think you’ve got something really special here and I look forward to seeing where it all ends up!
Here’s a metric suggestion: given a complex code function, can your approach keep GPT on track so it doesn’t regress when asked to add new functionality to that function? You can build unit tests and simply run them. Hopefully base GPT can be shown to cause more tests to fail when adding new functionality than your guided version does.
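Something along these lines, where the test module and the surrounding workflow are hypothetical placeholders rather than anything specific to your setup:

```python
# Run the same unit-test suite against the function as rewritten by base GPT
# and by the guided version, and compare failure counts.

import unittest

def count_failures(test_module) -> int:
    """Run a unittest module and return the number of failing/erroring tests."""
    suite = unittest.TestLoader().loadTestsFromModule(test_module)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return len(result.failures) + len(result.errors)

# Hypothetical workflow:
# 1. write base GPT's edited function into the target module, import the
#    test module, and call count_failures on it
# 2. repeat with the DeepLLM-guided edit
# 3. the headline number is the gap between the two failure counts
```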
DeepQA is a DeepLLM-based application that explores recursively the “mind-stream” of an LLM via a tree of self-generated follow-up questions.
You can fetch it from GitHub or try out its Streamlit app online.
Once started from your initiator question on a topic of your choice, it explores its tree of follow-up questions. You can import the generated Definite Clause Grammar as part of a Prolog program. It symbolically replicates the equivalent of the “mind-stream” extracted from the LLM interaction, with possible uses of the encapsulated knowledge in Logic Programming applications.
The synthesized grammar is designed to generate a finite language (by carefully detecting follow-up questions that would induce loops). We also ensure that paths in the question-answer tree are free of repeated answers; repeats are collected as well, together with open questions left unanswered when the user-set depth limit is reached.
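To give a feel for the mechanism, here is a toy sketch of the loop-avoidance idea; it is an illustration under simplified assumptions, not the actual DeepLLM code.

```python
# Expand follow-up questions depth-first, skip questions already seen so the
# tree cannot loop, and collect repeats and still-open questions separately.

def explore(question, ask, follow_ups, max_depth, depth=0,
            seen=None, repeats=None, open_questions=None):
    """ask(q) -> answer; follow_ups(q, a) -> list of new questions."""
    seen = set() if seen is None else seen
    repeats = [] if repeats is None else repeats
    open_questions = [] if open_questions is None else open_questions

    if depth >= max_depth:
        open_questions.append(question)   # left unanswered at the depth limit
        return seen, repeats, open_questions
    if question in seen:
        repeats.append(question)          # would induce a loop: collect, don't expand
        return seen, repeats, open_questions

    seen.add(question)
    answer = ask(question)
    for q in follow_ups(question, answer):
        explore(q, ask, follow_ups, max_depth, depth + 1,
                seen, repeats, open_questions)
    return seen, repeats, open_questions
```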
You can also use DeepQA to quickly assess the strengths and limitations of an LLM before committing to it. For instance, when used with a local LLM much weaker than GPTx (Vicuna 7B is enabled by default), you will see shorter, more out-of-focus results, with a lot of repeated questions and answers collected by DeepQA in the corresponding bins. By contrast, results with the latest GPT-turbo look mildly superhuman.
Would you be willing to share, time permitting, some examples of this in action?
It would be great if you could post here a few outputs from each of the models you’ve tested, so others who may not have ready access or the ability to run local models can see how performance differs among them.
I’ll reiterate from my previous comment, too: it would be amazing if you could distill performance down to a few numeric metrics so you’d be able to compare it at a glance.
It would also give users (and you) a baseline from which to test and evaluate iterations on your methods.
DocDiver is a new DeepLLM app that reviews or summarizes technical papers. It also lets you chat with it or, in self-driving mode, watch the LLM dive deep into the document’s content. At the end, a graph of relations is extracted in the form of JSON or Prolog code that can also be explored visually as an interactive pyvis graph.