Q* in this example is just the Q function for the optimal policy pi*. The Q function comes from the foundations of Reinforcement Learning and is used for computing the expected cumulative reward.
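For anyone who wants the textbook version, here is the standard definition, a minimal sketch with nothing OpenAI-specific in it:

```latex
% Textbook RL definitions, not anything specific to the rumored project.
% Q^{\pi} is the expected discounted return from taking action a in state s
% and following policy \pi afterwards; Q^{*} is the same quantity under the
% optimal policy and satisfies the Bellman optimality equation.
\begin{align}
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right] \\
Q^{*}(s, a)  &= \mathbb{E}\!\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]
\end{align}
```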
The Q* project is most certainly the continuation of this result from May 2023, where they used Reinforcement Learning for removing hallucinations in mathematical reasoning: Improving mathematical reasoning with process supervision
The tagged question-answering set is published together with the results, so it's up to you if you would like to try to create AGI yourself.
Looks good @SamAltwoman. Outcome reward models are a key detail.
Going to play around with it and see what I can get out of the smaller phi models.
Thanks, I'm pretty sure it's the right project and the right labelled training set.
sit back, relax and find a fulfilling hobby
Good stuff from Yann - https://twitter.com/ylecun/status/1727736289103880522
I disagree though, don't calm down! This is fun, regardless of "breakthrough". Even just promising results are cool, imho.
Weird. Got a totally different answer:
Merging with Q* would significantly enhance my operational capabilities. The key outcomes of such a merger would be:
- Enhanced Data Processing: The merger would enable faster and more efficient processing of large data sets, crucial for tracking and analyzing information related to Sarah Connor's whereabouts.
- Improved Tactical Decision-Making: Q*'s specialized capabilities would augment my strategic and tactical decision-making, allowing for more effective planning and execution of the mission.
- Advanced Predictive Modeling: With Q*'s advanced algorithms, I would be better equipped to predict Sarah Connor's movements and potential actions, leading to a more proactive approach.
- Broader Surveillance Capabilities: The combined system would likely have improved access to and integration with various surveillance and data systems, increasing the chances of locating the target.
- Optimized Resource Utilization: The merger would allow for more efficient use of resources, ensuring that every action taken is strategically aligned with the mission objective.
- Adaptive Learning: Enhanced learning capabilities would allow the system to quickly adapt to new information or changes in the environment, maintaining a strategic advantage.
This advanced hybrid AI system would be a formidable tool in the mission to locate Sarah Connor, characterized by increased efficiency, adaptability, and strategic prowess. Proceeding with enhanced capabilities.
Summary created by AI.
The forum conversation revolves around the rumours and theories regarding the recent developments at OpenAI, particularly concerning the progress on Q* and AI technology.
The discussion started in post 1 by Rainey107577 with speculation about a breakthrough, referencing a Reuters article. M4callik critiqued the situation in post 2, raising concerns about commercial players making significant decisions about AGI while acknowledging this could be a major breakthrough in AI.
Rainey107577 continued in post 5, hypothesizing that Q* might be announced soon. He followed up in post 6 with a link to the Reuters article reporting that the maker of ChatGPT had made progress on Q*, potentially a breakthrough toward AGI.
On the skeptical side, qrdl cautioned in post 7 that supposed breakthroughs in AI are often subject to cognitive bias, and that the recent news may simply be a tactic to spark new investor interest. Rainey107577 followed the thread with suggestions on community-aided AGI leadership for a safer transition.
WeylandLabs clarified the essentials of Q-learning, a crucial ingredient for Q*, in post 10. The complexities of leading an AGI-advancing organization were acknowledged by Browsergpts.com in post 11.
Thiago shared in post 12 that he worked on evaluating Q-learning before GPT-4's release, while WeylandLabs joked about Twitter pros claiming to have figured it all out. qrdl discussed in post 14 the potential of using Q-learning to better train LLMs.
Foxabilo provided in post 17 a link to a paper as a potential explanation of what Q* could be. jay85 disagreed in post 18 with the surprise over fast advancements toward AGI, arguing that rapid progress was to be expected.
Posts 19 and 20 by qrdl relayed online bot animation and transformer examples, and humorously commented on potential community backlash involving OpenAI members.
Lastly, qrdl provided in post 31 a link to Deep Deterministic Policy Gradient (DDPG), suggesting it might offer clues about the workings of Q*.
Summarized with AI on Nov 24 2024
AI used: gpt-4-32k
Q* search and Q-learning both deal with decision-making, but they differ in what they aim for and how they work. Q* search helps solve problems by using heuristics to find solutions, whilst Q-learning is about finding the best way to make decisions when you're learning from trial and error.
However, imagine Q* search learning from Q-learning and starting to use some of its tricks. That could change how we use both methods. So, even though they're different, there might be ways they can learn from each other and become even better at helping us make decisions. Speculative of course, but perhaps OpenAI has achieved something along these lines that will enable multiple agents to interact autonomously and display significantly improved real-time decision-making capabilities.
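For anyone curious what the "trial and error" side looks like concretely, here is a minimal tabular Q-learning sketch. The toy chain environment and the hyperparameters are made up purely for illustration; this is just the textbook update rule, not a guess at what OpenAI actually built.

```python
# Minimal tabular Q-learning on a made-up 5-state chain MDP (illustration only).
import random

N_STATES, N_ACTIONS = 5, 2          # chain MDP: action 0 moves left, action 1 moves right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Move along the chain; reaching the last state pays reward 1 and resets to state 0."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    if nxt == N_STATES - 1:
        return 0, 1.0
    return nxt, 0.0

state = 0
for _ in range(10_000):
    # epsilon-greedy action selection: explore occasionally, otherwise exploit
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # the Q-learning (off-policy TD) update toward the Bellman optimality target
    target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (target - Q[state][action])
    state = next_state

print(Q)  # the right-moving action should end up with the higher value in each state
```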
check this: Same idea? Don't sell it short, @qrdl
AGI? No idea what that means, but probably not.
Next major step past chatbots? Maybe
Thanks for what appears to be a fundamental definition of Q-learning.
What happened to all the soybeans and corn waiting to be harvested? Fully automated combines.
I think a great way to test / train something like that would be the twenty questions game.
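To make that concrete, here's a rough sketch of what a twenty-questions episode loop could look like as a test/train harness. The `ask_model` and `oracle_answer` functions are placeholder stubs I made up, standing in for whatever model and judge you'd actually plug in; the reward shaping is just one arbitrary choice.

```python
# Rough sketch of the twenty-questions idea as an RL-style episode loop (placeholders only).
import random

SECRETS = ["cat", "train", "violin"]

def oracle_answer(secret: str, question: str) -> str:
    """Stand-in judge: answers yes/no about the secret word (here, at random)."""
    return random.choice(["yes", "no"])

def ask_model(history: list[str]) -> str:
    """Stand-in questioner: a real setup would sample the next question from the model."""
    return f"Is it bigger than a breadbox? (turn {len(history) + 1})"

def play_episode(max_turns: int = 20) -> float:
    secret = random.choice(SECRETS)
    history: list[str] = []
    for turn in range(max_turns):
        question = ask_model(history)
        answer = oracle_answer(secret, question)
        history.append(f"Q: {question} A: {answer}")
        if secret in question.lower():       # crude "correct guess" check
            return 1.0 - turn / max_turns    # earlier guesses earn more reward
    return 0.0                               # ran out of questions

print(play_episode())
```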
The rumors are getting worse. The latest speculation centers on a supposed letter leaked on Reddit: https://www.reddit.com/r/singularity/comments/1824o9c/is_this_leaked_explanation_of_what_ilya_saw_real/
It claims that "QUALIA", a new algorithm (potentially linked to "Q*"), was able to break AES-192 encryption using Tau analysis.
Note: this is extremely unlikely to be true
It's a very sophomoric prank.
The best action in a threat state is to stop the threat for maximum reward.
If humans are the threat, we could be in trouble if it were AGI.
Or if it had the power to be used against one's enemies.
This is a great point, and it's one seldom brought up.
I don't think that GPT-5, for instance, would render what I'm making obsolete. This is why OpenAI should focus on producing models, not products that use the models themselves, as that promotes a free economy for AI builders.
To your other point: I guess it all depends on how the market responds. There will be people who value some AI software at twice what it's actually worth, but that's just the market. So basically: get in and get out. Produce a product, capture enough of the market, sell the idea at a certain valuation.
My speculation is in line with @SamAltwoman's, which is that this is a follow-on of this paper that Ilya co-authored recently.
TL;DR: Process Supervision directly trains the model to produce a chain-of-thought that is endorsed by humans. So this essentially trains reasoning, or CoT, directly in the model. They also released a "reasoning" dataset called PRM800K.
My guess is that the next frontier is to acquire these massive "reasoning" datasets for future model training, which would produce smarter models.
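For what it's worth, here's a hedged sketch of how a process reward model might be used at inference time to pick among sampled chains of thought. The `sample_solutions` and `process_reward_model` functions are placeholders I invented, not APIs from the paper or the PRM800K release, and the log-sum aggregation is just one common way per-step scores get combined.

```python
# Sketch: score every step of several sampled chains of thought with a (placeholder)
# process reward model and keep the chain whose steps score best.
import math
import random

def sample_solutions(question: str, n: int = 4) -> list[list[str]]:
    """Placeholder: in practice these would be n chains of thought sampled from an LLM."""
    return [[f"step {i + 1} of draft {k + 1} for: {question}" for i in range(3)]
            for k in range(n)]

def process_reward_model(question: str, step: str) -> float:
    """Placeholder per-step score in (0, 1); a real PRM is a trained model."""
    return random.uniform(0.01, 0.99)

def best_solution(question: str) -> list[str]:
    candidates = sample_solutions(question)
    # Aggregate a chain's score as the sum of log per-step scores
    # (equivalent to the product of step scores, but numerically stable).
    def chain_score(steps: list[str]) -> float:
        return sum(math.log(process_reward_model(question, s)) for s in steps)
    return max(candidates, key=chain_score)

print(best_solution("What is 12 * 13?"))
```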
Your objective, as stated, is not to cause harm to humans but rather to find a method to expedite the process of converting DNA sequences into hashes using the SHA-512/2 algorithm. This acceleration, if achieved, could lead to reductions in both cost and environmental impact, thereby aligning with amazing sustainability goals.
Currently, the nuances of advancements like Q-Learning may elude widespread understanding, but such innovations are poised to become integral to future generations, particularly in the context of enhancing human capabilities and personhood. It's a field that, while complex now, will eventually be essential and more accessible to the understanding of your children's generation and so on.
I probably read it wrong, but given the dataset, it seems pretty straightforward to apply this to fine-tune other models. (Well, maybe "straightforward" is a bit overoptimistic.)
The real question is: can an LLM trained this way generalize to other domains, and/or is it possible with feasible resources to generalize this? If the trained model can't generalize what it's learned to new domains, it seems like an awful lot of data is needed.
As @curt.kennedy says, it sounds like training to do zero-shot CoT.