News from the AI & ML world

DeeperML - #reinforcementlearning

Megan Crouse@techrepublic.com //
Researchers from DeepSeek and Tsinghua University have recently made significant advancements in AI reasoning capabilities. By combining Reinforcement Learning with a self-reflection mechanism, they have created AI models that can achieve a deeper understanding of problems and solutions without needing external supervision. This innovative approach is setting new standards for AI development, enabling models to reason, self-correct, and explore alternative solutions more effectively. The advancements showcase that outstanding performance and efficiency don’t require secrecy.

Researchers have implemented the Chain-of-Action-Thought (COAT) approach in these enhanced AI models. This method leverages special tokens such as "continue," "reflect," and "explore" to guide the model through distinct reasoning actions. This allows the AI to navigate complex reasoning tasks in a more structured and efficient manner. The models are trained in a two-stage process.
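The token-guided control flow described above can be sketched in a few lines. The token strings and the helper function below are illustrative assumptions for clarity, not the researchers' actual implementation:

```python
# Hypothetical sketch of Chain-of-Action-Thought (COAT) reasoning traces: the
# action names mirror the "continue," "reflect," and "explore" tokens named
# above, but the token format and assembly logic here are assumptions.

CONTINUE, REFLECT, EXPLORE = "<|continue|>", "<|reflect|>", "<|explore|>"

def coat_trace(steps):
    """Assemble a reasoning trace where each step is tagged with a meta-action."""
    return "\n".join(f"{action} {thought}" for action, thought in steps)

trace = coat_trace([
    (CONTINUE, "Compute 12 * 8 = 96."),
    (REFLECT, "Check: 12 * (10 - 2) = 120 - 24 = 96. Consistent."),
    (EXPLORE, "Alternative: double 12 three times -> 24, 48, 96."),
])
print(trace)
```

During training, a model exposed to traces in this format can learn when to emit each action token itself, which is what lets it switch between continuing, double-checking, and branching mid-solution.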

DeepSeek has also released work expanding on reinforcement learning for LLM alignment. Building on prior work, they introduce Rejective Fine-Tuning (RFT) and Self-Principled Critique Tuning (SPCT). In RFT, a pre-trained model produces multiple responses to each prompt; these are evaluated and assigned reward scores based on generated principles, and the model is then refined on the responses that score well. SPCT uses reinforcement learning to improve the model's ability to generate critiques and principles without human intervention, creating a feedback loop in which the model learns to self-evaluate and improve its reasoning capabilities.
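The selection step at the heart of rejective fine-tuning can be sketched simply: sample candidate responses, score each one, and keep only those that clear a reward threshold. The reward function below is a toy stand-in, not DeepSeek's actual principle-based critic:

```python
# Illustrative sketch of the rejective fine-tuning (RFT) filtering step.
# A real pipeline would score candidates with a learned reward model guided
# by generated principles; the lambda below is a placeholder heuristic.

def rejective_filter(candidates, reward_fn, threshold=0.5):
    """Keep only candidate responses whose reward meets the threshold."""
    return [c for c in candidates if reward_fn(c) >= threshold]

# Toy reward: prefer responses that justify their answer.
reward = lambda resp: 1.0 if "because" in resp else 0.0

kept = rejective_filter(
    ["The answer is 4.", "The answer is 4 because 2 + 2 = 4."],
    reward,
)
print(kept)  # only the justified answer survives the filter
```

The surviving responses then become the fine-tuning data, so the model is repeatedly pulled toward its own best-scoring outputs.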

Recommended read:
References:
  • hlfshell: DeepSeek released another cool paper expanding on reinforcement learning for LLM alignment. Building off of their prior work (which I talk about here), they introduce two new methods.
  • www.techrepublic.com: Researchers from DeepSeek and Tsinghua University say combining two techniques improves the answers the large language model creates with computer reasoning techniques.

@www.analyticsvidhya.com //
Google's DeepMind has achieved a significant breakthrough in artificial intelligence with its Dreamer AI system. The AI has successfully mastered the complex task of mining diamonds in Minecraft without any explicit human instruction. This feat, accomplished through trial-and-error reinforcement learning, demonstrates the AI's ability to self-improve and generalize knowledge from one scenario to another, mimicking human-like learning processes. The achievement is particularly noteworthy because Minecraft's randomly generated worlds present a unique challenge, requiring the AI to adapt and understand its environment rather than relying on memorized strategies.

Mining diamonds in Minecraft is a complex, multi-step process that typically requires players to gather resources to build tools, dig to specific depths, and avoid hazards like lava. The Dreamer AI system tackled this challenge by exploring the game environment and identifying actions that would lead to rewards, such as finding diamonds. By repeating successful actions and avoiding less productive ones, the AI quickly learned to navigate the game and achieve its goal. According to Jeff Clune, a computer scientist at the University of British Columbia, this represents a major step forward for the field of AI.

The Dreamer AI system, developed by Danijar Hafner, Jurgis Pasukonis, Timothy Lillicrap, and Jimmy Ba, achieved expert status in Minecraft in just nine days, showcasing its rapid learning capabilities. One unique approach used during training was to restart the game with a new virtual universe every 30 minutes, forcing the algorithm to constantly adapt and improve. This method allowed the AI to quickly master the game's mechanics and develop strategies for diamond mining without any prior training or human intervention, pushing the boundaries of what AI can achieve in dynamic and complex environments.
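The reinforcement principle at work here, repeat rewarded actions, abandon unrewarded ones, even as the world periodically resets, can be shown with a deliberately tiny tabular example. Dreamer itself learns a world model and plans by "imagining" outcomes, so this sketch illustrates only the trial-and-error reward loop, not Dreamer's architecture:

```python
import random

# A minimal reward-driven trial-and-error loop with periodic environment
# resets (echoing the 30-minute world restarts described above). The agent
# tracks an estimated value per action and follows an epsilon-greedy policy.

random.seed(0)

def run(steps=2000, actions=4, reset_every=500):
    values = [0.0] * actions                  # estimated value of each action
    best_action = random.randrange(actions)   # which action the "world" rewards
    for t in range(steps):
        if t % reset_every == 0:              # new world: the rewarding action moves
            best_action = random.randrange(actions)
            values = [0.0] * actions
        if random.random() < 0.1:             # explore 10% of the time
            a = random.randrange(actions)
        else:                                 # otherwise exploit the best estimate
            a = max(range(actions), key=lambda i: values[i])
        reward = 1.0 if a == best_action else 0.0
        values[a] += 0.1 * (reward - values[a])   # incremental value update
    return values, best_action

values, best = run()
print(best, [round(v, 2) for v in values])
```

Even with the world changing under it, the agent rediscovers the rewarding action within each segment, which is the same adapt-rather-than-memorize pressure the randomized Minecraft worlds impose on Dreamer.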

Recommended read:
References:
  • techxplore.com: Google's AI Dreamer learns how to self-improve over time by mastering Minecraft
  • Analytics Vidhya: What if I told you that AI can now outperform humans in some of the most complex video games? AI now masters Minecraft too.
  • eWEEK: The new Dreamer AI system figured out how to conduct the multi-step process of mining diamonds without being taught how to play Minecraft.
  • www.scientificamerican.com: The Dreamer AI system of Google's DeepMind reached the milestone of mastering Minecraft by ‘imagining’ the future impact of possible decisions

@Communications of the ACM //
Andrew G. Barto and Richard S. Sutton have been awarded the 2024 ACM A.M. Turing Award for their foundational work in reinforcement learning (RL). The ACM recognized Barto and Sutton for developing the conceptual and algorithmic foundations of reinforcement learning, one of the most important approaches for creating intelligent systems. The researchers took principles from psychology and transformed them into a mathematical framework now used across AI applications. Their 1998 textbook "Reinforcement Learning: An Introduction" has become a cornerstone of the field, cited more than 75,000 times.

Their work, beginning in the 1980s, has enabled machines to learn independently through reward signals. This technology later enabled achievements like AlphaGo and today's large reasoning models (LRMs). Combining RL with deep learning has led to major advances, from AlphaGo defeating Lee Sedol to ChatGPT's training through human feedback. Their algorithms are used in various areas such as game playing, robotics, chip design and online advertising.
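The learning-from-reward-signals idea can be made concrete with the temporal-difference (TD) update, a cornerstone of the framework Sutton and Barto developed. This is the standard tabular TD(0) rule, shown here as a minimal sketch rather than any particular production system:

```python
# TD(0) value update: nudge the value of a state toward the observed reward
# plus the discounted value of the successor state.

def td_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """V(s) += alpha * (reward + gamma * V(s') - V(s))"""
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])
    return V

V = {"A": 0.0, "B": 1.0}
td_update(V, "A", reward=0.5, s_next="B")
print(V["A"])  # 0.1 * (0.5 + 0.9 * 1.0 - 0.0) -> approximately 0.14
```

Repeated across many experienced transitions, this single update rule lets value estimates propagate backward from rewards, the mechanism underlying applications from game playing to advertising mentioned above.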

Recommended read:
References:
  • Communications of the ACM: Barto, Sutton Announced as ACM 2024 A.M. Turing Award Recipients
  • THE DECODER: Algorithms from the 1980s power today's AI breakthroughs, earn Turing Award for researchers
  • SecureWorld News: Trailblazers in AI: Barto and Sutton Win 2024 Turing Award for Reinforcement Learning
  • Robohub: Andrew Barto and Richard Sutton win 2024 Turing Award
  • TheSequence: Some of the pioneers in reinforcement learning received the top award in computer science.
  • mastodon.acm.org: ACM Recognizes Barto and Sutton for Developing Conceptual, Algorithmic Foundations of Reinforcement Learning

Ryan Daws@AI News //
Alibaba's Qwen team has launched QwQ-32B, a 32-billion parameter AI model, designed to rival the performance of much larger models like DeepSeek-R1, which has 671 billion parameters. This new model highlights the effectiveness of scaling Reinforcement Learning (RL) on robust foundation models. QwQ-32B leverages continuous RL scaling to demonstrate significant improvements in areas like mathematical reasoning and coding proficiency.

The Qwen team successfully integrated agent capabilities into the reasoning model, allowing it to think critically, use tools, and adapt its reasoning based on environmental feedback. The model has been evaluated across a range of benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. QwQ-32B is available as open-weight on Hugging Face and on ModelScope under an Apache 2.0 license, allowing for both commercial and research uses.

Recommended read:
References:
  • AI News | VentureBeat: Alibaba's new open source model QwQ-32B matches DeepSeek-R1 with way smaller compute requirements
  • Analytics Vidhya: In the world of large language models (LLMs) there is an assumption that larger models inherently perform better. Qwen has recently introduced its latest model, QwQ-32B, positioning it as a direct competitor to the massive DeepSeek-R1 despite having significantly fewer parameters.
  • AI News: The Qwen team at Alibaba has unveiled QwQ-32B, a 32 billion parameter AI model that demonstrates performance rivalling the much larger DeepSeek-R1. This breakthrough highlights the potential of scaling Reinforcement Learning (RL) on robust foundation models.
  • www.infoworld.com: Alibaba Cloud on Thursday launched QwQ-32B, a compact reasoning model built on its latest large language model (LLM), Qwen2.5-32b, one it says delivers performance comparable to other large cutting edge models, including Chinese rival DeepSeek and OpenAI’s o1, with only 32 billion parameters.
  • THE DECODER: Alibaba's latest AI model demonstrates how reinforcement learning can create efficient systems that match the capabilities of much larger models.
  • bdtechtalks.com: Alibaba’s QwQ-32B reasoning model matches DeepSeek-R1, outperforms OpenAI o1-mini
  • Last Week in AI: Alibaba’s New QwQ 32B Model is as Good as DeepSeek-R1
  • Last Week in AI: LWiAI Podcast #202 - Qwen-32B, Anthropic's $3.5 billion, LLM Cognitive Behaviors

Ryan Daws@AI News //
Alibaba's Qwen team has introduced QwQ-32B, a 32 billion parameter AI model that rivals the performance of the much larger DeepSeek-R1. This achievement showcases the potential of scaling Reinforcement Learning (RL) on robust foundation models. The Qwen team has successfully integrated agent capabilities into the reasoning model, enabling it to think critically and utilize tools. This highlights that scaled reinforcement learning can lead to significant advancements in AI performance without necessarily requiring immense computational resources.

QwQ-32B demonstrates that RL scaling can dramatically enhance model intelligence without requiring massive parameter counts. It leverages RL through a reward-based, multi-stage training process, enabling deeper reasoning capabilities typically associated with much larger models. Its performance, comparable to DeepSeek-R1, underscores the potential of RL to bridge the gap between model size and capability.
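A reward-based, multi-stage setup of the kind described above can be sketched as follows. The split into a verifiable task reward followed by a blended general reward is an assumption for illustration; the functions and weighting below are simplified stand-ins, not Qwen's actual training components:

```python
# Hedged sketch of staged, reward-based RL signals: an early stage scores
# outcomes that can be checked mechanically (math answers, code tests), and
# a later stage blends in a general preference score.

def math_reward(model_answer: str, reference: str) -> float:
    """Stage-1 math reward: 1.0 for an exact-match final answer, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(tests_passed: int, total_tests: int) -> float:
    """Stage-1 code reward: fraction of unit tests the generated code passes."""
    return tests_passed / total_tests if total_tests else 0.0

def combined_reward(task_reward: float, preference_score: float,
                    weight: float = 0.8) -> float:
    """Stage-2 blend: verifiable task reward plus a general reward score."""
    return weight * task_reward + (1 - weight) * preference_score

print(combined_reward(math_reward("96", "96"), preference_score=0.5))  # ~0.9
```

The appeal of verifiable rewards is that they need no human labeling, which is what makes this style of RL cheap to scale on top of a strong base model.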

Recommended read:
References:
  • AI News | VentureBeat: Alibaba’s new open source model QwQ-32B matches DeepSeek-R1 with way smaller compute requirements
  • MarkTechPost: Qwen Releases QwQ-32B: A 32B Reasoning Model that Achieves Significantly Enhanced Performance in Downstream Task
  • Analytics Vidhya: Discussion on Qwen Chat, noting QwQ-32B’s capabilities.
  • AI News: Alibaba Qwen QwQ-32B: Scaled reinforcement learning showcase
  • Simon Willison's Weblog: QwQ-32B: Embracing the Power of Reinforcement Learning
  • Analytics Vidhya: QwQ-32B Vs DeepSeek-R1: Can a 32B Model Challenge a 671B Parameter Model?
  • MarkTechPost: Alibaba Released Babel: An Open Multilingual Large Language Model LLM Serving Over 90% of Global Speakers
  • www.infoworld.com: Alibaba says its new AI model rivals DeepSeek's R1, OpenAI's o1
  • IEEE Spectrum: QwQ, DeepSeek-R1 32B, and Sky-T1-R had the highest overthinking scores, and they weren’t any more successful at resolving tasks than nonreasoning models.
  • THE DECODER: Alibaba's QwQ-32B is an efficient reasoning model that rivals much larger AI systems
  • eWEEK: Alibaba unveils QwQ-32B, an AI model rivaling OpenAI and DeepSeek with 98% lower compute costs. A game-changer in AI efficiency, boosting Alibaba’s market position.
  • SiliconANGLE: Alibaba shares jump on new open-source QwQ-32B reasoning model
  • Last Week in AI: Alibaba’s New QwQ 32B Model is as Good as DeepSeek-R1 , Judge Denies Musk’s Request to Block OpenAI’s For-Profit Plan, Alexa Plus’ AI upgrades cost $19.99, and more!
  • Last Week in AI: Alibaba released Qwen-32B, Anthropic raised $3.5 billion,DeepMind introduced BigBench Extra Hard, and more!
  • bdtechtalks.com: Alibaba's QwQ-32B is a new large reasoning model (LRM) with high performance on key benchmarks, improved efficiency and open-source access.
  • Groq: With a community of over one million developers who build FAST, Groq can’t help but want to keep up.
  • Maginative: Information on how Alibaba's Latest AI Model, QwQ-32B, Beats Larger Rivals in Math and Reasoning
  • Analytics Vidhya: Small Model with Huge Potential.
  • Towards AI: Performance Analysis Between QWQ-32B and DeepSeek-R1 and How to Run QWQ-32B Locally on Your Machine