News from the AI & ML world

DeeperML

Ben Dickson@AI News | VentureBeat //
DeepSeek, a Chinese AI company, has achieved a breakthrough in AI reward modeling that promises to enhance the reasoning and responsiveness of AI systems. Collaborating with Tsinghua University researchers, DeepSeek developed a technique called "Inference-Time Scaling for Generalist Reward Modeling," demonstrating improved performance compared to existing methods and competitive results against established public reward models. This innovation aims to improve how AI systems learn from human preferences, a key factor in developing more useful and aligned artificial intelligence.

DeepSeek's new approach involves a dual method combining Generative Reward Modeling (GRM) and Self-Principled Critique Tuning (SPCT). GRM provides flexibility in handling various input types and enables scaling during inference time, offering a richer representation of rewards through language compared to previous scalar approaches. SPCT, a learning method, fosters scalable reward-generation behaviors in GRMs through online reinforcement learning. One of the paper's authors explained that this combination allows principles to be generated based on the input query and responses, adaptively aligning the reward generation process.

The SPCT technique addresses challenges in creating generalist reward models capable of handling broader tasks. These challenges include input flexibility, accuracy, inference-time scalability, and learning scalable behaviors. By creating self-guiding critiques, SPCT promises more scalable intelligence for enterprise LLMs, particularly in open-ended tasks and domains where current models struggle. DeepSeek has also released models like DeepSeek-V3 and DeepSeek-R1, which have achieved performance close to, and sometimes exceeding, leading proprietary models while using fewer training resources. These advancements signal that cutting-edge AI is not solely the domain of closed labs and highlight the importance of efficient model architecture, training algorithms, and hardware integration.
Original img attribution: https://venturebeat.com/wp-content/uploads/2025/04/deepseek-reward-model.webp?w=1024?w=1200&strip=all
ImgSrc: venturebeat.com

Share: bluesky twitterx--v2 facebook--v1 threads


References :
Classification: