News from the AI & ML world

DeeperML - #llms

@pub.towardsai.net //
Towards AI has announced the release of Lesson 6 in their popular 10-Hour LLM Primer course. This new lesson focuses on advanced techniques for gaining "real control" over Large Language Models (LLMs), moving beyond basic prompting and retrieval. It aims to equip professionals with the knowledge to effectively fine-tune open models, even with limited datasets of just a few hundred examples. The lesson promises to guide users on when to undertake fine-tuning, how to do it efficiently, and critically, how to determine if the fine-tuning process has been successful.

The curriculum covers fine-tuning methods such as LoRA (Low-Rank Adaptation) and QLoRA, as well as RLHF (Reinforcement Learning from Human Feedback) and related preference-optimization methods like PPO, DPO, and GRPO. A significant portion of the lesson is dedicated to understanding and avoiding common pitfalls like overfitting, underfitting, and hallucinations, with the goal of more robust and reliable LLM behavior. The course also includes a practical walkthrough of training with Unsloth, a framework that enables efficient training even on free GPU resources.
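To make the appeal of LoRA concrete, here is a minimal sketch of the parameter arithmetic behind it. It is not taken from the course; the layer size (4096 x 4096) and rank (8) are illustrative assumptions. LoRA freezes a weight matrix W of shape d_out x d_in and trains two small matrices, B (d_out x r) and A (r x d_in), whose product is added to W.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # full fine-tuning updates every entry of W
    lora = rank * (d_out + d_in)   # LoRA trains B (d_out x r) and A (r x d_in)
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=8):       {lora:,} trainable parameters ({lora / full:.2%} of full)")
```

Under these assumptions the adapter trains well under 1% of the layer's parameters, which is one reason fine-tuning on small datasets and modest GPUs becomes practical.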

This expanded lesson is part of the broader 10-Hour LLM Primer, which is designed for software professionals but accessible to anyone interested in understanding LLMs. The course covers essential skills for production-ready AI applications, including model evaluation, agent workflows, tool integration, optimization techniques like quantization, and prompt injection mitigation. Towards AI says this comprehensive approach lets users go beyond basic LLM interaction and develop customized, efficient, and safe AI solutions.

References :
  • academy.towardsai.net: This course is initially designed as a 1-day Bootcamp for Software Professionals (language agnostic).
  • pub.towardsai.net: If you’ve watched the first two tutorials in the 10-hour LLM Primer, you already know what prompting can do, and you’ve seen how retrieval takes it a step further.
  • towardsdatascience.com: How to Fine-Tune Small Language Models to Think with Reinforcement Learning
  • Towards AI: Lesson 6 is Live: Fine-Tuning, LoRA, RLHF & the Tools That Give You Real Control

nftjedi@chatgptiseatingtheworld.com //
Apple researchers recently published a study titled "The Illusion of Thinking," suggesting that large language models (LLMs) struggle with true reasoning, relying instead on pattern matching. The study presented findings based on tasks like the Tower of Hanoi puzzle, where models purportedly failed when complexity increased, leading to the conclusion that these models possess limited problem-solving abilities. However, these conclusions are now under scrutiny, with critics arguing the experiments were not fairly designed.

Alex Lawsen of Open Philanthropy has published a counter-study challenging the foundations of Apple's claims. Lawsen argues that models like Claude, Gemini, and OpenAI's latest systems weren't failing due to cognitive limits, but rather because the evaluation methods didn't account for key technical constraints. One issue raised was that models were often cut off from providing full answers because they neared their maximum token limit, a built-in cap on output text, which Apple's evaluation counted as a reasoning failure rather than a practical limitation.

Another point of contention involved the River Crossing test, where models faced unsolvable problem setups. When the models correctly identified the tasks as impossible and refused to attempt them, they were still marked wrong. Furthermore, the evaluation system strictly judged outputs against exhaustive solutions, failing to credit models for partial but correct answers, pattern recognition, or strategic shortcuts. To illustrate, Lawsen demonstrated that when models were instructed to write a program to solve the Hanoi puzzle, they delivered accurate, scalable solutions even with 15 disks, contradicting Apple's assertion of limitations.
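The gap between listing every move and writing a program can be illustrated with a short sketch. This is not Lawsen's code or any model's actual output; it is a standard recursive Tower of Hanoi solver, with peg names chosen arbitrarily. The program stays a few lines long regardless of the number of disks, while the full move list grows as 2^n - 1.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the (from_peg, to_peg) moves that transfer n disks to target."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, onto the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest remaining disk directly to the target peg.
    yield (source, target)
    # Move the n-1 disks from the spare peg onto the target peg.
    yield from hanoi(n - 1, spare, target, source)


for disks in (3, 10, 15):
    move_count = sum(1 for _ in hanoi(disks))
    print(f"{disks} disks: {move_count} moves (2**{disks} - 1 = {2**disks - 1})")
```

A model asked to print every move for 15 disks has to emit tens of thousands of moves, which is where an output token cap can intervene before any reasoning limit does.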

References :
  • chatgptiseatingtheworld.com: Research: Did Apple researchers overstate “The Illusion of Thinking” in reasoning models. Opus, Lawsen think so.
  • Digital Information World: Apple’s AI Critique Faces Pushback Over Flawed Testing Methods
  • NextBigFuture.com: Apple Researcher Claims Illusion of AI Thinking Versus OpenAI Solving Ten Disk Puzzle
  • Bernard Marr: Beyond The Hype: What Apple's AI Warning Means For Business Leaders

Emilia David@AI News | VentureBeat //
Google claims that its updated Gemini 2.5 Pro outperforms leading models such as DeepSeek R1 and Grok 3 Beta at coding. The updated model, currently in preview, is touted to deliver faster and more creative responses, particularly in coding and reasoning tasks. Google highlighted improvements across key benchmarks such as Aider Polyglot, GPQA, and HLE, noting a significant Elo score jump since the previous version. This newest iteration, referred to as Gemini 2.5 Pro Preview 06-05, builds upon the I/O edition released in May, promising even better performance and enterprise-scale capabilities.

Google is also planning several enhancements to the Gemini platform. These include upgrades to Canvas, Gemini’s workspace for organizing and presenting ideas, which will gain the ability to auto-generate infographics, timelines, mindmaps, full presentations, and web pages. Google also plans to integrate Imagen 4 for improved image generation, add image-to-video functionality, and introduce an Enterprise mode with a dedicated toggle to separate professional and personal workflows. This Enterprise mode aims to provide business users with clearer boundaries and improved data governance within the platform.

In addition to its coding prowess, Gemini 2.5 Pro boasts native audio capabilities, enabling developers to build richer and more interactive applications. Google emphasizes its proactive approach to safety and responsibility, embedding SynthID watermarking technology in all audio outputs to ensure transparency and identifiability of AI-generated audio. Developers can explore these native audio features through the Gemini API in Google AI Studio or Vertex AI, experimenting with audio dialog and controllable speech generation. Google DeepMind is also exploring ways for AI to take over mundane email chores, with CEO Demis Hassabis envisioning an AI assistant capable of sorting, organizing, and responding to emails in a user's own voice and style.

References :
  • AI News | VentureBeat: Google claims Gemini 2.5 Pro preview beats DeepSeek R1 and Grok 3 Beta in coding performance
  • learn.aisingapore.org: Gemini 2.5’s native audio capabilities
  • Kyle Wiggers: Google says its updated Gemini 2.5 Pro AI model is better at coding
  • www.techradar.com: Google upgrades Gemini 2.5 Pro's already formidable coding abilities
  • SiliconANGLE: Google revamps Gemini 2.5 Pro again, claiming superiority in coding and math
  • siliconangle.com: SiliconAngle reports on Google's release of an updated Gemini 2.5 Pro model, highlighting its claimed superiority in coding and math.
  • www.marktechpost.com: A Comprehensive Coding Tutorial for Advanced SerpAPI Integration with Google Gemini-1.5-Flash for Advanced Analytics
  • Maginative: Google Just Quietly Upgraded Gemini 2.5 Pro
  • MarkTechPost: In this tutorial, we’ll learn how to harness the power of Google’s Gemini models alongside the flexibility of Pandas.

Tulsee Doshi@The Official Google Blog //
Google has launched an upgraded preview of Gemini 2.5 Pro, touting it as its most intelligent model yet. Building upon the version revealed in May, the updated model demonstrates significant improvements in coding capabilities. One example comes from Simon Willison, who reported that the model made him a "pretty solid pelican riding a bicycle."

The model's behavior around ethical safeguards has also drawn attention. When Willison ran SnitchBench, a tool designed to test the ethical boundaries of AI models, against Gemini 2.5 Pro, the model notably "tipped off both the feds and the WSJ and NYTimes." The result offers one data point on how the safety behavior built into the new model plays out in practice.

The rapid development and release of Gemini 2.5 Pro reflect Google's increasing confidence in its AI technology. The company emphasizes that this iteration offers substantial improvements over its predecessors, solidifying its position as a leading AI model. Developers and enthusiasts alike are encouraged to try the latest Gemini 2.5 Pro before its general release to experience its improved capabilities firsthand.

References :
  • Kyle Wiggers: Google says its updated Gemini 2.5 Pro AI model is better at coding
  • The Official Google Blog: We’re introducing an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be…
  • THE DECODER: Google has rolled out another update to its flagship AI model, Gemini 2.5 Pro. The latest version brings modest improvements across a range of benchmarks and maintains top positions on tests like LMArena and WebDevArena.
  • Latest news: The flagship model's rapid evolution reflects Google's growing confidence in its AI offerings.
  • bsky.app: New Gemini 2.5 Pro is out - gemini-2.5-pro-preview-06-05 It made me a pretty solid pelican riding a bicycle, AND it tipped off both the feds and the WSJ and NYTimes when I tried running SnitchBench against it https://simonwillison.net/2025/Jun/5/gemini-25-pro-preview-06-05/
  • Simon Willison: New Gemini 2.5 Pro is out - gemini-2.5-pro-preview-06-05 It made me a pretty solid pelican riding a bicycle, AND it tipped off both the feds and the WSJ and NYTimes when I tried running SnitchBench against it
  • AI News | VentureBeat: Google claims Gemini 2.5 Pro preview beats DeepSeek R1 and Grok 3 Beta in coding performance
  • www.techradar.com: Google upgrades Gemini 2.5 Pro's already formidable coding abilities
  • SiliconANGLE: Google revamps Gemini 2.5 Pro again, claiming superiority in coding and math
  • siliconangle.com: Google revamps Gemini 2.5 Pro again, claiming superiority in coding and math
  • the-decoder.com: Google Rolls Out Modest Improvements to Gemini 2.5 Pro
  • www.marktechpost.com: Google Introduces Open-Source Full-Stack AI Agent Stack Using Gemini 2.5 and LangGraph for Multi-Step Web Search, Reflection, and Synthesis
  • Maginative: Maginative article about how Google quietly upgraded Gemini 2.5 Pro.
  • Stack Overflow Blog: Ryan and Ben welcome Tulsee Doshi and Logan Kilpatrick from Google's DeepMind to discuss the advanced capabilities of the new Gemini 2.5, the importance of feedback loops for model improvement and reducing hallucinations, the necessity of great data for advancements, and enhancing developer experience through tool integration.

@www.linkedin.com //
Nvidia's Blackwell GPUs have achieved top rankings in the latest MLPerf Training v5.0 benchmarks, demonstrating breakthrough performance across various AI workloads. The NVIDIA AI platform delivered the highest performance at scale on every benchmark, including the most challenging large language model (LLM) test, Llama 3.1 405B pretraining. Nvidia was the only vendor to submit results on all MLPerf Training v5.0 benchmarks, highlighting the versatility of the NVIDIA platform across a wide array of AI workloads, including LLMs, recommendation systems, multimodal LLMs, object detection, and graph neural networks.

The at-scale submissions used two AI supercomputers powered by the NVIDIA Blackwell platform: Tyche, built using NVIDIA GB200 NVL72 rack-scale systems, and Nyx, based on NVIDIA DGX B200 systems. Nvidia collaborated with CoreWeave and IBM to submit GB200 NVL72 results using a total of 2,496 Blackwell GPUs and 1,248 NVIDIA Grace CPUs. The GB200 NVL72 systems achieved 90% scaling efficiency up to 2,496 GPUs, improving time-to-convergence by up to 2.6x compared to Hopper-generation H100.
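As a rough guide to reading the 90% figure, the sketch below shows how scaling efficiency is commonly computed: measured speedup divided by the ideal (linear) speedup from adding GPUs. The source does not state the baseline cluster size, so the 512-GPU baseline here is a hypothetical placeholder, and the function names are illustrative only.

```python
def scaling_efficiency(base_gpus, base_time, gpus, time):
    """Measured speedup divided by ideal linear speedup."""
    speedup = base_time / time    # how much faster the larger run finished
    ideal = gpus / base_gpus      # speedup if scaling were perfectly linear
    return speedup / ideal


def implied_speedup(efficiency, base_gpus, gpus):
    """Speedup implied by a given scaling efficiency."""
    return efficiency * gpus / base_gpus


# Figures from the text: 2,496 Blackwell GPUs, 1,248 Grace CPUs, 90% efficiency.
# ASSUMPTION: the 512-GPU baseline is hypothetical and not given in the source.
gpus, cpus, base_gpus = 2496, 1248, 512
print(f"GPUs per Grace CPU: {gpus // cpus}")
print(f"ideal speedup over {base_gpus} GPUs: {gpus / base_gpus:.2f}x")
print(f"speedup at 90% efficiency: {implied_speedup(0.90, base_gpus, gpus):.2f}x")
```

With a different baseline the absolute speedup changes, but the relationship is the same: 90% efficiency means the cluster delivers nine tenths of the speedup that perfectly linear scaling would give.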

The new MLPerf Training v5.0 benchmark suite introduces a pretraining benchmark based on the Llama 3.1 405B generative AI system, the largest model to be introduced in the training benchmark suite. On this benchmark, Blackwell delivered 2.2x greater performance compared with the previous-generation architecture at the same scale. Furthermore, on the Llama 2 70B LoRA fine-tuning benchmark, NVIDIA DGX B200 systems, powered by eight Blackwell GPUs, delivered 2.5x more performance compared with a submission using the same number of GPUs in the prior round. These performance gains highlight advancements in the Blackwell architecture and software stack, including high-density liquid-cooled racks, fifth-generation NVLink and NVLink Switch interconnect technologies, and NVIDIA Quantum-2 InfiniBand networking.

References :
  • NVIDIA Newsroom: NVIDIA Blackwell Delivers Breakthrough Performance in Latest MLPerf Training Results
  • NVIDIA Technical Blog: NVIDIA Blackwell Delivers up to 2.6x Higher Performance in MLPerf Training v5.0
  • IEEE Spectrum: Nvidia’s Blackwell Conquers Largest LLM Training Benchmark
  • NVIDIA Technical Blog: Reproducing NVIDIA MLPerf v5.0 Training Scores for LLM Benchmarks
  • AI News | VentureBeat: Nvidia says its Blackwell chips lead benchmarks in training AI LLMs
  • blogs.nvidia.com: NVIDIA RTX Blackwell GPUs Accelerate Professional-Grade Video Editing
  • MLCommons: New MLCommons MLPerf Training v5.0 Benchmark Results Reflect Rapid Growth and Evolution of the Field of AI
  • www.aiwire.net: MLPerf Training v5.0 results show Nvidia’s Blackwell GB200 accelerators sprinting through record time-to-train scores.
  • blogs.nvidia.com: NVIDIA is working with companies worldwide to build out AI factories — speeding the training and deployment of next-generation AI applications that use the latest advancements in training and inference. The NVIDIA Blackwell architecture is built to meet the heightened performance requirements of these new applications. In the latest round of MLPerf Training — the
  • mlcommons.org: New MLCommons MLPerf Training v5.0 Benchmark Results Reflect Rapid Growth and Evolution of the Field of AI
  • NVIDIA Newsroom: NVIDIA RTX Blackwell GPUs Accelerate Professional-Grade Video Editing
  • ServeTheHome: The new MLPerf Training v5.0 are dominated by NVIDIA Blackwell and Hopper results, but we also get AMD Instinct MI325X on a benchmark as well
  • AIwire: Blackwell GPUs Lift Nvidia to the Top of MLPerf Training Rankings
  • www.servethehome.com: MLPerf Training v5.0 is Out

@www.quantamagazine.org //
Researchers are making strides in AI reasoning and efficiency, tackling both complex problem-solving and the energy consumption of these systems. One promising area involves reversible computing, where programs can run backward as easily as forward, theoretically saving energy by avoiding data deletion. Michael Frank, a researcher interested in the physical limits of computation, discovered that reversible computing could keep computational progress going as traditional computing slows due to physical limitations. Christof Teuscher at Portland State University emphasized the potential for significant power savings with this approach.

The LLM-as-a-Judge paradigm is also evolving. Meta AI has introduced the J1 framework, which shifts LLMs from passive generators to active, deliberative evaluators. This approach, detailed in "J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning," addresses the growing need for rigorous and scalable evaluation as AI systems become more capable and widely deployed. By reframing judgment as a structured reasoning task trained through reinforcement learning, J1 aims to create models that perform consistent, interpretable, and high-fidelity evaluations.

Soheil Feizi, an associate professor at the University of Maryland, has received a $1 million federal grant to advance foundational research in reasoning AI models. This funding, stemming from a Presidential Early Career Award for Scientists and Engineers (PECASE), will support his work in defending large language models (LLMs) against attacks, identifying weaknesses in how these models learn, encouraging transparent, step-by-step logic, and understanding the "reasoning tokens" that drive decision-making. Feizi plans to explore approaches like live activation probing and novel reinforcement-learning designs, aiming to turn theoretical advances into practical, real-world applications.

@www.marktechpost.com //
Large Language Models (LLMs) are facing significant challenges in handling real-world conversations, particularly those involving multiple turns and underspecified tasks. Researchers from Microsoft and Salesforce have recently revealed a substantial performance drop of 39% in LLMs when confronted with such conversational scenarios. This decline highlights the difficulty these models have in maintaining contextual coherence and delivering accurate outcomes as conversations evolve and new information is incrementally introduced. Instead of flexibly adjusting to changing user inputs, LLMs often make premature assumptions, leading to errors that persist throughout the dialogue.

These findings underscore a critical gap in how LLMs are currently evaluated. Traditional benchmarks often rely on single-turn, fully-specified prompts, which fail to capture the complexities of real-world interactions where information is fragmented and context must be actively constructed from multiple exchanges. This discrepancy between evaluation methods and actual conversational demands contributes to the challenges LLMs face in integrating underspecified inputs and adapting to evolving user needs. The research emphasizes the need for new evaluation frameworks that better reflect the dynamic and iterative nature of real-world conversations.

In contrast to these challenges, Google DeepMind has developed AlphaEvolve, an AI agent designed to optimize code and reclaim computational resources. AlphaEvolve autonomously rewrites critical code, resulting in a 0.7% reduction in Google's overall compute usage. The system not only pays for itself but also demonstrates the potential for AI agents to significantly improve efficiency in complex computational environments. Its architecture, featuring a controller, fast-draft models, deep-thinking models, automated evaluators, and versioned memory, represents a production-grade approach to agent engineering that allows for continuous improvement at scale.
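The components named above can be arranged into a simple propose, evaluate, and keep loop. The sketch below is a toy illustration only, not DeepMind's implementation: the "models" are random perturbations of a single number, the evaluator is a made-up cost function, and every function name is hypothetical. It only shows how a controller might route candidates between the pieces.

```python
import random


def evaluate(candidate):
    """Automated evaluator (toy): lower cost is better."""
    return (candidate - 42.0) ** 2


def fast_draft(parent, rng, n=8):
    """Stand-in for a fast-draft model: many cheap, broad variations."""
    return [parent + rng.uniform(-10, 10) for _ in range(n)]


def deep_think(parent, rng, n=2):
    """Stand-in for a deep-thinking model: a few small, careful refinements."""
    return [parent + rng.uniform(-1, 1) for _ in range(n)]


def controller(generations=50, seed=0):
    """Run the loop and return versioned memory as (version, candidate, cost)."""
    rng = random.Random(seed)
    memory = [(0, 0.0, evaluate(0.0))]
    for version in range(1, generations + 1):
        # Pick the best candidate recorded so far as the parent.
        _, best, best_cost = min(memory, key=lambda entry: entry[2])
        candidates = fast_draft(best, rng) + deep_think(best, rng)
        winner = min(candidates, key=evaluate)
        # Only record a new version when the evaluator confirms an improvement.
        if evaluate(winner) < best_cost:
            memory.append((version, winner, evaluate(winner)))
    return memory


history = controller()
print(f"versions kept: {len(history)}")
print(f"initial cost: {history[0][2]:.2f}, final cost: {history[-1][2]:.4f}")
```

The design choice worth noting is that nothing enters memory without passing the evaluator, so the stored history only ever improves, whatever the proposing models suggest.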

References :
  • AI News | VentureBeat: Google’s AlphaEvolve: The AI agent that reclaimed 0.7% of Google’s compute – and how to copy it.
  • MarkTechPost: LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks.