News from the AI & ML world

DeeperML - #aiinference

Noah Kravitz@NVIDIA Blog //
NVIDIA is making strides in both agentic AI and open-source initiatives. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications. Enterprises are now deploying AI agents to free human workers from time-consuming and error-prone tasks, allowing them to focus on high-value work that requires creativity and strategic thinking. NVIDIA AI Blueprints help enterprises build their own AI agents.

NVIDIA has announced the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, now available under the Apache 2.0 license. Originally developed within the Run:ai platform, the KAI Scheduler is now available to the community while also continuing to be packaged and delivered as part of the NVIDIA Run:ai platform. The KAI Scheduler is designed to optimize the scheduling of GPU resources and tackle challenges associated with managing AI workloads on GPUs and CPUs.
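
For readers who want to see what using the scheduler looks like, here is a minimal sketch that submits a single-GPU pod through the standard Kubernetes Python client and asks KAI Scheduler to place it. The scheduler name, the queue label key, and the container image are assumptions based on the project's documented conventions, so check the KAI Scheduler documentation for the exact values in your cluster.

```python
# Minimal sketch: submit a GPU pod and hand placement to KAI Scheduler.
# The scheduler name ("kai-scheduler") and queue label key are assumptions;
# verify them against the KAI Scheduler documentation for your install.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="llm-inference-worker",
        labels={"kai.scheduler/queue": "team-a"},  # assumed queue label key
    ),
    spec=client.V1PodSpec(
        scheduler_name="kai-scheduler",  # route placement through KAI Scheduler instead of the default scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="worker",
                image="nvidia/cuda:12.4.1-runtime-ubuntu22.04",  # example image
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},  # request one GPU
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```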

References:
  • NVIDIA Newsroom: Speed Demon: NVIDIA Blackwell Takes Pole Position in Latest MLPerf Inference Results
  • NVIDIA Technical Blog: The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of growing model sizes, real-time latency...
  • IEEE Spectrum: Nvidia Blackwell Ahead in AI Inference, AMD Second
  • insideAI News: MLCommons Releases MLPerf Inference v5.0 Benchmark Results
  • insidehpc.com: MLCommons Releases New MLPerf Inference v5.0 Benchmark Results
  • www.networkworld.com: Nvidia’s Blackwell raises the bar with new MLPerf Inference V5.0 results
  • NVIDIA Newsroom: AI is rapidly transforming how organizations solve complex challenges. The early stages of enterprise AI adoption focused on using large language models to create chatbots. Now, enterprises are using agentic AI to create intelligent systems that reason, act and execute complex tasks with a degree of autonomy.
  • NVIDIA Technical Blog: Today, NVIDIA announced the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, now available under the Apache 2.0 license.
  • insideAI News: Today, NVIDIA posted a blog announcing the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, now available under the Apache 2.0 license.
  • Developer Tech News: NVIDIA has open-sourced KAI Scheduler, a Kubernetes solution designed to optimise the scheduling of GPU resources.
  • ServeTheHome: MLPerf Inference v5.0 Results Released

@tomshardware.com //
Nvidia has unveiled its next-generation data center GPU, the Blackwell Ultra, at its GTC event in San Jose. Expanding on the Blackwell architecture, the Blackwell Ultra GPU will be integrated into the DGX GB300 and DGX B300 systems. The DGX GB300 system, designed with a rack-scale, liquid-cooled architecture, is powered by the Grace Blackwell Ultra Superchip, combining 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell Ultra GPUs. The Blackwell Ultra B300 packs up to 288GB of HBM3e memory and offers 1.5X the compute potential of the existing B200.

The Blackwell Ultra GPU promises a 70x speedup in AI inference and reasoning compared to the previous Hopper-based generation. This improvement is achieved through hardware and networking advancements in the DGX GB300 system. Blackwell Ultra is designed to meet the demand for test-time scaling inference with a 1.5X increase in FP4 compute. Nvidia's CEO, Jensen Huang, suggests that the new Blackwell chips render the previous generation obsolete, emphasizing the significant leap forward in AI infrastructure.

References:
  • AIwire: Nvidia’s DGX AI Systems Are Faster and Smarter Than Ever
  • www.tomshardware.com: Nvidia officially revealed its Blackwell Ultra B300 data center GPU, which packs up to 288GB of HBM3e memory and offers 1.5X the compute potential of the existing B200 solution.
  • BigDATAwire: Nvidia's GTC 2025 conference showcased the new Blackwell Ultra GPUs and updates to its AI infrastructure portfolio.
  • www.laptopmag.com: Blackwell Ultra and Rubin Ultra are Nvidia's newest additions to the growing list of AI superchips
  • BigDATAwire: Nvidia used its GTC conference today to introduce new GPU superchips, including the second generation of its current Grace Blackwell chip, as well as the next generation, dubbed the Vera Rubin.
  • venturebeat.com: Nvidia's GTC 2025 keynote highlighted advancements in AI infrastructure, featuring the Blackwell Ultra GB300 chips.
  • Analytics Vidhya: An overview of Nvidia's GTC 2025 announcements, including new GPUs and advancements in AI hardware.
  • AI News: NVIDIA Dynamo: Scaling AI inference with open-source efficiency
  • www.tomshardware.com: Nvidia unveils DGX Station workstation PCs with GB300 Blackwell Ultra inside
  • BigDATAwire: Nvidia Preps for 100x Surge in Inference Workloads, Thanks to Reasoning AI Agents
  • Data Phoenix: Nvidia introduces the Blackwell Ultra to support the rise of AI reasoning, agents, and physical AI
  • The Next Platform: This article discusses Nvidia's new advancements in AI, and how the company is looking to capture market share and the challenges they face.

Harsh Mishra@Analytics Vidhya //
DeepSeek AI has been making significant contributions to the open-source community, particularly in the realm of AI model efficiency and accessibility. They recently launched the Fire-Flyer File System (3FS), a high-performance distributed file system tailored for AI training and inference workloads. This system is designed to address the challenges of managing large-scale, concurrent data access, a common bottleneck in traditional file systems. 3FS leverages modern SSDs and RDMA networks, offering a shared storage layer that facilitates the development of distributed applications by bypassing limitations seen in more traditional, locality-dependent file systems.

DeepSeek's commitment extends to data processing and model optimization. They have introduced the Smallpond framework for data processing and released quantized DeepSeek-R1 models, optimized for deployment-ready reasoning tasks. The quantized models, covering distilled variants such as Llama-8B, Llama-70B, Qwen-1.5B, Qwen-7B, Qwen-14B, and Qwen-32B, are available as a Hugging Face collection with evaluations, benchmarks, and setup instructions. These models maintain competitive reasoning accuracy while unlocking significant inference speedups.
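
As a rough illustration of what "deployment-ready" means here, the sketch below serves one of the quantized distilled checkpoints with vLLM. The model ID is a placeholder rather than an exact repository name, and the sampling settings are arbitrary; substitute the checkpoint you pick from the Hugging Face collection.

```python
# Hypothetical sketch: serving a quantized DeepSeek-R1 distilled model with vLLM.
# The model ID below is a placeholder; use the exact repository name from the
# Hugging Face collection referenced above.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/DeepSeek-R1-Distill-Llama-8B-quantized")  # placeholder ID

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(
    ["Explain briefly why weight quantization can speed up LLM inference."],
    params,
)
print(outputs[0].outputs[0].text)
```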

References:
  • Analytics Vidhya: DeepSeek #OpenSourceWeek Day 5: Launch of 3FS and Smallpond Framework
  • MarkTechPost: DeepSeek AI Releases Fire-Flyer File System (3FS): A High-Performance Distributed File System Designed to Address the Challenges of AI Training and Inference Workload
  • Neural Magic: Quantized DeepSeek-R1 Models: Deployment-Ready Reasoning Models
  • MarkTechPost: DeepSeek AI Releases Smallpond: A Lightweight Data Processing Framework Built on DuckDB and 3FS
  • www.itpro.com: ‘Awesome for the community’: DeepSeek open sourced its code repositories, and experts think it could give competitors a scare

staff@insideAI News //
MLCommons has released the latest MLPerf Inference v5.0 benchmark results, highlighting the growing importance of generative AI in the machine learning landscape. The new benchmarks feature tests for large language models (LLMs) like Llama 3.1 405B and Llama 2 70B Interactive, designed to evaluate how well systems perform in real-world applications requiring agentic reasoning and low-latency responses. This shift reflects the industry's increasing focus on deploying generative AI and the need for hardware and software optimized for these demanding workloads.

The v5.0 results reveal significant performance improvements driven by advancements in both hardware and software. The median submitted score for Llama 2 70B has doubled compared to a year ago, and the best score is 3.3 times faster than Inference v4.0. These gains are attributed to innovations like support for lower-precision computation formats such as FP4, which allows for more efficient processing of large models. The MLPerf Inference benchmark suite evaluates machine learning performance in a way that is architecture-neutral, reproducible, and representative of real-world workloads.
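
To make the "architecture-neutral, reproducible" part concrete: every submission wires its own system under test (SUT) into MLCommons' LoadGen library, which generates queries for the chosen scenario and measures latency and throughput. The sketch below shows the rough shape of that wiring in Python; the model call is a stub, and the exact LoadGen bindings may differ slightly between releases, so treat it as illustrative only.

```python
# Rough sketch of wiring a system under test (SUT) into MLPerf LoadGen.
# run_model() is a stub standing in for real inference; LoadGen API details
# can vary between releases.
import mlperf_loadgen as lg

def run_model(sample_index):
    return b""  # stub: a real submission runs inference on the indexed sample

def issue_queries(query_samples):
    # LoadGen hands us a batch of queries; we answer each and report completion.
    responses = [lg.QuerySampleResponse(qs.id, 0, 0) for qs in query_samples]
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline   # e.g. Offline or Server
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(1024, 1024, lambda indices: None, lambda indices: None)  # dataset stubs
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```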

References:
  • insideAI News: Today, MLCommons announced new results for its MLPerf Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking.
  • AIwire: MLPerf v5.0 Reflects the Shift Toward Reasoning in AI Inference
  • ServeTheHome: MLPerf Inference v5.0 Results Released
  • insidehpc.com: MLCommons Releases MLPerf Inference v5.0 Benchmark Results
  • www.networkworld.com: New MLCommons benchmarks to test AI infrastructure performance

Ryan Daws@AI News //
NVIDIA has launched Dynamo, an open-source inference software, designed to accelerate and scale reasoning models within AI factories. Dynamo succeeds the NVIDIA Triton Inference Server, representing a new generation of AI inference software specifically engineered to maximize token revenue generation for AI factories deploying reasoning AI models. The software orchestrates and accelerates inference communication across thousands of GPUs, utilizing disaggregated serving.

Dynamo optimizes AI factories by dynamically managing GPU resources in real time to adapt to request volumes. Its inference optimizations have been shown to boost the number of tokens generated per GPU by more than 30 times, and to double the performance and revenue of AI factories serving Llama models on NVIDIA's current Hopper platform.
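
Dynamo's own APIs aren't quoted in the sources above, but the disaggregated-serving idea it relies on, running the compute-heavy prefill phase and the memory-bound decode phase on separate GPU pools and passing the intermediate state between them, can be illustrated with a toy router. The sketch below is purely conceptual and is not Dynamo code.

```python
# Conceptual sketch of disaggregated serving (NOT Dynamo's API): prefill and
# decode run on separate worker pools, and the prefill output (a stand-in for
# the KV cache) is handed to a decode worker, so each pool can be sized and
# batched independently.
from dataclasses import dataclass
import itertools

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class PrefillWorker:
    def run(self, req: Request) -> dict:
        # A real system would run the full prompt through the model and return
        # the resulting KV cache; we return a placeholder instead.
        return {"prompt": req.prompt, "kv_cache": f"<kv for {len(req.prompt)} chars>"}

class DecodeWorker:
    def run(self, state: dict, max_new_tokens: int) -> str:
        # A real system would generate tokens one at a time from the KV cache.
        return state["prompt"] + " ... [generated tokens]"

class DisaggregatedRouter:
    """Round-robin requests across separate prefill and decode pools."""
    def __init__(self, prefill_pool, decode_pool):
        self._prefill = itertools.cycle(prefill_pool)
        self._decode = itertools.cycle(decode_pool)

    def serve(self, req: Request) -> str:
        state = next(self._prefill).run(req)                        # compute-bound phase
        return next(self._decode).run(state, req.max_new_tokens)   # memory-bound phase

router = DisaggregatedRouter([PrefillWorker() for _ in range(2)],
                             [DecodeWorker() for _ in range(4)])
print(router.serve(Request("Why route prefill and decode separately?", 64)))
```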

References:
  • AI News: NVIDIA Dynamo: Scaling AI inference with open-source efficiency
  • BigDATAwire: At its GTC event in San Jose today, Nvidia unveiled updates to its AI infrastructure portfolio, including its next-generation datacenter GPU, the NVIDIA Blackwell Ultra.
  • AIwire: Nvidia’s DGX AI Systems Are Faster and Smarter Than Ever
  • NVIDIA Newsroom: NVIDIA Blackwell Powers Real-Time AI for Entertainment Workflows
  • MarkTechPost: Details the Open Sourcing of Dynamo

Jaime Hampton@AIwire //
Cerebras Systems is expanding its role in AI inference with a new partnership with Hugging Face and the launch of six new AI datacenters across North America and Europe. The partnership with Hugging Face integrates Cerebras' inference capabilities into the Hugging Face Hub, granting access to the platform's five million developers. This integration allows developers to use Cerebras as their inference provider for models like Llama 3.3 70B, powered by Cerebras CS-3 systems.
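
In practice, routing a request to Cerebras from the Hub side looks roughly like the sketch below. It assumes a recent huggingface_hub release with inference-provider support and a valid Hugging Face token; the model ID is the Llama 3.3 70B instruct checkpoint mentioned above.

```python
# Minimal sketch: sending a chat request to Cerebras via Hugging Face's
# inference-provider routing. Assumes a recent huggingface_hub release with
# provider support and a valid HF token.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="cerebras", api_key="hf_xxx")  # your Hugging Face token

completion = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "In one sentence, what is an inference provider?"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```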

Cerebras is also launching six new AI inference datacenters located across North America and Europe. Once fully operational, these centers are expected to significantly increase Cerebras' capacity to handle high-speed inference workloads, supporting over 40 million Llama 70B tokens per second. The expansion includes facilities in Dallas, Minneapolis, Oklahoma City, Montreal, New York and France, with 85% of the total capacity located in the United States.

References:
  • venturebeat.com: Cerebras just announced 6 new AI datacenters that process 40M tokens per second — and it could be bad news for Nvidia
  • AIwire: Cerebras Scales AI Inference with Hugging Face Partnership and Datacenter Expansion
  • THE DECODER: Nvidia rival Cerebras opens six data centers for rapid AI inference

Ben Lorica@Gradient Flow //
Nvidia's Dynamo is a new open-source framework designed to tackle the complexities of scaling AI inference operations. Dynamo optimizes how large language models operate across multiple GPUs, balancing individual performance with system-wide throughput. Introduced at the GPU Technology Conference, Dynamo was described by Nvidia CEO Jensen Huang as "the operating system of an AI factory".

The framework includes components designed to function as an "air traffic control system" for AI processing, and it works with inference libraries such as TensorRT-LLM and SGLang, which provide efficient mechanisms for handling token generation, memory management, and batch processing to improve throughput and reduce latency when serving AI models. Separately, Nvidia's Hymba combines transformers and state-space models to reduce costs and increase speed while maintaining accuracy.

References:
  • Gradient Flow: Diving into Nvidia Dynamo: AI Inference at Scale
  • bdtechtalks.com: Nvidia’s Hymba is an efficient SLM that combines state-space models and transformers
  • MarkTechPost: NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models LLMs can be Effectively Parallelized