News from the AI & ML world

DeeperML - #aiinference

staff@insideAI News //
NVIDIA has reportedly set a new AI world record, achieving over 1,000 tokens per second (TPS) per user with Meta's Llama 4 Maverick large language model. The result was accomplished on NVIDIA's DGX B200 node, which is equipped with eight Blackwell GPUs, and was independently measured by the AI benchmarking service Artificial Analysis. NVIDIA's Blackwell architecture offers substantial improvements in processing power, enabling faster inference for large language models.

This record-breaking result was achieved through extensive software optimizations, including TensorRT-LLM and a speculative decoding draft model trained with EAGLE-3 techniques. These optimizations alone delivered a 4x performance increase over Blackwell's previous best results. NVIDIA also leveraged the FP8 data type for GEMM, Mixture of Experts (MoE), and attention operations to reduce model size and capitalize on the high FP8 throughput of Blackwell Tensor Cores. The company claims that accuracy with FP8 matches BF16 across many of the Artificial Analysis metrics.
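The draft-model speedup works by letting a small model propose several tokens that the large model then verifies in a single forward pass. The sketch below is a minimal, greedy illustration of that idea, not NVIDIA's EAGLE-3 pipeline: target_logits_fn and draft_logits_fn are hypothetical stand-ins for the two models, and production systems use a probabilistic acceptance rule rather than exact-match verification.

```python
import numpy as np

def speculative_decode(target_logits_fn, draft_logits_fn, prompt, k=4, max_new=64):
    """Greedy draft-and-verify speculative decoding (simplified illustration).

    Both *_logits_fn arguments are hypothetical callables that map a token
    sequence to next-token logits for every position.
    """
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        draft = list(seq)
        for _ in range(k):
            draft.append(int(np.argmax(draft_logits_fn(draft)[-1])))
        proposed = draft[len(seq):]

        # 2) The large target model scores the whole proposal in ONE forward pass.
        logits = target_logits_fn(draft)              # shape: [len(draft), vocab]
        accepted = []
        for i, tok in enumerate(proposed):
            target_choice = int(np.argmax(logits[len(seq) + i - 1]))
            if target_choice == tok:
                accepted.append(tok)                  # draft matched the target
            else:
                accepted.append(target_choice)        # take the target's token, stop
                break
        seq.extend(accepted)                          # at least one token per round
    return seq
```

The per-user speedup comes from amortizing the large model's forward pass over several tokens whenever the draft guesses correctly.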

NVIDIA reports that Blackwell reaches 72,000 TPS/server at its highest-throughput configuration. This milestone underscores the considerable progress made in AI inference capabilities through NVIDIA's hardware and software innovations, clearing the way for more efficient and responsive AI applications.

Recommended read:
References :
  • insideAI News: Details of AI Inference: NVIDIA Reports Blackwell Surpasses 1000 TPS/User Barrier with Llama 4 Maverick
  • insidehpc.com: Details on NVIDIA Reports Blackwell Surpasses 1000 TPS/User Barrier with Meta’s Llama 4 Maverick
  • www.tomshardware.com: Reports that Nvidia has broken another AI world record, breaking over 1,000 TPS/user with a DGX B200 node boasting eight Blackwell GPUs inside.
  • insidehpc.com: NVIDIA said it has achieved a record large language model (LLM) inference speed, announcing that an NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs achieved more than 1,000 tokens per second (TPS) per user on the 400-billion-parameter Llama 4 Maverick model.
  • NVIDIA Technical Blog: NVIDIA has achieved a world-record large language model (LLM) inference speed. A single NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs can achieve over...
  • analyticsindiamag.com: Texas Instruments (TI) has announced a collaboration with NVIDIA to develop new power management and sensing technology aimed at supporting future high-voltage power systems in AI data centres.
  • www.servethehome.com: The Intel Xeon 6 with priority cores wins big at NVIDIA but there is a lot more going on in the release than meets the eye

@cloud.google.com //
References: Compute, github.com, cloud.google.com ...
Google Cloud is enhancing its AI Hypercomputer to accelerate AI inference workloads, focusing on maximizing performance and reducing costs for generative AI applications. At Google Cloud Next 25, Google shared updates to AI Hypercomputer's inference capabilities, showcasing Ironwood, its newest Tensor Processing Unit (TPU) designed for inference. Software enhancements include simple, performant inference using vLLM on TPU and the latest GKE inference capabilities, such as GKE Inference Gateway and GKE Inference Quickstart. With AI Hypercomputer, Google is paving the way for the next phase of AI's rapid evolution.

Google's JetStream inference engine incorporates new performance optimizations, integrating Pathways for ultra-low-latency multi-host, disaggregated serving. The sixth-generation Trillium TPU delivers 2.9x higher throughput for Llama 2 70B and 2.8x for Mixtral 8x7B compared to TPU v5e. By bringing vLLM to TPU alongside JetStream, its JAX-based inference engine, Google offers more choice when serving LLMs on TPU while maximizing performance and reducing inference costs. JetStream throughput has also improved, reaching 1,703 tokens/s for Llama 3.1 405B on Trillium.
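For readers unfamiliar with vLLM, the snippet below is a minimal sketch of the developer-facing side of serving an open model with it; the model name and sampling settings are illustrative, and running on TPU requires vLLM's TPU backend rather than the default GPU build.

```python
# Minimal vLLM offline-inference sketch; model name and settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")      # assumes access to the weights
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain disaggregated serving in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```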

Google is also intensifying its efforts to combat online scams by integrating artificial intelligence across Search, Chrome, and Android. AI is central to Google's anti-scam strategy, blocking hundreds of millions of scam results daily and identifying more fraudulent pages. Gemini Nano provides instant detection of high-risk websites, helping counter new and evolving scams across platforms. Google has long used AI to detect and block scams, including fake tech support, fraudulent financial services, and phishing links. Recent updates to AI classifiers now allow the company to detect 20 times more scam pages, improving the quality of search results by reducing exposure to harmful sites.

Recommended read:
References :
  • Compute: From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer
  • github.com: To maximize performance and reduce inference costs, we are excited to offer more choice when serving LLMs on TPU, further enhancing JetStream and bringing vLLM, a widely adopted, fast and efficient library for serving LLMs.
  • blog.google: blog.google
  • cloud.google.com: Accelerating AI inference with Google Cloud TPUs and GPUs

staff@insideAI News //
References: insideAI News, Ken Yeung
Meta is partnering with Cerebras to enhance AI inference speeds within Meta's new Llama API. This collaboration combines Meta's open-source Llama models with Cerebras' specialized inference technology, aiming to provide developers with significantly faster performance. According to Cerebras, developers building on the Llama 4 Cerebras model within the API can expect inference speeds up to 18 times faster than traditional GPU-based solutions. This acceleration is expected to unlock new possibilities for real-time and agentic AI applications, making complex tasks like low-latency voice interaction, interactive code generation, and real-time reasoning more feasible.

This partnership allows Cerebras to expand its reach to a broader developer audience, strengthening its existing relationship with Meta. Since launching its inference solutions in 2024, Cerebras has emphasized its ability to deliver rapid Llama inference, serving billions of tokens through its AI infrastructure. Andrew Feldman, CEO and co-founder of Cerebras, stated that the company is proud to make Llama API the fastest inference API available, empowering developers to create AI systems previously unattainable with GPU-based inference clouds. Independent benchmarks by Artificial Analysis support this claim, indicating that Cerebras achieves significantly higher token processing speeds compared to platforms like ChatGPT and DeepSeek.

Developers will have direct access to the enhanced Llama 4 inference by selecting Cerebras within the Llama API. Meta also continues to innovate with its AI app, testing new features such as "Reasoning" mode and "Voice Personalization," designed to enhance user interaction. The "Reasoning" feature could offer more transparent explanations for the AI's responses, while voice settings like "Focus on my voice" and "Welcome message" could provide more personalized audio interactions, which is especially relevant to Meta's hardware ambitions in areas such as smart glasses and augmented reality devices.
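Meta has not published the Llama API's request schema here, so the snippet below is only a hypothetical sketch of the OpenAI-compatible pattern that fast-inference providers typically expose; the base URL, model identifier, and provider field are assumptions, not the confirmed Llama API interface.

```python
# Hypothetical sketch only: an OpenAI-compatible chat request routed to a fast
# inference provider. The base URL, model name, and "provider" field are
# assumptions, NOT Meta's documented Llama API schema.
from openai import OpenAI

client = OpenAI(base_url="https://llama-api.example.com/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="llama-4-maverick",                      # assumed model identifier
    messages=[{"role": "user", "content": "Summarize today's AI inference news."}],
    extra_body={"provider": "cerebras"},           # assumed provider selector
)
print(resp.choices[0].message.content)
```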

Recommended read:
References :
  • insideAI News: Meta has teamed with Cerebras on AI inference in Meta’s new Llama API, combining Meta’s open-source Llama models with inference technology from Cerebras.
  • Ken Yeung: IN THIS ISSUE: Meta hosts its first-ever event around its Llama model, launching a standalone app to take on Microsoft’s Copilot and ChatGPT.

staff@insideAI News //
Google Cloud has unveiled its seventh-generation Tensor Processing Unit (TPU), named Ironwood. This custom AI accelerator is purpose-built for inference, marking a shift in Google's AI chip development strategy. While previous TPUs handled both training and inference, Ironwood is designed to optimize the deployment of trained AI models for making predictions and generating responses. According to Google, Ironwood will allow for a new "age of inference" where AI agents proactively retrieve and generate data, delivering insights and answers rather than just raw data.

Ironwood boasts impressive technical specifications. When scaled to 9,216 chips per pod, it delivers 42.5 exaflops of computing power. Each chip has a peak compute of 4,614 teraflops and is paired with 192GB of High Bandwidth Memory, with memory bandwidth reaching 7.2 terabytes per second per chip. Google highlights that Ironwood delivers twice the performance per watt of its predecessor and is nearly 30 times more power-efficient than Google's first Cloud TPU from 2018.
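Those figures are self-consistent: the pod-level number is simply the per-chip peak multiplied across the pod, as the quick check below shows.

```python
# Sanity check: per-chip peak compute x chips per pod = pod-scale compute.
chips_per_pod = 9_216
per_chip_tflops = 4_614                                # peak TFLOPs per chip, as reported
pod_exaflops = chips_per_pod * per_chip_tflops / 1e6   # 1 exaflop = 1e6 teraflops
print(f"{pod_exaflops:.1f} exaflops")                  # -> 42.5
```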

The focus on inference highlights a pivotal shift in the AI landscape. The industry has seen extensive development of large foundation models, and Ironwood is designed to manage the computational demands of these complex "thinking models," including large language models and Mixture of Experts (MoEs). Its architecture includes a low-latency, high-bandwidth Inter-Chip Interconnect (ICI) network to support coordinated communication at full TPU pod scale. The new TPU scales up to 9,216 liquid-cooled chips. This innovation is aimed at applications requiring real-time processing and predictions, and promises higher intelligence at lower costs.

Recommended read:
References :
  • insidehpc.com: Google Cloud today introduced its seventh-generation Tensor Processing Unit, "Ironwood," which the company said is its most performant and scalable custom AI accelerator and the first designed specifically for inference.
  • www.bigdatawire.com: Google Cloud Preps for Agentic AI Era with ‘Ironwood’ TPU, New Models and Software
  • www.nextplatform.com: With "Ironwood" TPU, Google Pushes The AI Accelerator To The Floor
  • insideAI News: Google today introduced its seventh-generation Tensor Processing Unit, "Ironwood," which the company said is its most performant and scalable custom AI accelerator and the first designed specifically for inference.
  • venturebeat.com: Google's new Ironwood chip is 24x more powerful than the world's fastest supercomputer.
  • the-decoder.com: Google unveils new AI models, infrastructure, and agent protocol at Cloud Next
  • AI News | VentureBeat: Google’s new Agent Development Kit lets enterprises rapidly prototype and deploy AI agents without recoding
  • Compute: Introducing Ironwood TPUs and new innovations in AI Hypercomputer
  • Ken Yeung: Google Pushes Agent Interoperability With New Dev Kit and Agent2Agent Standard
  • The Tech Basic: Details Google Cloud's New AI Chip.
  • venturebeat.com: Google unveils Ironwood TPUs, Gemini 2.5 "thinking models," and Agent2Agent protocol at Cloud Next '25, challenging Microsoft and Amazon with a comprehensive AI strategy that enables multiple AI systems to work together across platforms.
  • www.marktechpost.com: Google AI Introduces Ironwood: A Google TPU Purpose-Built for the Age of Inference
  • Kyle Wiggers: Ironwood is Google’s newest AI accelerator chip

staff@insideAI News //
References: insideAI News, AIwire, insidehpc.com ...
MLCommons has released the latest MLPerf Inference v5.0 benchmark results, highlighting the growing importance of generative AI in the machine learning landscape. The new benchmarks feature tests for large language models (LLMs) like Llama 3.1 405B and Llama 2 70B Interactive, designed to evaluate how well systems perform in real-world applications requiring agentic reasoning and low-latency responses. This shift reflects the industry's increasing focus on deploying generative AI and the need for hardware and software optimized for these demanding workloads.

The v5.0 results reveal significant performance improvements driven by advancements in both hardware and software. The median submitted score for Llama 2 70B has doubled compared to a year ago, and the best score is 3.3 times faster than Inference v4.0. These gains are attributed to innovations like support for lower-precision computation formats such as FP4, which allows for more efficient processing of large models. The MLPerf Inference benchmark suite evaluates machine learning performance in a way that is architecture-neutral, reproducible, and representative of real-world workloads.
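FP4 here is a 4-bit floating-point format (E2M1) with only eight representable magnitudes, so weights and activations are rounded onto a very coarse grid after scaling. The snippet below is a minimal "fake quantization" illustration of that rounding, assuming a simple per-tensor scale; it is not the MLPerf harness or any vendor's kernel.

```python
import numpy as np

# Representable magnitudes of FP4 (E2M1); the sign bit is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Round x to the nearest FP4-representable value after per-tensor scaling."""
    scale = np.max(np.abs(x)) / FP4_GRID[-1] + 1e-12     # map max |x| onto 6.0
    idx = np.abs(np.abs(x)[..., None] / scale - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx] * scale

w = np.random.randn(4, 4).astype(np.float32)
print(np.round(fake_quantize_fp4(w), 3))                 # coarsely rounded copy of w
```

Production FP4 schemes typically use finer-grained (per-block) scales to limit the accuracy loss from such coarse rounding.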

Recommended read:
References :
  • insideAI News: Today, MLCommons announced new results for its MLPerf Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking. The organization said the results highlight that the AI community is focusing on generative AI ...
  • AIwire: MLPerf v5.0 Reflects the Shift Toward Reasoning in AI Inference
  • ServeTheHome: The new MLPerf Inference v5.0 results are out with new submissions for configurations from NVIDIA, Intel Xeon, and AMD Instinct MI325X.
  • insidehpc.com: MLCommons Releases MLPerf Inference v5.0 Benchmark Results
  • www.networkworld.com: New MLCommons benchmarks to test AI infrastructure performance
  • SLVIKI.ORG: MLCommons Launches Next-Gen AI Benchmarks to Test the Limits of Generative Intelligence

@www.intel.com //
NVIDIA's Blackwell platform has dominated the latest MLPerf Inference V5.0 benchmarks, showcasing significant performance improvements in AI reasoning. The NVIDIA GB200 NVL72 system, featuring 72 Blackwell GPUs, achieved up to 30x higher throughput on the Llama 3.1 405B benchmark compared to the NVIDIA H200 NVL8 submission. This was driven by more than triple the performance per GPU and a 9x larger NVIDIA NVLink interconnect domain. The latest MLPerf results reflect the shift toward reasoning in AI inference.

Alongside this achievement, NVIDIA is open-sourcing the KAI Scheduler, a Kubernetes GPU scheduling solution, as part of its commitment to open-source AI innovation. Previously a core component of the Run:ai platform, KAI Scheduler is now available under the Apache 2.0 license. This solution is designed to address the unique challenges of managing AI workloads that utilize both GPUs and CPUs. According to NVIDIA, this will help in managing fluctuating GPU demands, which traditional resource schedulers struggle to handle.

Recommended read:
References :
  • NVIDIA Newsroom: Speed Demon: NVIDIA Blackwell Takes Pole Position in Latest MLPerf Inference Results
  • IEEE Spectrum: Nvidia Blackwell Ahead in AI Inference, AMD Second
  • AIwire: MLPerf v5.0 Reflects the Shift Toward Reasoning in AI Inference
  • Developer Tech News: KAI Scheduler: NVIDIA open-sources Kubernetes GPU scheduler

@www.intel.com //
NVIDIA is making strides in both agentic AI and open-source initiatives. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications. Enterprises are now deploying AI agents to free human workers from time-consuming and error-prone tasks, allowing them to focus on high-value work that requires creativity and strategic thinking. NVIDIA AI Blueprints help enterprises build their own AI agents.

NVIDIA has announced the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, now available under the Apache 2.0 license. Originally developed within the Run:ai platform, the KAI Scheduler is now available to the community while also continuing to be packaged and delivered as part of the NVIDIA Run:ai platform. The KAI Scheduler is designed to optimize the scheduling of GPU resources and tackle challenges associated with managing AI workloads on GPUs and CPUs.
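To make the scheduling problem concrete, here is a toy, from-scratch sketch of quota-aware GPU scheduling of the kind KAI Scheduler automates inside Kubernetes. It is not KAI's API or algorithm, just an illustration of admitting queued jobs against a fixed GPU pool while respecting per-team quotas.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    team: str
    gpus: int

@dataclass
class Cluster:
    free_gpus: int
    quota: dict                       # team -> max GPUs that team may use at once
    in_use: dict = field(default_factory=dict)

    def try_schedule(self, job: Job) -> bool:
        used = self.in_use.get(job.team, 0)
        if job.gpus <= self.free_gpus and used + job.gpus <= self.quota[job.team]:
            self.free_gpus -= job.gpus
            self.in_use[job.team] = used + job.gpus
            return True
        return False                  # stays queued until capacity or quota frees up

cluster = Cluster(free_gpus=8, quota={"research": 6, "prod": 4})
queue = [Job("train-llm", "research", 4), Job("serve-api", "prod", 2),
         Job("hpo-sweep", "research", 4)]    # more than the cluster can admit right now
for job in queue:
    print(job.name, "scheduled" if cluster.try_schedule(job) else "pending")
```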

Recommended read:
References :
  • NVIDIA Newsroom: Speed Demon: NVIDIA Blackwell Takes Pole Position in Latest MLPerf Inference Results
  • NVIDIA Technical Blog: The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of growing model sizes, real-time latency...
  • IEEE Spectrum: Nvidia Blackwell Ahead in AI Inference, AMD Second
  • insidehpc.com: MLCommons Releases New MLPerf Inference v5.0 Benchmark Results
  • www.networkworld.com: Nvidia’s Blackwell raises the bar with new MLPerf Inference V5.0 results
  • NVIDIA Newsroom: AI is rapidly transforming how organizations solve complex challenges. The early stages of enterprise AI adoption focused on using large language models to create chatbots. Now, enterprises are using agentic AI to create intelligent systems that reason, act and execute complex tasks with a degree of autonomy.
  • NVIDIA Technical Blog: Today, NVIDIA announced the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, now available under the Apache 2.0 license.
  • insideAI News: Today, NVIDIA posted a blog announcing the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, now available under the Apache 2.0 license.
  • Developer Tech News: NVIDIA has open-sourced KAI Scheduler, a Kubernetes solution designed to optimise the scheduling of GPU resources.

Ben Lorica@Gradient Flow //
References: Gradient Flow, bdtechtalks.com ...
Nvidia's Dynamo is a new open-source framework designed to tackle the complexities of scaling AI inference operations. Dynamo optimizes how large language models operate across multiple GPUs, balancing individual performance with system-wide throughput. Introduced at the GPU Technology Conference, the framework was described by Nvidia CEO Jensen Huang as "the operating system of an AI factory".

The framework's components are designed to function as an "air traffic control system" for AI processing, working with inference libraries such as TensorRT-LLM and SGLang, which provide efficient mechanisms for token generation, memory management, and batch processing to improve throughput and reduce latency when serving AI models. Separately, Nvidia's Hymba small language model combines transformers and state-space models to reduce costs and increase speed while maintaining accuracy.

Recommended read:
References :
  • Gradient Flow: Diving into Nvidia Dynamo: AI Inference at Scale
  • bdtechtalks.com: Nvidia’s Hymba is an efficient SLM that combines state-space models and transformers
  • MarkTechPost: NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models LLMs can be Effectively Parallelized

Ryan Daws@AI News //
References: AI News, BigDATAwire, NVIDIA Newsroom ...
NVIDIA has launched Dynamo, an open-source inference software, designed to accelerate and scale reasoning models within AI factories. Dynamo succeeds the NVIDIA Triton Inference Server, representing a new generation of AI inference software specifically engineered to maximize token revenue generation for AI factories deploying reasoning AI models. The software orchestrates and accelerates inference communication across thousands of GPUs, utilizing disaggregated serving.

Dynamo optimizes AI factories by dynamically managing GPU resources in real time to adapt to request volumes. Its inference optimizations have been shown to boost the number of tokens generated per GPU by more than 30x, and to double the performance and revenue of AI factories serving Llama models on NVIDIA's current Hopper platform.
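Disaggregated serving splits the compute-heavy prefill phase (processing the prompt and building the KV cache) from the memory-bound decode phase (generating tokens), so the two pools can be sized and scaled independently. The sketch below is a toy illustration of that hand-off, not Dynamo's implementation; the worker bodies are hypothetical stand-ins for real engine calls.

```python
import queue, threading, time

prefill_q, decode_q = queue.Queue(), queue.Queue()

def prefill_worker():
    # Compute-bound: run the prompt through the model once and build a KV cache.
    while True:
        req = prefill_q.get()
        req["kv"] = f"kv-cache({req['prompt'][:20]}...)"   # stand-in for real KV state
        decode_q.put(req)                                  # hand off to the decode pool

def decode_worker():
    # Memory-bandwidth-bound: generate tokens one at a time from the KV cache.
    while True:
        req = decode_q.get()
        print(f"request {req['id']}: decoding {req['max_tokens']} tokens from {req['kv']}")

# Independently sized pools, mimicking disaggregated prefill/decode GPU groups.
for _ in range(2):
    threading.Thread(target=prefill_worker, daemon=True).start()
for _ in range(4):
    threading.Thread(target=decode_worker, daemon=True).start()

for i in range(3):
    prefill_q.put({"id": i, "prompt": "Explain KV-cache transfer between GPU pools", "max_tokens": 64})

time.sleep(0.5)   # give the daemon threads time to drain the queues
```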

Recommended read:
References :
  • AI News: NVIDIA Dynamo: Scaling AI inference with open-source efficiency
  • BigDATAwire: At its GTC event in San Jose today, Nvidia unveiled updates to its AI infrastructure portfolio, including its next-generation datacenter GPU, the NVIDIA Blackwell Ultra.
  • AIwire: Nvidia’s DGX AI Systems Are Faster and Smarter Than Ever
  • NVIDIA Newsroom: NVIDIA Blackwell Powers Real-Time AI for Entertainment Workflows
  • MarkTechPost: Details the Open Sourcing of Dynamo

@tomshardware.com //
Nvidia has unveiled its next-generation data center GPU, the Blackwell Ultra, at its GTC event in San Jose. Expanding on the Blackwell architecture, the Blackwell Ultra GPU will be integrated into the DGX GB300 and DGX B300 systems. The DGX GB300 system, designed with a rack-scale, liquid-cooled architecture, is powered by the Grace Blackwell Ultra Superchip, combining 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell Ultra GPUs. The Blackwell Ultra B300 packs up to 288GB of HBM3e memory and offers 1.5x the compute potential of the existing B200.

The Blackwell Ultra GPU promises a 70x speedup in AI inference and reasoning compared to the previous Hopper-based generation. This improvement is achieved through hardware and networking advancements in the DGX GB300 system. Blackwell Ultra is designed to meet the demand for test-time scaling inference with a 1.5X increase in the FP4 compute. Nvidia's CEO, Jensen Huang, suggests that the new Blackwell chips render the previous generation obsolete, emphasizing the significant leap forward in AI infrastructure.

Recommended read:
References :
  • AIwire: Nvidia’s DGX AI Systems Are Faster and Smarter Than Ever
  • www.tomshardware.com: Nvidia officially revealed its Blackwell Ultra B300 data center GPU, which packs up to 288GB of HBM3e memory and offers 1.5X the compute potential of the existing B200 solution.
  • BigDATAwire: Nvidia's GTC 2025 conference showcased the new Blackwell Ultra GPUs and updates to its AI infrastructure portfolio.
  • www.laptopmag.com: Blackwell Ultra and Rubin Ultra are Nvidia's newest additions to the growing list of AI superchips
  • BigDATAwire: Nvidia used its GTC conference today to introduce new GPU superchips, including the second generation of its current Grace Blackwell chip, as well as the next generation, dubbed Vera Rubin.
  • venturebeat.com: Nvidia's GTC 2025 keynote highlighted advancements in AI infrastructure, featuring the Blackwell Ultra GB300 chips.
  • Analytics Vidhya: An overview of Nvidia's GTC 2025 announcements, including new GPUs and advancements in AI hardware.
  • AI News: NVIDIA Dynamo: Scaling AI inference with open-source efficiency
  • www.tomshardware.com: Nvidia unveils DGX Station workstation PCs with GB300 Blackwell Ultra inside
  • BigDATAwire: Nvidia Preps for 100x Surge in Inference Workloads, Thanks to Reasoning AI Agents
  • Data Phoenix: Nvidia introduces the Blackwell Ultra to support the rise of AI reasoning, agents, and physical AI
  • The Next Platform: This article discusses Nvidia's new advancements in AI, and how the company is looking to capture market share and the challenges they face.

Jaime Hampton@AIwire //
References: venturebeat.com, AIwire ...
Cerebras Systems is expanding its role in AI inference with a new partnership with Hugging Face and the launch of six new AI datacenters across North America and Europe. The partnership with Hugging Face integrates Cerebras' inference capabilities into the Hugging Face Hub, granting access to the platform's five million developers. This integration allows developers to use Cerebras as their inference provider for models like Llama 3.3 70B, powered by the Cerebras CS-3 systems.

Cerebras is also launching six new AI inference datacenters located across North America and Europe. Once fully operational, these centers are expected to significantly increase Cerebras' capacity to handle high-speed inference workloads, supporting over 40 million Llama 70B tokens per second. The expansion includes facilities in Dallas, Minneapolis, Oklahoma City, Montreal, New York and France, with 85% of the total capacity located in the United States.
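On the developer side, the integration is exposed through Hugging Face's inference-provider interface. The sketch below assumes the provider identifier "cerebras" and the Llama 3.3 70B model ID; both should be checked against the Hub's current documentation before use.

```python
# Sketch of calling a Hub-hosted model through a third-party inference provider.
# The provider id and model name are assumptions; verify them in the Hub docs.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="cerebras", api_key="hf_...")   # HF token or provider key

completion = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "One sentence on wafer-scale inference, please."}],
    max_tokens=96,
)
print(completion.choices[0].message.content)
```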

Recommended read:
References :
  • venturebeat.com: Cerebras just announced 6 new AI datacenters that process 40M tokens per second — and it could be bad news for Nvidia
  • AIwire: Cerebras Scales AI Inference with Hugging Face Partnership and Datacenter Expansion
  • THE DECODER: Nvidia rival Cerebras opens six data centers for rapid AI inference