News from the AI & ML world

DeeperML - #inference

staff@insideAI News //
Google Cloud has unveiled its seventh-generation Tensor Processing Unit (TPU), named Ironwood, at the recent Google Cloud Next 2025 conference. The new custom AI accelerator is designed specifically for inference workloads, marking a shift in Google's AI chip development strategy. Ironwood aims to meet the growing demands of "thinking models" like Gemini 2.5, reflecting the industry-wide move from model training to inference. According to Amin Vahdat, Google's Vice President and General Manager of ML, Systems, and Cloud AI, the goal is to usher in an "age of inference" in which AI agents proactively retrieve and generate data to deliver insights rather than raw data.

Ironwood's technical specifications are impressive, offering substantial computational power and efficiency. Scaled to a pod of 9,216 chips, it delivers 42.5 exaflops of compute, which Google says is more than 24 times the output of El Capitan, the world's fastest supercomputer. Each Ironwood chip offers a peak compute of 4,614 teraflops, 192GB of High Bandwidth Memory (HBM), and memory bandwidth of 7.2 terabits per second. To handle the communication demands of modern AI, each pod is tied together by Inter-Chip Interconnect (ICI) networking that Google describes as spanning nearly 10 MW.
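
As a quick sanity check, the pod-level number follows directly from the per-chip figures. The sketch below assumes the pod total is simply chip count times per-chip peak, with no scaling losses:

    # Back-of-the-envelope check of the quoted pod figures.
    chips_per_pod = 9_216
    peak_tflops_per_chip = 4_614   # per-chip peak compute, in teraflops
    hbm_gb_per_chip = 192          # per-chip HBM capacity, in GB

    pod_exaflops = chips_per_pod * peak_tflops_per_chip / 1e6  # teraflops -> exaflops
    pod_hbm_pb = chips_per_pod * hbm_gb_per_chip / 1e6         # GB -> PB

    print(f"pod compute: {pod_exaflops:.1f} exaflops")  # -> 42.5
    print(f"pod HBM:     {pod_hbm_pb:.2f} PB")          # -> 1.77

The multiplication reproduces the quoted 42.5 exaflops and implies roughly 1.77 PB of HBM across a full pod.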

This focus on inference is a response to the evolving AI landscape where proactive AI agents are becoming more prevalent. Ironwood is engineered to minimize data movement and latency on-chip while executing massive tensor manipulations, crucial for handling large language models and advanced reasoning tasks. Google emphasizes that Ironwood offers twice the performance per watt compared to its predecessor, Trillium, and is nearly 30 times more power efficient than Google’s first Cloud TPU from 2018, addressing the critical need for power efficiency in modern data centers.

Recommended read:
References:
  • insideAI News: Google Launches ‘Ironwood’ 7th Gen TPU for Inference
  • venturebeat.com: Google's new Ironwood chip is 24x more powerful than the world’s fastest supercomputer
  • www.bigdatawire.com: Google Cloud Preps for Agentic AI Era with ‘Ironwood’ TPU, New Models and Software
  • The Next Platform: With “Ironwood” TPU, Google Pushes The AI Accelerator To The Floor
  • www.itpro.com: Google Cloud Next 2025: Targeting easy AI

staff@insideAI News //
Google Cloud has unveiled its seventh-generation Tensor Processing Unit (TPU), named Ironwood. This custom AI accelerator is purpose-built for inference, marking a shift in Google's AI chip development strategy. While previous TPUs handled both training and inference, Ironwood is designed to optimize the deployment of trained AI models for making predictions and generating responses. According to Google, Ironwood will allow for a new "age of inference" where AI agents proactively retrieve and generate data, delivering insights and answers rather than just raw data.

Ironwood boasts impressive technical specifications. When scaled to 9,216 chips per pod, it delivers 42.5 exaflops of computing power. Each chip has a peak compute of 4,614 teraflops, accompanied by 192GB of High Bandwidth Memory. The memory bandwidth reaches 7.2 terabits per second per chip. Google highlights that Ironwood delivers twice the performance per watt compared to its predecessor and is nearly 30 times more power-efficient than Google's first Cloud TPU from 2018.

The focus on inference highlights a pivotal shift in the AI landscape. The industry has seen extensive development of large foundation models, and Ironwood is designed to manage the computational demands of these complex "thinking models," including large language models and Mixture of Experts (MoEs). Its architecture includes a low-latency, high-bandwidth Inter-Chip Interconnect (ICI) network to support coordinated communication at full TPU pod scale. The new TPU scales up to 9,216 liquid-cooled chips. This innovation is aimed at applications requiring real-time processing and predictions, and promises higher intelligence at lower costs.
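
To make the communication pressure concrete, here is a minimal, illustrative sketch of top-k Mixture-of-Experts routing in Python with NumPy (a generic toy, not Google's implementation). Each token is processed only by its selected experts; on real hardware those experts live on different chips, so every routed token crosses the interconnect on the way out and back:

    # Toy top-k MoE routing; all sizes are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    num_tokens, d_model, num_experts, top_k = 8, 16, 4, 2

    x = rng.standard_normal((num_tokens, d_model))        # token activations
    w_gate = rng.standard_normal((d_model, num_experts))  # router weights
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]

    logits = x @ w_gate
    chosen = np.argsort(logits, axis=1)[:, -top_k:]       # top-k experts per token
    gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    y = np.zeros_like(x)
    for t in range(num_tokens):
        for e in chosen[t]:
            # On a real pod this matmul runs on whichever chip holds expert e,
            # so x[t] crosses the ICI network here and the result crosses back.
            y[t] += gates[t, e] * (x[t] @ experts[e])

This per-token scatter/gather is why MoE serving rewards exactly the kind of low-latency, pod-scale interconnect described above.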

Recommended read:
References:
  • insidehpc.com: Google Cloud today introduced its seventh-generation Tensor Processing Unit, "Ironwood," which the company said is its most performant and scalable custom AI accelerator and the first designed specifically for inference.
  • www.bigdatawire.com: Google Cloud Preps for Agentic AI Era with ‘Ironwood’ TPU, New Models and Software
  • insideAI News: Google today introduced its seventh-generation Tensor Processing Unit, “Ironwood,” which the company said is its most performant and scalable custom AI accelerator and the first designed specifically for inference.
  • venturebeat.com: Google's new Ironwood chip is 24x more powerful than the world's fastest supercomputer.
  • the-decoder.com: Google unveils new AI models, infrastructure, and agent protocol at Cloud Next
  • AI News | VentureBeat: Google’s new Agent Development Kit lets enterprises rapidly prototype and deploy AI agents without recoding
  • The Next Platform: With “Ironwood” TPU, Google Pushes The AI Accelerator To The Floor
  • Ken Yeung: Google Pushes Agent Interoperability With New Dev Kit and Agent2Agent Standard
  • The Tech Basic: Details on Google Cloud's New AI Chip
  • venturebeat.com: Google unveils Ironwood TPUs, Gemini 2.5 "thinking models," and Agent2Agent protocol at Cloud Next '25, challenging Microsoft and Amazon with a comprehensive AI strategy that enables multiple AI systems to work together across platforms.
  • www.marktechpost.com: Google AI Introduces Ironwood: A Google TPU Purpose-Built for the Age of Inference
  • cloud.google.com: Introducing Ironwood TPUs and new innovations in AI Hypercomputer
  • Kyle Wiggers: Ironwood is Google’s newest AI accelerator chip

staff@insideAI News //
MLCommons has released the latest MLPerf Inference v5.0 benchmark results, highlighting the growing importance of generative AI in the machine learning landscape. The new benchmarks feature tests for large language models (LLMs) like Llama 3.1 405B and Llama 2 70B Interactive, designed to evaluate how well systems perform in real-world applications requiring agentic reasoning and low-latency responses. This shift reflects the industry's increasing focus on deploying generative AI and the need for hardware and software optimized for these demanding workloads.

The v5.0 results reveal significant performance improvements driven by advancements in both hardware and software. The median submitted score for Llama 2 70B has doubled compared to a year ago, and the best score is 3.3 times faster than Inference v4.0. These gains are attributed to innovations like support for lower-precision computation formats such as FP4, which allows for more efficient processing of large models. The MLPerf Inference benchmark suite evaluates machine learning performance in a way that is architecture-neutral, reproducible, and representative of real-world workloads.
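
For intuition on why lower precision helps, the toy quantizer below collapses float weights to 16 signed levels plus one scale factor, cutting weight memory (and memory traffic) by 4x versus FP16 at the cost of a small rounding error. Note this integer-style scheme is a stand-in for FP4, not any submitter's actual pipeline:

    # Toy symmetric 4-bit weight quantization (illustrative only).
    import numpy as np

    def quantize_4bit(w):
        """Map float weights to signed levels in [-8, 7] plus a scale."""
        scale = np.abs(w).max() / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
    q, s = quantize_4bit(w)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"mean absolute rounding error: {err:.4f}")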

Recommended read:
References:
  • insideAI News: Today, MLCommons announced new results for its MLPerf Inference v5.0 benchmark suite, which benchmarks machine learning (ML) system performance. The organization said the results highlight that the AI community is focusing on generative AI ...
  • AIwire: MLPerf v5.0 Reflects the Shift Toward Reasoning in AI Inference
  • ServeTheHome: The new MLPerf Inference v5.0 results are out, with new submissions for configurations from NVIDIA, Intel Xeon, and AMD Instinct MI325X.
  • insidehpc.com: MLCommons Releases MLPerf Inference v5.0 Benchmark Results
  • www.networkworld.com: New MLCommons benchmarks to test AI infrastructure performance
  • SLVIKI.ORG: MLCommons Launches Next-Gen AI Benchmarks to Test the Limits of Generative Intelligence

Ellie Ramirez-Camara@Data Phoenix //
Nvidia's GTC 2025 event showcased the company's latest advancements in AI computing. A key highlight was the introduction of the Blackwell Ultra platform, designed to support the growing demands of AI reasoning, agentic AI, and physical AI applications. This next-generation platform builds upon the Blackwell architecture and includes the GB300 NVL72 rack-scale solution and the HGX B300 NVL16 system.

The Blackwell Ultra platform promises significantly enhanced AI computing power, with the GB300 NVL72 delivering 1.5x more AI performance than its predecessor and increasing revenue opportunities for AI factories by 50x. Major cloud providers and server manufacturers are expected to offer Blackwell Ultra-based products in the second half of 2025. Supporting this hardware is the new NVIDIA Dynamo open-source inference framework, which optimizes reasoning AI services across thousands of GPUs.

Recommended read:
References:
  • NVIDIA Newsroom: Innovation to Impact: How NVIDIA Research Fuels Transformative Work in AI, Graphics and Beyond
  • Data Phoenix: Nvidia introduces the Blackwell Ultra to support the rise of AI reasoning, agents, and physical AI
  • insideAI News: @HPCpodcast: Live from GTC 2025, Among the Crowds for the New AI Compute Landscape
  • Gradient Flow: Nvidia’s AI Vision: GTC 2025 and the Road Ahead
  • AIwire: Jensen Huang Charts Nvidia’s AI-Powered Future
  • BigDATAwire: Reporter’s Notebook: AI Hype and Glory at Nvidia GTC 2025
  • John Werner: Jensen Huang Hypes New Chips In Keynote