staff@insideAI News
//
NVIDIA has reportedly set a new AI inference record, achieving over 1,000 tokens per second (TPS) per user with Meta's Llama 4 Maverick large language model. The result was achieved on NVIDIA's DGX B200 node, which is equipped with eight Blackwell GPUs, and was independently measured by the AI benchmarking service Artificial Analysis. NVIDIA's Blackwell architecture offers substantial improvements in processing power, enabling faster inference for large language models.
The record was reached through extensive software optimization, including TensorRT-LLM and a speculative decoding draft model trained with EAGLE-3 techniques; together, these optimizations delivered a 4x speed-up over Blackwell's previous best results. NVIDIA also leveraged FP8 data types for GEMM, Mixture of Experts (MoE), and attention operations to reduce the model's memory footprint and capitalize on the high FP8 throughput of Blackwell Tensor Cores, and the company claims that FP8 accuracy matches BF16 across many of the metrics measured by Artificial Analysis. At its highest-throughput configuration, NVIDIA reports that Blackwell reaches 72,000 TPS per server. This milestone underscores the considerable progress in AI inference made through NVIDIA's hardware and software innovations, clearing the way for more efficient and responsive AI applications.
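Speculative decoding is the core trick behind much of that speed-up: a small draft model proposes several tokens at once, and the large target model verifies them in a single pass. The sketch below illustrates only the general draft-and-verify loop; it is not NVIDIA's TensorRT-LLM or EAGLE-3 implementation, and the draft_model/target_model objects and their methods are hypothetical stand-ins.

```python
# Minimal sketch of draft-and-verify speculative decoding (the general idea
# EAGLE-style draft models build on). NOT NVIDIA's implementation; the
# draft_model/target_model interfaces below are hypothetical stand-ins.

def speculative_decode(target_model, draft_model, context, k=4, max_new_tokens=128):
    """Let a cheap draft model propose k tokens at a time; the large target
    model verifies the whole proposal in one forward pass."""
    out = list(context)
    while len(out) - len(context) < max_new_tokens:
        # 1) Draft: the small model proposes k candidate tokens autoregressively.
        proposals, draft_ctx = [], list(out)
        for _ in range(k):
            tok = draft_model.next_token(draft_ctx)
            proposals.append(tok)
            draft_ctx.append(tok)

        # 2) Verify: the target model checks all k proposals at once and
        #    accepts the longest prefix it agrees with.
        accepted = target_model.verify(out, proposals)
        out.extend(accepted)

        # 3) On a rejection, the target model supplies the corrective token,
        #    so the loop always makes progress.
        if len(accepted) < len(proposals):
            out.append(target_model.next_token(out))
    return out
```

Because a whole batch of drafted tokens costs only one target-model pass to verify, per-user TPS climbs whenever the draft model's guesses are usually right.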
@cloud.google.com
//
Google Cloud is enhancing its AI Hypercomputer to accelerate AI inference workloads, focusing on maximizing performance and reducing costs for generative AI applications. At Google Cloud Next 25, Google shared updates to AI Hypercomputer's inference capabilities and showcased Ironwood, its newest Tensor Processing Unit (TPU), designed specifically for inference. Software enhancements include simple and performant inference using vLLM on TPU and the latest GKE inference capabilities, such as GKE Inference Gateway and GKE Inference Quickstart. With AI Hypercomputer, Google says it is paving the way for the next phase of AI's rapid evolution.
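For a sense of what "simple and performant inference using vLLM" looks like from the developer's side, here is a minimal sketch using vLLM's offline Python API. The model id and sampling settings are illustrative assumptions, and running on TPU requires a vLLM build with TPU support plus the deployment configuration described in Google's GKE guides rather than this bare script.

```python
# Minimal sketch of serving an LLM with vLLM's offline Python API.
# Model id and sampling settings are illustrative; TPU execution needs a
# vLLM build with TPU support and the appropriate runtime configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf")        # assumed model id
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain disaggregated serving in two sentences."], params)
for out in outputs:
    print(out.outputs[0].text)
```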
Google's JetStream inference engine incorporates new performance optimizations, integrating Pathways for ultra-low-latency, multi-host, disaggregated serving. The sixth-generation Trillium TPU delivers 2.9x higher throughput for Llama 2 70B and 2.8x higher throughput for Mixtral 8x7B compared with TPU v5e, and JetStream now reaches 1,703 tokens/s on Llama 3.1 405B on Trillium. Google's JAX-based inference stack aims to maximize performance and reduce inference costs by offering more choice when serving LLMs on TPU. Separately, Google is intensifying its efforts to combat online scams by integrating artificial intelligence across Search, Chrome, and Android. AI is central to Google's anti-scam strategy, blocking hundreds of millions of scam results daily and identifying more fraudulent pages, while Gemini Nano provides instant detection of high-risk websites, helping counter new and evolving scams across platforms. Google has long used AI to detect and block scams, including fake tech support, fraudulent financial services, and phishing links, and recent updates to its AI classifiers allow the company to detect 20 times more scam pages, improving the quality of search results by reducing exposure to harmful sites.
staff@insideAI News
//
References: insideAI News, Ken Yeung
Meta is partnering with Cerebras to enhance AI inference speeds within Meta's new Llama API. This collaboration combines Meta's open-source Llama models with Cerebras' specialized inference technology, aiming to provide developers with significantly faster performance. According to Cerebras, developers building on the Llama 4 Cerebras model within the API can expect speeds up to 18 times quicker than traditional GPU-based solutions. This acceleration is expected to unlock new possibilities for building real-time and agentic AI applications, making complex tasks like low-latency voice interaction, interactive code generation, and real-time reasoning more feasible.
This partnership allows Cerebras to expand its reach to a broader developer audience, strengthening its existing relationship with Meta. Since launching its inference solutions in 2024, Cerebras has emphasized its ability to deliver rapid Llama inference, serving billions of tokens through its AI infrastructure. Andrew Feldman, CEO and co-founder of Cerebras, stated that the company is proud to make the Llama API the fastest inference API available, empowering developers to create AI systems previously unattainable with GPU-based inference clouds. Independent benchmarks by Artificial Analysis support this claim, indicating that Cerebras achieves significantly higher token processing speeds than platforms like ChatGPT and DeepSeek. Developers get direct access to the accelerated Llama 4 inference by selecting Cerebras within the Llama API. Meta also continues to iterate on its AI app, testing features such as a "Reasoning" mode and "Voice Personalization" designed to enhance user interaction. The "Reasoning" feature could offer more transparent explanations of the AI's responses, while voice settings like "Focus on my voice" and "Welcome message" could deliver more personalized audio interactions, which is especially relevant to Meta's hardware ambitions in areas such as smart glasses and augmented reality devices.
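As an illustration of what "selecting Cerebras within the Llama API" might look like in practice, here is a hypothetical sketch that assumes an OpenAI-compatible chat endpoint. The base URL, model id, and the absence of an explicit provider parameter are all placeholders, not confirmed details of Meta's API; consult the Llama API documentation for the real interface.

```python
# Hypothetical sketch only: assumes the Llama API exposes an OpenAI-compatible
# chat endpoint. The base URL and model id below are placeholders, and the
# real provider-selection mechanism may differ from what is shown here.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llama.example/v1",   # placeholder endpoint
    api_key="YOUR_LLAMA_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-4-maverick",                  # placeholder model id
    messages=[{"role": "user", "content": "Summarize today's AI inference news in one line."}],
)
print(resp.choices[0].message.content)
```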
staff@insideAI News
//
Google Cloud has unveiled its seventh-generation Tensor Processing Unit (TPU), named Ironwood. This custom AI accelerator is purpose-built for inference, marking a shift in Google's AI chip development strategy. While previous TPUs handled both training and inference, Ironwood is designed to optimize the deployment of trained AI models for making predictions and generating responses. According to Google, Ironwood will allow for a new "age of inference" where AI agents proactively retrieve and generate data, delivering insights and answers rather than just raw data.
Ironwood boasts impressive technical specifications. Scaled to a full pod of 9,216 liquid-cooled chips, it delivers 42.5 exaflops of computing power; each chip has a peak compute of 4,614 teraflops, 192GB of High Bandwidth Memory, and 7.2 terabytes per second of memory bandwidth. Google highlights that Ironwood delivers twice the performance per watt of its predecessor and is nearly 30 times more power-efficient than Google's first Cloud TPU from 2018. The focus on inference reflects a pivotal shift in the AI landscape: after extensive development of large foundation models, Ironwood is designed to handle the computational demands of complex "thinking models," including large language models and Mixture of Experts (MoE) models. Its architecture includes a low-latency, high-bandwidth Inter-Chip Interconnect (ICI) network to support coordinated communication at full TPU pod scale. The chip is aimed at applications requiring real-time processing and predictions, and promises higher intelligence at lower cost.
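The pod-level and per-chip numbers quoted above are mutually consistent; a quick back-of-the-envelope check using only those figures:

```python
# Sanity check: the 42.5-exaflop pod figure follows from the per-chip peak.
chips_per_pod = 9216
peak_per_chip_tflops = 4614                      # teraflops per Ironwood chip

pod_exaflops = chips_per_pod * peak_per_chip_tflops / 1e6   # 1 exaflop = 1e6 teraflops
print(round(pod_exaflops, 1))                    # -> 42.5, matching the quoted pod number
```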
staff@insideAI News
//
MLCommons has released the latest MLPerf Inference v5.0 benchmark results, highlighting the growing importance of generative AI in the machine learning landscape. The new benchmarks feature tests for large language models (LLMs) like Llama 3.1 405B and Llama 2 70B Interactive, designed to evaluate how well systems perform in real-world applications requiring agentic reasoning and low-latency responses. This shift reflects the industry's increasing focus on deploying generative AI and the need for hardware and software optimized for these demanding workloads.
The v5.0 results reveal significant performance improvements driven by advances in both hardware and software. The median submitted score for Llama 2 70B has doubled compared to a year ago, and the best score is 3.3 times faster than the best result in Inference v4.0. These gains are attributed to innovations such as support for lower-precision computation formats like FP4, which allow large models to be processed more efficiently. The MLPerf Inference benchmark suite evaluates machine learning performance in a way that is architecture-neutral, reproducible, and representative of real-world workloads.
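Part of why lower-precision formats matter comes down to simple arithmetic on weight storage. The illustrative calculation below covers model weights only; real deployments also need memory for activations, the KV cache, and quantization overhead such as scaling factors.

```python
# Rough weight-memory arithmetic for a 405B-parameter model at different
# precisions. Weights only -- activations, KV cache, and quantization
# overhead (scales, mixed-precision layers) are ignored.
params = 405e9

for fmt, bytes_per_param in [("FP16/BF16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{fmt:>9}: ~{gb:,.0f} GB of weights")
# FP16/BF16: ~810 GB, FP8: ~405 GB, FP4: ~203 GB -- lower precision means
# fewer GPUs per model copy, or more memory left over for batching.
```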
@www.intel.com
//
NVIDIA's Blackwell platform has dominated the latest MLPerf Inference V5.0 benchmarks, showcasing significant performance improvements in AI reasoning. The NVIDIA GB200 NVL72 system, featuring 72 Blackwell GPUs, achieved up to 30x higher throughput on the Llama 3.1 405B benchmark compared to the NVIDIA H200 NVL8 submission. This was driven by more than triple the performance per GPU and a 9x larger NVIDIA NVLink interconnect domain. The latest MLPerf results reflect the shift toward reasoning in AI inference.
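The roughly 30x system-level gain is approximately the product of the two factors NVIDIA cites, as a quick decomposition shows (treating "more than triple" as roughly 3.3x):

```python
# Approximate decomposition of the ~30x Llama 3.1 405B throughput gain.
per_gpu_gain = 3.3              # "more than triple" the per-GPU performance (approx.)
nvlink_domain_ratio = 72 / 8    # GB200 NVL72 vs. H200 NVL8 -> 9x larger NVLink domain

print(per_gpu_gain * nvlink_domain_ratio)   # ~29.7, i.e. on the order of 30x
```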
Alongside this achievement, NVIDIA is open-sourcing the KAI Scheduler, a Kubernetes GPU scheduling solution, as part of its commitment to open-source AI innovation. Previously a core component of the Run:ai platform, KAI Scheduler is now available under the Apache 2.0 license. This solution is designed to address the unique challenges of managing AI workloads that utilize both GPUs and CPUs. According to NVIDIA, this will help in managing fluctuating GPU demands, which traditional resource schedulers struggle to handle.
@www.intel.com
//
NVIDIA is making strides in both agentic AI and open-source initiatives. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications. Enterprises are now deploying AI agents to free human workers from time-consuming and error-prone tasks, allowing them to focus on high-value work that requires creativity and strategic thinking. NVIDIA AI Blueprints help enterprises build their own AI agents.
NVIDIA has announced the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, under the Apache 2.0 license. Originally developed within the Run:ai platform, the KAI Scheduler is now available to the community while continuing to be packaged and delivered as part of the NVIDIA Run:ai platform. It is designed to optimize the scheduling of GPU resources and tackle the challenges of managing AI workloads across GPUs and CPUs.
Ben Lorica@Gradient Flow
//
References: Gradient Flow, bdtechtalks.com
Nvidia's Dynamo is a new open-source framework designed to tackle the complexities of scaling AI inference operations. Dynamo optimizes how large language models operate across multiple GPUs, balancing individual performance with system-wide throughput. Introduced at the GPU Technology Conference, Nvidia CEO Jensen Huang has described it as "the operating system of an AI factory".
The framework's components are designed to function as an "air traffic control system" for AI processing, coordinating inference engines such as TensorRT-LLM and SGLang, which provide efficient mechanisms for token generation, memory management, and batch processing to improve throughput and reduce latency when serving AI models. Separately, Nvidia research efforts such as nGPT explore architectural changes aimed at reducing costs and increasing speed while maintaining accuracy.
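At the heart of that "air traffic control" role is disaggregated serving: prefill (prompt processing) and decode (token generation) run on separate worker pools, with a router handing requests, and their KV caches, from one pool to the other. The toy sketch below illustrates only that idea; the class and method names are invented for illustration and bear no relation to Dynamo's actual APIs.

```python
# Toy sketch of disaggregated serving: prefill and decode run on separate
# worker pools, and a router hands work from one pool to the other. The
# names here are invented for illustration; Dynamo's real components,
# scheduling, and KV-cache transfer are far more sophisticated.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    kv_cache: object = None   # stand-in for the KV cache passed between pools
    output: str = ""

class PrefillWorker:
    def run(self, req: Request) -> Request:
        req.kv_cache = f"kv({req.prompt})"     # pretend we built the KV cache
        return req

class DecodeWorker:
    def run(self, req: Request, max_tokens: int = 3) -> Request:
        for i in range(max_tokens):            # pretend autoregressive decoding
            req.output += f" tok{i}"
        return req

class Router:
    def __init__(self, prefill_pool, decode_pool):
        self.prefill_pool, self.decode_pool = prefill_pool, decode_pool
        self.turn = 0

    def handle(self, prompt: str) -> str:
        # Naive round-robin placement; a real scheduler balances load,
        # KV-cache locality, and latency targets across thousands of GPUs.
        pre = self.prefill_pool[self.turn % len(self.prefill_pool)]
        dec = self.decode_pool[self.turn % len(self.decode_pool)]
        self.turn += 1
        return dec.run(pre.run(Request(prompt))).output

router = Router([PrefillWorker()], [DecodeWorker(), DecodeWorker()])
print(router.handle("Explain disaggregated serving in one line."))
```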
Ryan Daws@AI News
//
NVIDIA has launched Dynamo, an open-source inference software, designed to accelerate and scale reasoning models within AI factories. Dynamo succeeds the NVIDIA Triton Inference Server, representing a new generation of AI inference software specifically engineered to maximize token revenue generation for AI factories deploying reasoning AI models. The software orchestrates and accelerates inference communication across thousands of GPUs, utilizing disaggregated serving.
Dynamo optimizes AI factories by dynamically managing GPU resources in real time to adapt to request volumes. Its intelligent inference optimizations have been shown to boost the number of tokens generated per GPU by more than 30 times, and to double the performance and revenue of AI factories serving Llama models on NVIDIA's current Hopper platform.
@tomshardware.com
//
Nvidia has unveiled its next-generation data center GPU, the Blackwell Ultra, at its GTC event in San Jose. Expanding on the Blackwell architecture, the Blackwell Ultra GPU will be integrated into the DGX GB300 and DGX B300 systems. The DGX GB300 system, designed with a rack-scale, liquid-cooled architecture, is powered by the Grace Blackwell Ultra Superchip, combining 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell Ultra GPUs. Nvidia officially revealed its Blackwell Ultra B300 data center GPU, which packs up to 288GB of HBM3e memory and offers 1.5X the compute potential of the existing B200 solution.
The Blackwell Ultra GPU promises a 70x speedup in AI inference and reasoning compared with the previous Hopper-based generation, an improvement achieved through hardware and networking advances in the DGX GB300 system. Blackwell Ultra is designed to meet the demands of test-time-scaling inference with a 1.5x increase in FP4 compute. Nvidia CEO Jensen Huang suggests the new Blackwell chips render the previous generation obsolete, emphasizing the significant leap forward in AI infrastructure.
Jaime Hampton@AIwire
//
References: venturebeat.com, AIwire
Cerebras Systems is expanding its role in AI inference with a new partnership with Hugging Face and the launch of six new AI datacenters across North America and Europe. The partnership with Hugging Face integrates Cerebras' inference capabilities into the Hugging Face Hub, granting access to the platform's five million developers. This integration allows developers to use Cerebras as their inference provider for models like Llama 3.3 70B, powered by the Cerebras CS-3 systems.
Cerebras is also launching six new AI inference datacenters located across North America and Europe. Once fully operational, these centers are expected to significantly increase Cerebras' capacity to handle high-speed inference workloads, supporting over 40 million Llama 70B tokens per second. The expansion includes facilities in Dallas, Minneapolis, Oklahoma City, Montreal, New York and France, with 85% of the total capacity located in the United States.
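On the Hugging Face side, inference providers are exposed through the huggingface_hub client library, so selecting Cerebras can be as simple as passing a provider name. The sketch below is a minimal example under that assumption; the exact provider string, model id, and token handling should be checked against the Hub's documentation.

```python
# Minimal sketch of routing a chat request to Cerebras through Hugging Face's
# inference-provider integration. The provider string and model id are
# assumptions -- verify them against the Hub documentation before use.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="cerebras", api_key="hf_...")  # HF token or provider key

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "In one sentence, what is wafer-scale inference?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```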