Noah Kravitz@NVIDIA Blog
//
NVIDIA is making strides in both agentic AI and open-source initiatives. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications. Enterprises are now deploying AI agents to free human workers from time-consuming and error-prone tasks, allowing them to focus on high-value work that requires creativity and strategic thinking. NVIDIA AI Blueprints help enterprises build their own AI agents.
NVIDIA has also announced the open-source release of the KAI Scheduler, a Kubernetes-native GPU scheduling solution, under the Apache 2.0 license. Originally developed within the Run:ai platform, the scheduler is now available to the community while continuing to be packaged and delivered as part of NVIDIA Run:ai. It is designed to optimize the scheduling of GPU resources and to tackle the challenges of managing AI workloads across GPUs and CPUs.
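For teams evaluating the scheduler, the sketch below shows roughly what submitting a GPU workload to it could look like from the Kubernetes Python client. The scheduler name "kai-scheduler", the queue label, and the container image are illustrative assumptions; check the project's documentation for the exact values your deployment expects.

```python
# Sketch: ask Kubernetes to schedule a single-GPU pod via the KAI Scheduler.
# The scheduler name and queue label below are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="train-job-0",
        labels={"kai.scheduler/queue": "team-a"},  # hypothetical queue label
    ),
    spec=client.V1PodSpec(
        scheduler_name="kai-scheduler",  # hand scheduling decisions to KAI
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.08-py3",  # illustrative image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # request one GPU
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```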
@tomshardware.com
//
Nvidia has unveiled its next-generation data center GPU, the Blackwell Ultra, at its GTC event in San Jose. Expanding on the Blackwell architecture, the Blackwell Ultra GPU will be integrated into the DGX GB300 and DGX B300 systems. The DGX GB300 system, designed with a rack-scale, liquid-cooled architecture, is powered by the Grace Blackwell Ultra Superchip, which combines 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell Ultra GPUs. The Blackwell Ultra B300 data center GPU packs up to 288GB of HBM3e memory and offers 1.5X the compute potential of the existing B200 solution.
The Blackwell Ultra GPU promises a 70x speedup in AI inference and reasoning compared to the previous Hopper-based generation, an improvement achieved through hardware and networking advancements in the DGX GB300 system. Blackwell Ultra is designed to meet the demand for test-time scaling inference with a 1.5X increase in FP4 compute. Nvidia CEO Jensen Huang suggests that the new Blackwell chips render the previous generation obsolete, emphasizing the significant leap forward in AI infrastructure.
Harsh Mishra@Analytics Vidhya
//
DeepSeek AI has been making significant contributions to the open-source community, particularly in the realm of AI model efficiency and accessibility. They recently launched the Fire-Flyer File System (3FS), a high-performance distributed file system tailored for AI training and inference workloads. This system is designed to address the challenges of managing large-scale, concurrent data access, a common bottleneck in traditional file systems. 3FS leverages modern SSDs and RDMA networks, offering a shared storage layer that facilitates the development of distributed applications by bypassing limitations seen in more traditional, locality-dependent file systems.
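As a conceptual illustration of what a shared storage layer buys you, the sketch below has many threads stream training shards from a single shared mount point rather than staging data on local disk. The mount path and shard layout are hypothetical, and it assumes 3FS (or any shared file system) is exposed as a POSIX namespace, for example through a FUSE client.

```python
# Conceptual sketch: many workers stream shards concurrently from one
# shared namespace instead of copying data to local disk first.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SHARED_MOUNT = Path("/mnt/3fs/datasets/tokens")  # hypothetical mount point

def read_shard(shard: Path) -> int:
    # A real loader would parse the shard; here we just count bytes.
    return len(shard.read_bytes())

shards = sorted(SHARED_MOUNT.glob("shard-*.bin"))
with ThreadPoolExecutor(max_workers=16) as pool:
    total_bytes = sum(pool.map(read_shard, shards))
print(f"read {total_bytes} bytes across {len(shards)} shards")
```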
DeepSeek's commitment extends to data processing and model optimization. They have introduced the Smallpond framework for data processing and released quantized DeepSeek-R1 models optimized for deployment-ready reasoning tasks. The quantized models, covering Llama-8B, Llama-70B, Qwen-1.5B, Qwen-7B, Qwen-14B, and Qwen-32B variants, are available as a Hugging Face collection with evaluations, benchmarks, and setup instructions. These models maintain competitive reasoning accuracy while unlocking significant inference speedups.
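A minimal sketch of pulling one of the public DeepSeek-R1 distillations from Hugging Face with transformers is shown below; the repo id is one of the published distilled checkpoints, and you would swap in the specific quantized variant from the collection mentioned above.

```python
# Sketch: load a distilled DeepSeek-R1 checkpoint and run a short generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small distilled variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain in two sentences why quantization speeds up inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```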
staff@insideAI News
//
MLCommons has released the latest MLPerf Inference v5.0 benchmark results, highlighting the growing importance of generative AI in the machine learning landscape. The new benchmarks feature tests for large language models (LLMs) like Llama 3.1 405B and Llama 2 70B Interactive, designed to evaluate how well systems perform in real-world applications requiring agentic reasoning and low-latency responses. This shift reflects the industry's increasing focus on deploying generative AI and the need for hardware and software optimized for these demanding workloads.
The v5.0 results reveal significant performance improvements driven by advancements in both hardware and software. The median submitted score for Llama 2 70B has doubled compared to a year ago, and the best score is 3.3 times faster than the best result from Inference v4.0. These gains are attributed to innovations such as support for lower-precision computation formats like FP4, which allow for more efficient processing of large models. The MLPerf Inference benchmark suite evaluates machine learning performance in a way that is architecture-neutral, reproducible, and representative of real-world workloads.
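To make the benchmark mechanics concrete, the sketch below shows roughly how MLPerf's LoadGen drives a system under test (SUT): LoadGen generates queries according to a scenario's traffic model and measures latency and throughput, while the SUT only has to answer them. This dummy SUT completes every query immediately; the LoadGen Python bindings (mlperf_loadgen) are assumed to be installed, and exact call signatures can differ between LoadGen releases.

```python
# Sketch: a dummy system under test driven by MLPerf LoadGen in the Server
# scenario. Signatures follow recent loadgen Python bindings and may vary.
import mlperf_loadgen as lg

def issue_queries(query_samples):
    # A real SUT would run inference here; we answer with empty responses.
    lg.QuerySamplesComplete(
        [lg.QuerySampleResponse(q.id, 0, 0) for q in query_samples]
    )

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Server      # latency-bounded serving
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(1024, 1024, lambda s: None, lambda s: None)

lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```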
Ryan Daws@AI News
//
NVIDIA has launched Dynamo, open-source inference software designed to accelerate and scale reasoning models within AI factories. Dynamo succeeds the NVIDIA Triton Inference Server, representing a new generation of AI inference software engineered to maximize token revenue for AI factories deploying reasoning models. The software orchestrates and accelerates inference communication across thousands of GPUs using disaggregated serving.
Dynamo optimizes AI factories by dynamically managing GPU resources in real time to adapt to request volumes. Its intelligent inference optimizations have been shown to boost the number of tokens generated per GPU by more than 30 times, and the software has demonstrated the ability to double the performance and revenue of AI factories serving Llama models on NVIDIA's current Hopper platform.
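As a rough sketch of what consuming a Dynamo deployment could look like from the client side, the example below posts a chat request to an OpenAI-compatible HTTP endpoint. The URL, port, endpoint path, and model name are placeholders, and the assumption that the deployed frontend speaks the OpenAI API should be checked against the Dynamo documentation.

```python
# Sketch: call a Dynamo-served model through an assumed OpenAI-compatible
# endpoint. Host, port, path, and model name are placeholders.
import requests

DYNAMO_URL = "http://localhost:8000/v1/chat/completions"  # placeholder

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Summarize disaggregated serving in one line."}
    ],
    "max_tokens": 128,
}

resp = requests.post(DYNAMO_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```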
Jaime Hampton@AIwire
//
References: venturebeat.com, AIwire
Cerebras Systems is expanding its role in AI inference with a new partnership with Hugging Face and the launch of six new AI datacenters across North America and Europe. The partnership with Hugging Face integrates Cerebras' inference capabilities into the Hugging Face Hub, granting access to the platform's five million developers. This integration allows developers to use Cerebras as their inference provider for models like Llama 3.3 70B, powered by the Cerebras CS-3 systems.
The six new AI inference datacenters are located across North America and Europe. Once fully operational, they are expected to significantly increase Cerebras' capacity to handle high-speed inference workloads, supporting over 40 million Llama 70B tokens per second. The expansion includes facilities in Dallas, Minneapolis, Oklahoma City, Montreal, New York, and France, with 85% of the total capacity located in the United States.
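A minimal sketch of what that integration could look like from the developer side is shown below, assuming a recent huggingface_hub release in which Cerebras is selectable as an inference provider and an HF token is configured in the environment.

```python
# Sketch: route a chat completion for Llama 3.3 70B through the Cerebras
# provider on the Hugging Face Hub. Assumes HF_TOKEN is set and that the
# installed huggingface_hub version supports provider="cerebras".
from huggingface_hub import InferenceClient

client = InferenceClient(provider="cerebras")

response = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "One sentence on wafer-scale inference."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```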
Ben Lorica@Gradient Flow
//
References: Gradient Flow, bdtechtalks.com
Nvidia's Dynamo is a new open-source framework designed to tackle the complexities of scaling AI inference operations. Dynamo optimizes how large language models operate across multiple GPUs, balancing individual GPU performance against system-wide throughput. Introduced at the GPU Technology Conference, the framework has been described by Nvidia CEO Jensen Huang as "the operating system of an AI factory".
The framework's components are designed to function as an "air traffic control system" for AI processing, and Dynamo works with inference libraries such as TensorRT-LLM and SGLang, which provide efficient mechanisms for token generation, memory management, and batch processing to improve throughput and reduce latency when serving AI models. Separately, Nvidia's nGPT combines transformers and state-space models to reduce costs and increase speed while maintaining accuracy.
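To illustrate the kind of decision such routing components make (a toy example, not Dynamo's code), the sketch below sends each request to the worker that already holds the longest matching prompt prefix in its KV cache, so cached attention state can be reused rather than recomputed.

```python
# Toy illustration (not Dynamo's implementation): route each request to the
# worker whose cached token prefix overlaps the new prompt the most.
from typing import Dict, List

def common_prefix_len(a: List[int], b: List[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt_tokens: List[int], worker_caches: Dict[str, List[int]]) -> str:
    # Pick the worker that can reuse the most cached attention state.
    return max(
        worker_caches,
        key=lambda w: common_prefix_len(prompt_tokens, worker_caches[w]),
    )

workers = {"gpu-0": [1, 2, 3, 4], "gpu-1": [1, 2, 9], "gpu-2": []}
print(route([1, 2, 3, 4, 5, 6], workers))  # -> "gpu-0"
```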