News from the AI & ML world
staff@insideAI News
NVIDIA has reportedly set a new AI world record, achieving over 1,000 tokens per second (TPS) per user with Meta's Llama 4 Maverick large language model. The result was achieved on NVIDIA's DGX B200 node, which is equipped with eight Blackwell GPUs, and was independently measured by the AI benchmarking service Artificial Analysis. NVIDIA's Blackwell architecture offers substantial improvements in processing power, enabling faster inference for large language models.
This record-breaking result was achieved through extensive software optimizations, including TensorRT-LLM and a speculative decoding draft model trained using EAGLE-3 techniques. Together, these optimizations yielded a 4x performance increase over Blackwell's previous best results. NVIDIA also used FP8 data types for GEMMs, Mixture of Experts (MoE), and Attention operations to reduce model size and capitalize on the high FP8 throughput of Blackwell Tensor Cores. The company claims that accuracy with the FP8 data format matches that of BF16 across many metrics in Artificial Analysis evaluations.
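The speculative decoding mentioned above is the key latency optimization: a small, cheap draft model proposes several tokens ahead, and the large target model verifies them in one pass, so multiple tokens can be accepted per expensive forward step. The sketch below illustrates the greedy variant of this idea with toy stand-in models; `draft_next` and `target_next` are hypothetical placeholders, not NVIDIA's EAGLE-3 drafter or Llama 4 Maverick.

```python
import random

VOCAB = list(range(16))  # toy vocabulary


def draft_next(context):
    """Cheap draft model: hypothetical stand-in for an EAGLE-3-style drafter."""
    return random.Random(sum(context)).choice(VOCAB)


def target_next(context):
    """Expensive target model: hypothetical stand-in for the full LLM's greedy choice."""
    return (sum(context) * 7 + 3) % len(VOCAB)


def speculative_decode(prompt, num_tokens, draft_len=4):
    """Greedy speculative decoding: the draft proposes draft_len tokens,
    the target verifies them in order; the accepted prefix is kept, and the
    first mismatch is replaced by the target's own token."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # 1. Draft model proposes a short continuation.
        ctx = list(out)
        proposal = []
        for _ in range(draft_len):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies each proposed token in turn.
        for t in proposal:
            if len(out) - len(prompt) >= num_tokens:
                break
            expected = target_next(out)
            if t == expected:
                out.append(t)        # accepted: token gained without a full target step
            else:
                out.append(expected) # rejected: fall back to the target's token
                break
    return out[len(prompt):]


tokens = speculative_decode([1, 2, 3], num_tokens=8)
print(len(tokens))  # 8
```

A useful property of the greedy variant is that the output is identical to what the target model alone would generate; the draft model only changes how many tokens the target can commit per verification pass, which is where the latency win comes from.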
At its highest-throughput configuration, NVIDIA reports that Blackwell reaches 72,000 TPS per server. This milestone underscores the considerable progress in AI inference made possible by NVIDIA's hardware and software innovations, clearing the way for more efficient and responsive AI applications.
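Note that the two headline numbers describe different operating points: 1,000 TPS/user is a low-latency configuration, while 72,000 TPS/server is a batched, throughput-maximizing one. The per-GPU split below is our own back-of-envelope arithmetic, not a figure NVIDIA reports.

```python
# Reported figures (the per-GPU breakdown is an illustrative assumption)
tps_per_server = 72_000   # tokens/s per server, highest-throughput configuration
gpus_per_node = 8         # Blackwell GPUs in one DGX B200 node

tps_per_gpu = tps_per_server / gpus_per_node
print(tps_per_gpu)  # 9000.0
```

Dividing server throughput by GPU count gives roughly 9,000 TPS per GPU at the batched operating point, which is not directly comparable to the 1,000 TPS/user latency-optimized result.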
References:
- insideAI News: Details of AI Inference: NVIDIA Reports Blackwell Surpasses 1000 TPS/User Barrier with Llama 4 Maverick
- insidehpc.com: Details on NVIDIA Reports Blackwell Surpasses 1000 TPS/User Barrier with Meta's Llama 4 Maverick
- www.tomshardware.com: Reports that NVIDIA has broken another AI world record, surpassing 1,000 TPS/user with a DGX B200 node boasting eight Blackwell GPUs inside.
- insidehpc.com: NVIDIA said it has achieved a record large language model (LLM) inference speed, announcing that an NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs achieved more than 1,000 tokens per second (TPS) per user on the 400-billion-parameter Llama 4 Maverick model.
- NVIDIA Technical Blog: NVIDIA has achieved a world-record large language model (LLM) inference speed. A single NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs can achieve over...
- analyticsindiamag.com: Texas Instruments (TI) has announced a collaboration with NVIDIA to develop new power management and sensing technology aimed at supporting future high-voltage power systems in AI data centres.
- www.servethehome.com: The Intel Xeon 6 with priority cores wins big at NVIDIA but there is a lot more going on in the release than meets the eye
Classification:
- HashTags: #AIInference #NVIDIADGX #BlackwellGPU
- Company: NVIDIA
- Target: AI models
- Product: DGX B200
- Feature: AI Inference
- Type: AI
- Severity: Informative