News from the AI & ML world

DeeperML - #aiinference

@www.artificialintelligence-news.com //
Hugging Face has partnered with Groq to offer ultra-fast AI model inference, integrating Groq's Language Processing Unit (LPU) inference engine as a native provider on the Hugging Face platform. This collaboration aims to provide developers with access to lightning-fast processing capabilities directly within the popular model hub. Groq's chips are specifically designed for language models, offering a specialized architecture that differs from traditional GPUs by embracing the sequential nature of language tasks, resulting in reduced response times and higher throughput for AI applications.

Developers can now access high-speed inference for multiple open-weight models through Groq’s infrastructure, including Meta’s Llama 4, Meta’s Llama 3, and Qwen’s QwQ-32B. Groq is currently the only inference provider to enable the full 131K-token context window, allowing developers to build applications at scale. The integration works seamlessly with Hugging Face’s client libraries for both Python and JavaScript, and the setup is refreshingly simple: developers specify Groq as their preferred provider with minimal configuration.
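
The article keeps the integration at a high level, but a minimal sketch of what provider selection looks like from the Python side might be the following (assuming a recent huggingface_hub release with Inference Providers support; the model ID and environment variable are illustrative):

```python
# Minimal sketch: routing a chat completion through Groq via Hugging Face's
# Python client. Assumes huggingface_hub with Inference Providers support
# and an HF token in the environment; the model ID is illustrative.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="groq",                     # route requests to Groq's LPU infrastructure
    api_key=os.environ["HF_TOKEN"],
)

response = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",   # any Groq-served open-weight model
    messages=[{"role": "user", "content": "Explain what an LPU is in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Swapping providers is then a one-line change, which is presumably what the article means by minimal configuration.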

This partnership marks Groq’s boldest attempt yet to carve out market share in the rapidly expanding AI inference market, where services such as AWS Bedrock, Google Vertex AI, and Microsoft Azure have dominated by offering convenient access to leading language models. It is Groq’s third major platform partnership in as many months: in April, Groq became an official inference partner for Meta’s Llama API, delivering speeds of up to 625 tokens per second to enterprise customers.

Recommended read:
References :
  • venturebeat.com: Groq just made Hugging Face way faster — and it’s coming for AWS and Google
  • www.artificialintelligence-news.com: Hugging Face partners with Groq for ultra-fast AI model inference
  • www.rdworldonline.com: Hugging Face integrates Groq, offering native high-speed inference for 9 major open weight models
  • Simplicity of Hugging Face + Efficiency of Groq: Exciting news for developers and AI enthusiasts! Hugging Face is making it easier than ever to access Groq’s lightning-fast and efficient inference with the direct integration of Groq as a provider on the Hugging Face Playground and API.

staff@insideAI News //
NVIDIA has reportedly set a new AI world record, achieving over 1,000 tokens per second (TPS) per user with Meta's Llama 4 Maverick large language model. The breakthrough was accomplished using NVIDIA's DGX B200 node, which is equipped with eight Blackwell GPUs, and the performance was independently measured by the AI benchmarking service Artificial Analysis. NVIDIA's Blackwell architecture offers substantial improvements in processing power, which enables faster inference for large language models.

This record-breaking result was achieved through extensive software optimizations, including the use of TensorRT-LLM and a speculative decoding draft model trained with EAGLE-3 techniques. These optimizations alone yielded a 4x performance increase over Blackwell's previous best results. NVIDIA also leveraged FP8 data types for GEMMs, Mixture of Experts (MoE), and attention operations to reduce model size and capitalize on the high FP8 throughput of Blackwell Tensor Cores. The company claims that accuracy with the FP8 data format matches BF16 across many of the metrics Artificial Analysis measures.
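
The writeup doesn't show the mechanism, but the accept/reject loop at the heart of speculative decoding is compact enough to sketch. The toy draft and target distributions below are stand-ins for EAGLE-3 and Llama 4 Maverick (nothing here touches TensorRT-LLM); the point is how accepted draft tokens let the expensive target model validate several positions per step instead of generating one token at a time:

```python
# Toy sketch of speculative decoding: a cheap "draft" model proposes k tokens,
# and the "target" model verifies them with the standard accept/reject rule,
# so the output is still an exact sample from the target distribution.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary


def draft_dist(context):
    """Cheap draft model: a fixed, slightly 'wrong' distribution."""
    logits = np.cos(np.arange(VOCAB) + len(context))
    return np.exp(logits) / np.exp(logits).sum()


def target_dist(context):
    """Expensive target model: the distribution we actually want to sample from."""
    logits = 1.5 * np.sin(np.arange(VOCAB) + len(context))
    return np.exp(logits) / np.exp(logits).sum()


def speculative_step(context, k=4):
    """One decoding step: draft k tokens, then accept or resample each one."""
    drafts, draft_probs, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafts.append(tok)
        draft_probs.append(q)
        ctx.append(tok)

    accepted = []
    for tok, q in zip(drafts, draft_probs):
        p = target_dist(list(context) + accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)               # draft token accepted "for free"
        else:
            residual = np.maximum(p - q, 0.0)  # resample from the leftover target mass
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted                    # stop at the first rejection
    # every draft accepted: the target model emits one bonus token
    p = target_dist(list(context) + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return accepted


tokens = []
while len(tokens) < 16:
    tokens.extend(speculative_step(tokens))
print(tokens[:16])
```

In production the draft model is a small trained network (EAGLE-3 in NVIDIA's case) and verification is a single batched forward pass of the target model, which is where the speed-up comes from.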

NVIDIA reports that Blackwell reaches 72,000 TPS per server in its highest-throughput configuration. The milestone underscores the progress in AI inference made possible by NVIDIA's combined hardware and software work, clearing the way for more efficient and responsive AI applications.

Recommended read:
References :
  • insideAI News: Details of AI Inference: NVIDIA Reports Blackwell Surpasses 1000 TPS/User Barrier with Llama 4 Maverick
  • insidehpc.com: Details on NVIDIA Reports Blackwell Surpasses 1000 TPS/User Barrier with Meta’s Llama 4 Maverick
  • www.tomshardware.com: Reports that Nvidia has broken another AI world record, surpassing 1,000 TPS/user with a DGX B200 node boasting eight Blackwell GPUs inside.
  • insidehpc.com: NVIDIA said it has achieved a record large language model (LLM) inference speed, announcing that an NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs achieved more than 1,000 tokens per second (TPS) per user on the 400-billion-parameter Llama 4 Maverick model.
  • NVIDIA Technical Blog: NVIDIA has achieved a world-record large language model (LLM) inference speed. A single NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs can achieve over...
  • analyticsindiamag.com: Texas Instruments (TI) has announced a collaboration with NVIDIA to develop new power management and sensing technology aimed at supporting future high-voltage power systems in AI data centres.
  • www.servethehome.com: The Intel Xeon 6 with priority cores wins big at NVIDIA but there is a lot more going on in the release than meets the eye

@cloud.google.com //
References: Compute, github.com, cloud.google.com ...
Google Cloud is enhancing its AI Hypercomputer to accelerate AI inference workloads, focusing on maximizing performance and reducing costs for generative AI applications. At Google Cloud Next 25, updates to AI Hypercomputer's inference capabilities were shared, showcasing Google's newest Tensor Processing Unit (TPU) called Ironwood, designed for inference. Software enhancements include simple and performant inference using vLLM on TPU and the latest GKE inference capabilities such as GKE Inference Gateway and GKE Inference Quickstart. Google is paving the way for the next phase of AI's rapid evolution with the AI Hypercomputer.
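
The announcement doesn't include code, but vLLM's offline API gives a sense of how little the serving code changes when the backend is a TPU. The sketch below uses vLLM's standard Python interface; TPU provisioning (for example, a GKE node pool with Trillium) and the exact model are assumptions for illustration:

```python
# Minimal sketch of serving an open-weight model with vLLM, the library the
# AI Hypercomputer update highlights for TPU inference. The Python API is
# vLLM's standard offline interface; accelerator setup is assumed to be
# handled by the environment, and the model ID is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf")   # illustrative model choice
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain what disaggregated serving means for LLM inference."],
    params,
)
print(outputs[0].outputs[0].text)
```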

Google's JetStream inference engine incorporates new performance optimizations, integrating Pathways for ultra-low-latency, multi-host disaggregated serving. The sixth-generation Trillium TPU delivers 2.9x higher throughput for Llama 2 70B and 2.8x for Mixtral 8x7B compared with TPU v5e. JetStream, Google's JAX-based inference engine, maximizes performance and reduces inference costs by offering more choice when serving LLMs on TPU, and its improved throughput reaches 1,703 tokens/s for Llama 3.1 405B on Trillium.

Google is also intensifying its efforts to combat online scams by integrating artificial intelligence across Search, Chrome, and Android. AI is central to Google's anti-scam strategy, blocking hundreds of millions of scam results daily and identifying more fraudulent pages. Gemini Nano provides instant detection of high-risk websites, helping counter new and evolving scams across platforms. Google has long used AI to detect and block scams, including fake tech support, fraudulent financial services, and phishing links. Recent updates to AI classifiers now allow the company to detect 20 times more scam pages, improving the quality of search results by reducing exposure to harmful sites.

Recommended read:
References :
  • Compute: From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer
  • github.com: To maximize performance and reduce inference costs, we are excited to offer more choice when serving LLMs on TPU, further enhancing JetStream and bringing vLLM, a widely-adopted fast and efficient library for serving LLMs.
  • blog.google
  • cloud.google.com: Accelerating AI inference with Google Cloud TPUs and GPUs

staff@insideAI News //
References: insideAI News, Ken Yeung
Meta is partnering with Cerebras to enhance AI inference speeds within Meta's new Llama API. This collaboration combines Meta's open-source Llama models with Cerebras' specialized inference technology, aiming to provide developers with significantly faster performance. According to Cerebras, developers building on the Llama 4 Cerebras model within the API can expect inference up to 18 times faster than traditional GPU-based solutions. This acceleration is expected to unlock new possibilities for building real-time and agentic AI applications, making complex tasks like low-latency voice interaction, interactive code generation, and real-time reasoning more feasible.

This partnership allows Cerebras to expand its reach to a broader developer audience, strengthening its existing relationship with Meta. Since launching its inference solutions in 2024, Cerebras has emphasized its ability to deliver rapid Llama inference, serving billions of tokens through its AI infrastructure. Andrew Feldman, CEO and co-founder of Cerebras, stated that the company is proud to make Llama API the fastest inference API available, empowering developers to create AI systems previously unattainable with GPU-based inference clouds. Independent benchmarks by Artificial Analysis support this claim, indicating that Cerebras achieves significantly higher token processing speeds compared to platforms like ChatGPT and DeepSeek.

Developers will have direct access to the enhanced Llama 4 inference by selecting Cerebras within the Llama API. Meta also continues to innovate with its AI app, testing new features such as "Reasoning" mode and "Voice Personalization," designed to enhance user interaction. The “Reasoning” feature could potentially offer more transparent explanations for the AI’s responses, while voice settings like "Focus on my voice" and "Welcome message" could offer more personalized audio interactions, especially relevant for Meta's hardware ambitions in areas such as smart glasses and augmented reality devices.
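
None of the coverage shows what "selecting Cerebras" looks like in practice, but a hypothetical sketch, assuming the Llama API exposes an OpenAI-compatible endpoint, might look like the following. The base URL and the "-cerebras" model suffix are illustrative assumptions, not documented identifiers:

```python
# Hypothetical sketch of requesting Cerebras-accelerated inference through
# Meta's Llama API via an OpenAI-compatible client. The base_url and model
# name below are assumed for illustration; consult the Llama API docs for
# the real values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llama.com/compat/v1/",     # assumed compatibility endpoint
    api_key=os.environ["LLAMA_API_KEY"],
)

resp = client.chat.completions.create(
    model="llama-4-maverick-cerebras",               # hypothetical name selecting Cerebras
    messages=[{"role": "user", "content": "Draft a one-line status update."}],
)
print(resp.choices[0].message.content)
```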

Recommended read:
References :
  • insideAI News: Meta has teamed with Cerebras on AI inference in Meta’s new Llama API, combining Meta’s open-source Llama models with inference technology from Cerebras.
  • Ken Yeung: IN THIS ISSUE: Meta hosts its first-ever event around its Llama model, launching a standalone app to take on Microsoft’s Copilot and ChatGPT.