@www.artificialintelligence-news.com
//
Hugging Face has partnered with Groq to offer ultra-fast AI model inference, integrating Groq's Language Processing Unit (LPU) inference engine as a native provider on the Hugging Face platform. This collaboration aims to provide developers with access to lightning-fast processing capabilities directly within the popular model hub. Groq's chips are specifically designed for language models, offering a specialized architecture that differs from traditional GPUs by embracing the sequential nature of language tasks, resulting in reduced response times and higher throughput for AI applications.
Developers can now access high-speed inference for multiple open-weight models through Groq's infrastructure, including Meta's Llama 4, Meta's Llama 3, and Qwen's QwQ-32B. Groq is the only inference provider to enable the full 131K context window, allowing developers to build applications at scale. The integration works with Hugging Face's client libraries for both Python and JavaScript, and the setup is deliberately simple: developers specify Groq as their preferred provider with minimal configuration. The partnership is Groq's boldest attempt yet to carve out market share in the rapidly expanding AI inference market, where AWS Bedrock, Google Vertex AI, and Microsoft Azure have dominated by offering convenient access to leading language models. It is also Groq's third major platform partnership in as many months: in April, Groq became the exclusive inference provider for Meta's official Llama API, delivering speeds of up to 625 tokens per second to enterprise customers.
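In Python, switching the provider can look something like the minimal sketch below, using the huggingface_hub client. The provider string, token placeholder, and model ID are illustrative assumptions rather than confirmed configuration.

```python
# Minimal sketch: route a chat completion through Groq via Hugging Face's
# Inference Providers. The provider string, token placeholder, and model ID
# below are illustrative assumptions.
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="groq",      # ask Hugging Face to route requests to Groq
    api_key="hf_xxx",     # your Hugging Face access token (placeholder)
)

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",  # one of the open-weight models mentioned above
    messages=[{"role": "user", "content": "Summarize what an LPU is in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The rest of the application code stays the same; only the provider selection changes.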
staff@insideAI News
//
NVIDIA has reportedly set a new AI inference record, achieving over 1,000 tokens per second (TPS) per user with Meta's Llama 4 Maverick large language model. The result was achieved on NVIDIA's DGX B200 node, which is equipped with eight Blackwell GPUs, and was independently measured by the AI benchmarking service Artificial Analysis. NVIDIA's Blackwell architecture offers substantial improvements in processing power, enabling faster inference times for large language models.
The record-breaking result was achieved through extensive software optimizations, including TensorRT-LLM and a speculative decoding draft model trained with EAGLE-3 techniques; NVIDIA says these optimizations alone delivered a 4x speed-up over the best prior Blackwell baseline. NVIDIA also leveraged FP8 data types for GEMMs, Mixture of Experts (MoE), and attention operations to reduce model size and capitalize on the high FP8 throughput of Blackwell Tensor Cores, and claims that FP8 accuracy matches BF16 across many of the Artificial Analysis metrics. At its highest-throughput configuration, NVIDIA reports that Blackwell reaches 72,000 TPS per server. The milestone underscores the progress made in AI inference through NVIDIA's hardware and software innovations, clearing the way for more efficient and responsive AI applications.
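Speculative decoding is central to that speed-up: a small draft model proposes several tokens ahead, and the large target model verifies them, so multiple tokens can be accepted per expensive forward pass. The toy sketch below shows only the greedy accept/reject loop; `draft_next` and `target_next` are hypothetical callables, and the real EAGLE-3 plus TensorRT-LLM pipeline verifies all drafted positions in a single batched pass rather than one at a time.

```python
# Toy illustration of greedy speculative decoding: a cheap draft model
# proposes a short run of tokens, and the expensive target model keeps the
# longest prefix it agrees with. `draft_next` and `target_next` are
# hypothetical callables mapping a list of token ids to the next token id.
def speculative_decode(target_next, draft_next, prompt, n_draft=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:  # may overshoot slightly; fine for a toy
        # 1. Draft model proposes n_draft tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(n_draft):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2. Target model verifies the proposal position by position.
        accepted = []
        for i, t in enumerate(proposal):
            target_tok = target_next(tokens + proposal[:i])
            if target_tok == t:
                accepted.append(t)           # draft token accepted
            else:
                accepted.append(target_tok)  # first mismatch: take the target's token and stop
                break
        tokens.extend(accepted)
    return tokens
```

When the draft model agrees with the target most of the time, several tokens are emitted per target-model pass, which is where the throughput gain comes from.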
@cloud.google.com
//
Google Cloud is enhancing its AI Hypercomputer to accelerate AI inference workloads, focusing on maximizing performance and reducing costs for generative AI applications. At Google Cloud Next 25, Google shared updates to AI Hypercomputer's inference capabilities, showcasing Ironwood, its newest Tensor Processing Unit (TPU), designed for inference. Software enhancements include simple, performant inference using vLLM on TPU and the latest GKE inference capabilities, such as GKE Inference Gateway and GKE Inference Quickstart. With AI Hypercomputer, Google is paving the way for the next phase of AI's rapid evolution.
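As a rough illustration of the vLLM path, the sketch below uses vLLM's offline inference API, which the TPU backend also exposes; accelerator-specific flags and the GKE Inference Gateway/Quickstart wiring are omitted, and the model ID is an illustrative assumption.

```python
# Minimal sketch of vLLM's offline inference API (the interface the TPU
# backend also exposes). Accelerator-specific flags and GKE Inference
# Gateway/Quickstart configuration are omitted; the model ID is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # example model to serve
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what an inference gateway does."], params)
print(outputs[0].outputs[0].text)
```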
Google's JetStream inference engine incorporates new performance optimizations, integrating Pathways for ultra-low-latency multi-host, disaggregated serving. The sixth-generation Trillium TPU delivers 2.9x higher throughput for Llama 2 70B and 2.8x higher for Mixtral 8x7B compared with TPU v5e. Google's JAX inference stack maximizes performance and reduces inference costs by offering more choice when serving LLMs on TPU, and JetStream throughput has improved to 1,703 tokens/s for Llama 3.1 405B on Trillium.

Separately, Google is intensifying its efforts to combat online scams by integrating artificial intelligence across Search, Chrome, and Android. AI is central to Google's anti-scam strategy, blocking hundreds of millions of scam results daily and identifying more fraudulent pages, while Gemini Nano provides instant detection of high-risk websites, helping counter new and evolving scams across platforms. Google has long used AI to detect and block scams, including fake tech support, fraudulent financial services, and phishing links, and recent updates to its AI classifiers allow the company to detect 20 times more scam pages, improving the quality of search results by reducing exposure to harmful sites.
staff@insideAI News
//
References: insideAI News, Ken Yeung
Meta is partnering with Cerebras to enhance AI inference speeds within Meta's new Llama API. This collaboration combines Meta's open-source Llama models with Cerebras' specialized inference technology, aiming to provide developers with significantly faster performance. According to Cerebras, developers building on the Llama 4 Cerebras model within the API can expect speeds up to 18 times quicker than traditional GPU-based solutions. This acceleration is expected to unlock new possibilities for building real-time and agentic AI applications, making complex tasks like low-latency voice interaction, interactive code generation, and real-time reasoning more feasible.
This partnership allows Cerebras to expand its reach to a broader developer audience, strengthening its existing relationship with Meta. Since launching its inference solutions in 2024, Cerebras has emphasized its ability to deliver rapid Llama inference, serving billions of tokens through its AI infrastructure. Andrew Feldman, CEO and co-founder of Cerebras, stated that the company is proud to make Llama API the fastest inference API available, empowering developers to create AI systems previously unattainable with GPU-based inference clouds. Independent benchmarks by Artificial Analysis support this claim, indicating that Cerebras achieves significantly higher token processing speeds than platforms like ChatGPT and DeepSeek. Developers get direct access to the accelerated Llama 4 inference by selecting Cerebras within the Llama API.

Meta also continues to iterate on its AI app, testing features such as "Reasoning" mode and "Voice Personalization" designed to enhance user interaction. The "Reasoning" feature could offer more transparent explanations for the AI's responses, while voice settings like "Focus on my voice" and "Welcome message" could provide more personalized audio interactions, which is especially relevant to Meta's hardware ambitions in areas such as smart glasses and augmented reality devices.
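For developers, the selection step could look something like the hypothetical sketch below, which assumes an OpenAI-compatible Llama API endpoint; the base URL, model name, and provider-selection mechanism shown are illustrative assumptions, not confirmed details of Meta's API.

```python
# Hypothetical sketch of requesting a Cerebras-accelerated Llama 4 model
# through an OpenAI-compatible Llama API endpoint. The base URL, model ID,
# and selection-by-model-name mechanism are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llama.example/v1",  # placeholder Llama API endpoint
    api_key="LLAMA_API_KEY",                  # placeholder credential
)

resp = client.chat.completions.create(
    model="llama-4-cerebras",                 # hypothetical Cerebras-backed model ID
    messages=[{"role": "user", "content": "Draft a one-line status update."}],
)
print(resp.choices[0].message.content)
```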