News from the AI & ML world

DeeperML

@cloud.google.com //
Google Cloud is enhancing its AI Hypercomputer to accelerate AI inference workloads, with a focus on maximizing performance and reducing cost for generative AI applications. At Google Cloud Next 25, Google shared updates to AI Hypercomputer's inference capabilities, including Ironwood, its newest Tensor Processing Unit (TPU), designed specifically for inference. Software enhancements include simple, performant inference using vLLM on TPU and the latest GKE inference capabilities, such as GKE Inference Gateway and GKE Inference Quickstart. With AI Hypercomputer, Google aims to support the next phase of AI's rapid evolution.

JetStream, Google's JAX-based inference engine, incorporates new performance optimizations, integrating Pathways for ultra-low-latency, multi-host, disaggregated serving. The sixth-generation Trillium TPU delivers 2.9x higher throughput for Llama 2 70B and 2.8x higher for Mixtral 8x7B compared to TPU v5e. By offering more choice for serving LLMs on TPU, JetStream maximizes performance and reduces inference cost; its improved throughput reaches 1703 tokens/s for Llama 3.1 405B on Trillium.
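As a rough illustration of how serving throughput translates into cost, the sketch below converts the reported 1703 tokens/s into a cost per million generated tokens. The hourly accelerator price used here is a hypothetical placeholder for illustration, not a published Trillium price:

```python
# Rough cost-per-million-tokens estimate from serving throughput.
# The throughput figure (1703 tokens/s) comes from the article; the
# hourly accelerator price is a hypothetical assumption, not a real
# Trillium price.

def cost_per_million_tokens(throughput_tok_s: float, hourly_cost_usd: float) -> float:
    """Return the serving cost in USD per one million generated tokens."""
    tokens_per_hour = throughput_tok_s * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Example: 1703 tokens/s at a hypothetical $10/hour
print(round(cost_per_million_tokens(1703, 10.0), 2))  # → 1.63
```

The relationship is linear: doubling sustained throughput at the same hourly price halves the cost per token, which is why throughput gains like the 2.9x figure above translate directly into lower inference cost.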

Google is also intensifying its efforts to combat online scams by integrating artificial intelligence across Search, Chrome, and Android. AI is central to Google's anti-scam strategy, blocking hundreds of millions of scam results daily and identifying more fraudulent pages. Gemini Nano provides instant detection of high-risk websites, helping counter new and evolving scams across platforms. Google has long used AI to detect and block scams, including fake tech support, fraudulent financial services, and phishing links. Recent updates to AI classifiers now allow the company to detect 20 times more scam pages, improving the quality of search results by reducing exposure to harmful sites.



References:
  • Compute: From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer