News from the AI & ML world

DeeperML

Brian Wang@NextBigFuture.com //
Leaked benchmarks suggest that xAI's upcoming Grok 4 model could outperform the current leading AI models. The leaked data shows strong scores on several benchmarks, including Humanity's Last Exam (HLE), GPQA, and SWE-bench. If accurate, the results would put Grok 4 well ahead of both its predecessors and its competitors.

According to the leak, Grok 4 scored 35% on HLE, rising to 45% with reasoning enabled. Previous top models scored around 21%, so this would be a substantial improvement. On GPQA, Grok 4 reportedly achieved 87-88%, while the specialized "Grok 4 Code" variant scored 72-75% on SWE-bench. If they hold up, these scores point to strength in complex problem-solving, coding, and logical reasoning.
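To put the HLE figures in perspective, the short Python sketch below works out the absolute and relative gains over the roughly 21% baseline. The numbers are the leaked, unverified figures quoted above; the names `BASELINE_HLE`, `GROK4_HLE`, and `gains` are illustrative only.

```python
# Leaked HLE figures quoted in the article; treated here as unverified inputs.
BASELINE_HLE = 21.0  # approximate score of previous top models, in percent
GROK4_HLE = {"base": 35.0, "with reasoning": 45.0}


def gains(score, baseline=BASELINE_HLE):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = score - baseline
    relative = absolute / baseline * 100.0
    return absolute, relative


for label, score in GROK4_HLE.items():
    absolute, relative = gains(score)
    print(f"{label}: +{absolute:.0f} points, {relative:.0f}% relative gain")
# base: +14 points, 67% relative gain
# with reasoning: +24 points, 114% relative gain
```

In other words, if the leak is accurate, the reasoning-enabled score would be a little more than double the earlier best result on this benchmark.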

The timing of the Grok 4 launch is crucial for xAI, as competition in the AI landscape intensifies. With rivals like OpenAI and Google expected to release new models soon, xAI aims to establish Grok 4 as a frontrunner. The new features and performance enhancements are expected to be accessible through the xAI developer console and API, potentially extending to consumer products. If the benchmark claims are accurate, Grok 4 could solidify xAI's position as a leading AI research lab, but its success hinges on the actual release and real-world performance.
Original img attribution: https://nextbigfuture.s3.amazonaws.com/uploads/2025/07/xaigrok4-1.jpeg



References:
  • NextBigFuture.com: XAI Grok 4 Benchmarks are showing it is the leading model. Humanity Last Exam at 35 and 45 for reasoning is a big improvement from about 21 for other top models. If these leaked Grok 4 benchmarks are correct, 95 AIME, 88 GPQA, 75 SWE-bench, then XAI has the most powerful model on the market. ...
  • TestingCatalog: Grok 4 will be SOTA, according to the leaked benchmarks; 35% on HLE, 45% with reasoning; 87-88% on GPQA; 72-75% on SWE Bench (for Grok 4 Code)
Classification:
  • HashTags: #AI #Grok4 #xAI
  • Company: xAI
  • Product: Grok 4
  • Feature: Reasoning & Code
  • Type: Research
  • Severity: Medium