News from the AI & ML world

DeeperML - #aibenchmark

Asif Razzaq@MarkTechPost //
OpenAI has unveiled PaperBench, a new benchmark designed to rigorously assess the ability of AI agents to autonomously replicate cutting-edge machine learning research. The benchmark consists of 20 papers from ICML 2024, spanning areas such as reinforcement learning and probabilistic methods. PaperBench measures whether AI systems can accurately interpret research papers, independently develop codebases, and execute experiments that replicate the papers' empirical outcomes. To ensure genuinely independent replication, agents are prohibited from referencing the original authors' code.

The effort includes systematic evaluation tools and detailed rubrics, co-developed with the papers' original authors, that break each replication down into a total of 8,316 individually gradable tasks, enabling precise evaluation of AI capabilities. Separately, OpenAI is escalating its competition with Anthropic by offering free ChatGPT Plus subscriptions to college students in the US and Canada through the end of May. The move gives millions of students access to OpenAI's premium service just as they prepare for final exams, with capabilities such as GPT-4o, image generation, voice interaction, and advanced research tools.
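For a sense of how such a rubric could be scored, here is a minimal Python sketch of a weighted rubric tree; the node names, weights, and aggregation rule are illustrative assumptions, not PaperBench's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One requirement in a hierarchical replication rubric (hypothetical structure)."""
    name: str
    weight: float = 1.0
    score: float | None = None                  # graded leaves receive a 0-1 score
    children: list["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        """Weighted average of child scores; a graded leaf returns its own score."""
        if not self.children:
            return self.score or 0.0
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total

# Hypothetical rubric fragment for one paper, graded at the leaves.
paper = RubricNode("replicate-paper", children=[
    RubricNode("codebase-reproduces-method", weight=2.0, score=1.0),
    RubricNode("experiments-match-reported-results", weight=3.0, score=0.4),
])
print(f"replication score: {paper.aggregate():.2f}")  # 0.64
```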

References :
  • venturebeat.com: OpenAI just made ChatGPT Plus free for millions of college students — and it’s a brilliant competitive move against Anthropic
  • MarkTechPost: OpenAI Releases PaperBench: A Challenging Benchmark for Assessing AI Agents’ Abilities to Replicate Cutting-Edge Machine Learning Research
  • www.techradar.com: OpenAI is giving away ChatGPT Plus subscriptions to students to help you study for finals – here’s how to apply
  • THE DECODER: Anthropic brings AI assistant Claude to university campuses
  • www.techradar.com: ChatGPT-5 is on hold as OpenAI changes plans and releases new o3 and o4-mini models
  • BleepingComputer: OpenAI's ChatGPT Plus free for students
  • www.zdnet.com: ChatGPT Plus is free for students now - how to grab this deal before finals
  • The Tech Basic: OpenAI and Anthropic are fighting to be students’ favorite AI tools; this week, both released free offerings for college students, betting that students who adopt their AI now will keep using it after graduation.
  • THE DECODER: OpenAI plans GPT-5 release in "a few months," shifts strategy on reasoning models

Ellie Ramirez-Camara@Data Phoenix //
The ARC Prize Foundation has launched ARC-AGI-2, a new AI benchmark designed to challenge current foundation models and track progress towards artificial general intelligence (AGI). Building on the original ARC benchmark, ARC-AGI-2 blocks brute-force techniques and introduces new tasks intended for next-generation AI systems. The goal is to evaluate real progress toward AGI by requiring models to reason abstractly, generalize from few examples, and apply knowledge in new contexts: tasks that are simple for humans but difficult for machines.

The Foundation has also announced the ARC Prize 2025, a competition running from March 26 to November 3, with a grand prize of $700,000 for a solution that achieves an 85% score on the ARC-AGI-2 benchmark's private evaluation dataset. Early testing shows that even OpenAI's top models suffer a dramatic performance drop, with o3 falling from 75% on the original ARC-AGI to approximately 4% on ARC-AGI-2. This underscores how much the new benchmark raises the bar for AI tests, measuring general fluid intelligence rather than memorized skills.
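For context, ARC tasks are small JSON objects of colored-grid input/output pairs, and ARC Prize scoring counts a task as solved only if one of up to two predicted output grids matches the target exactly. The following Python sketch uses an invented toy task to illustrate the format and the two-attempt check; it is not the Foundation's code.

```python
import json

# ARC-style task: a few train input/output grid pairs plus a held-out test pair,
# in the public ARC-AGI JSON layout. The grids below are invented toy examples.
task = json.loads("""
{"train": [
   {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
   {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}],
 "test": [
   {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}]}
""")

def solved(attempts: list[list[list[int]]], target: list[list[int]]) -> bool:
    """A task counts as solved only if one of up to two attempts matches exactly."""
    return any(attempt == target for attempt in attempts[:2])

def solve(grid: list[list[int]]) -> list[list[int]]:
    """Toy solver for this toy task: mirror each row."""
    return [row[::-1] for row in grid]

test_pair = task["test"][0]
print(solved([solve(test_pair["input"])], test_pair["output"]))  # True
```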

References :
  • RunPod Blog: The race toward artificial general intelligence isn't just happening behind closed doors at trillion-dollar tech companies. It's also unfolding in the open, in research labs, Discord servers, GitHub repos, and competitions like the ARC Prize. This year, the ARC Prize Foundation is back with ARC-AGI-2.
  • Data Phoenix: The ARC Prize Foundation has officially released the ARC-AGI-2 to challenge current foundation models and help track progress towards AGI. Additionally, the Foundation has opened the ARC Prize 2025, running from Mar 26 to Nov 3, with a $700K Grand Prize for an 85% scoring solution on the ARC-AGI-2.
  • THE DECODER: The new AI benchmark ARC-AGI-2 significantly raises the bar for AI tests. While humans can easily solve the tasks, even highly developed AI systems such as OpenAI o3 clearly fail.
  • eWEEK: The newest AI benchmark, ARC-AGI-2, builds on the first iteration by blocking brute-force techniques and designing new tasks for next-gen AI systems.

@Google DeepMind Blog //
ARC Prize has launched ARC-AGI-2, its toughest AI benchmark yet, alongside the announcement of its 2025 competition with $1 million in prizes. ARC-AGI-2 aims to push the limits of general and adaptive AI: as AI progresses beyond narrow tasks toward general intelligence, these challenges are meant to uncover capability gaps and actively guide innovation. ARC-AGI-2 is designed to be relatively easy for humans, who can solve every task in under two attempts, yet hard or impossible for current AI, focusing on areas such as symbolic interpretation, compositional reasoning, and contextual rule application.

The benchmark comprises datasets with varying visibility, and its tasks stress those same capabilities: symbolic interpretation, compositional reasoning, and contextual rule application. Whereas most existing benchmarks focus on superhuman capabilities, testing advanced, specialised skills, ARC-AGI-2 targets skills humans find easy. The competition challenges AI developers to attain an 85% accuracy rating on ARC-AGI-2's private evaluation dataset.
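Concretely, the headline number is just the fraction of private-evaluation tasks solved, compared against the 85% bar. A minimal Python sketch (the task count and results below are hypothetical):

```python
GRAND_PRIZE_BAR = 0.85  # 85% on the private evaluation set, per ARC Prize 2025

def benchmark_score(task_results: list[bool]) -> float:
    """Fraction of private-eval tasks solved (each already judged on two attempts)."""
    return sum(task_results) / len(task_results)

# Hypothetical run over 120 private-eval tasks (the real task count may differ).
results = [True] * 90 + [False] * 30
score = benchmark_score(results)
print(f"score: {score:.1%}, grand prize: {score >= GRAND_PRIZE_BAR}")  # 75.0%, False
```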

References :
  • Google DeepMind Blog: FACTS Grounding: A new benchmark for evaluating the factuality of large language models
  • AI News: ARC Prize launches its toughest AI benchmark yet: ARC-AGI-2
  • eWEEK: New AI Benchmark ARC-AGI-2 ‘Significantly Raises the Bar for AI’

@the-decoder.com //
OpenAI's o3 model is facing scrutiny after achieving record-breaking results on FrontierMath, an AI math benchmark developed by Epoch AI. It has since emerged that OpenAI quietly funded FrontierMath's development and had prior access to the benchmark's datasets. The company's involvement was not disclosed until it announced o3's unprecedented performance: a 25.2% accuracy rate, a significant jump from the roughly 2% scores of previous models. This lack of transparency has drawn comparisons to the Theranos scandal, raising concerns about potential data manipulation and biased results. Epoch AI's associate director has admitted the lack of transparency was a mistake.

The controversy has sparked debate within the AI community, raising questions about the legitimacy of o3's performance. While OpenAI claims the data wasn't used for model training, concerns linger: six mathematicians who contributed to the benchmark said they were unaware of OpenAI's involvement or of the company's exclusive access, and indicated that had they known, they might not have contributed. Epoch AI has said that an "unseen-by-OpenAI hold-out set" was used to verify the model's capabilities, and it is now developing new hold-out questions to retest o3's performance while ensuring OpenAI has no prior access.
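The hold-out idea itself is simple to illustrate: withhold a random subset of problems from the shared dataset and score the model on it separately, so contamination would show up as a gap between the two scores. The Python sketch below is a generic illustration with placeholder names, not Epoch AI's actual procedure.

```python
import random

def split_holdout(problems: list[str], holdout_frac: float = 0.2, seed: int = 0):
    """Split a benchmark into a shared set and a hold-out set the funder never sees."""
    rng = random.Random(seed)
    shuffled = problems[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    return shuffled[cut:], shuffled[:cut]  # (shared, hold_out)

# Hypothetical benchmark of 300 problems; names are placeholders.
shared, hold_out = split_holdout([f"problem-{i}" for i in range(300)])
print(len(shared), len(hold_out))  # 240 60

# Re-scoring a model on hold_out alone makes contamination visible:
# a large gap between its shared-set and hold-out scores is a red flag.
```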

References :
  • Analytics India Magazine: OpenAI had prior access to datasets of the benchmark on which its o3 model scored record results.
  • THE DECODER: OpenAI's involvement in funding FrontierMath, a leading AI math benchmark, only came to light when the company announced its record-breaking performance on the test. Now, the benchmark's developer Epoch AI acknowledges they should have been more transparent about the relationship.
  • LessWrong: Some lessons from the OpenAI-FrontierMath debacle
  • Pivot to AI: OpenAI o3 beats FrontierMath — because OpenAI funded the test and had access to the questions