Asif Razzaq@MarkTechPost
//
OpenAI has unveiled PaperBench, a new benchmark designed to rigorously assess the ability of AI agents to autonomously replicate cutting-edge machine learning research. The benchmark consists of 20 papers from ICML 2024, spanning areas such as reinforcement learning and probabilistic methods. PaperBench measures whether AI systems can accurately interpret research papers, independently develop codebases, and execute experiments that replicate the papers' empirical outcomes. To ensure genuinely independent replication, agents are prohibited from referencing the original authors' code.
The effort includes systematic evaluation tools and detailed rubrics, co-developed with the original paper authors, which together specify 8,316 individually gradable tasks for precise evaluation of AI capabilities. Separately, OpenAI is escalating competition with Anthropic by offering free ChatGPT Plus subscriptions to college students in the US and Canada through the end of May. The move gives millions of students access to OpenAI's premium service just as they prepare for final exams, with capabilities such as GPT-4o, image generation, voice interaction, and advanced research tools.
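To make the rubric idea concrete, the sketch below shows one way fine-grained, individually graded criteria could be rolled up into an overall replication score. The RubricNode class, node names, and weights are illustrative assumptions, not PaperBench's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node in a hypothetical hierarchical replication rubric."""
    name: str
    weight: float = 1.0                       # relative importance among siblings
    passed: bool | None = None                # leaf criteria: graded pass/fail
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        """Leaves contribute 0 or 1; internal nodes are weighted averages of children."""
        if not self.children:
            return 1.0 if self.passed else 0.0
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total

# Illustrative fragment of a rubric for one paper-replication attempt.
rubric = RubricNode("replicate-paper", children=[
    RubricNode("code-development", weight=0.4, children=[
        RubricNode("implements-training-loop", passed=True),
        RubricNode("implements-evaluation", passed=False),
    ]),
    RubricNode("experiment-execution", weight=0.3, children=[
        RubricNode("runs-main-experiment", passed=True),
    ]),
    RubricNode("result-match", weight=0.3, children=[
        RubricNode("reproduces-main-table", passed=False),
    ]),
])

print(f"Replication score: {rubric.score():.2%}")
```

In a scheme like this, grading thousands of small, concrete criteria rather than one pass/fail judgment is what makes partial replication progress measurable.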
Ellie Ramirez-Camara@Data Phoenix
//
The ARC Prize Foundation has launched ARC-AGI-2, a new AI benchmark designed to challenge current foundation models and track progress towards artificial general intelligence (AGI). Building on the original ARC benchmark, ARC-AGI-2 blocks brute force techniques and introduces new tasks intended for next-generation AI systems. The goal is to evaluate real progress toward AGI by requiring models to reason abstractly, generalize from few examples, and apply knowledge in new contexts, tasks that are simple for humans but difficult for machines.
The Foundation has also announced the ARC Prize 2025, a competition running from March 26 to November 3, with a grand prize of $700,000 for a solution that achieves an 85% score on the ARC-AGI-2 benchmark's private evaluation dataset. Early testing shows that even OpenAI's top models suffer a steep performance drop: o3 falls from 75% on the original benchmark to approximately 4% on ARC-AGI-2. This underscores how much the new benchmark raises the bar, measuring general fluid intelligence rather than memorized skills.
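As a rough illustration of what the 85% target means, the snippet below grades a set of predicted outputs against hidden solutions and checks the overall accuracy against the prize threshold. The task format, function names, and scoring details are hypothetical; the real competition harness (including any multi-attempt rules) is not modeled here.

```python
# Hypothetical scoring harness: ARC-style tasks are graded pass/fail, and the
# prize threshold is an overall accuracy on the private evaluation set.
PRIZE_THRESHOLD = 0.85

def grade_task(predicted_grid: list[list[int]], solution_grid: list[list[int]]) -> bool:
    """A task counts as solved only if the predicted output grid matches exactly."""
    return predicted_grid == solution_grid

def evaluate(predictions: dict[str, list[list[int]]],
             solutions: dict[str, list[list[int]]]) -> float:
    """Fraction of hidden tasks solved; missing predictions count as failures."""
    solved = sum(grade_task(predictions.get(task_id, []), grid)
                 for task_id, grid in solutions.items())
    return solved / len(solutions)

# Toy example with two hidden tasks.
solutions = {"task-001": [[1, 0], [0, 1]], "task-002": [[2, 2], [2, 2]]}
predictions = {"task-001": [[1, 0], [0, 1]], "task-002": [[2, 0], [2, 2]]}

accuracy = evaluate(predictions, solutions)
print(f"accuracy = {accuracy:.0%}, prize threshold met: {accuracy >= PRIZE_THRESHOLD}")
```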
@Google DeepMind Blog
//
References: Google DeepMind Blog, AI News
ARC Prize has launched ARC-AGI-2, its toughest AI benchmark yet, accompanied by the announcement of its 2025 competition with $1 million in prizes. ARC-AGI-2 aims to push the limits of general and adaptive AI. As AI progresses beyond narrow tasks toward general intelligence, these challenges aim to uncover capability gaps and actively guide innovation. ARC-AGI-2 is designed to be relatively easy for humans, who can solve every task in under two attempts, yet hard or impossible for AI, focusing on areas like symbolic interpretation, compositional reasoning, and contextual rule application.
The benchmark includes datasets with varying levels of visibility. Whereas most existing benchmarks focus on superhuman capabilities and test advanced, specialised skills, the competition challenges AI developers to attain 85% accuracy on ARC-AGI-2's private evaluation dataset.
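The note about varying visibility reflects a common benchmark design: some splits are public for development, while a hidden split determines official scores. A minimal, hypothetical sketch of such a declaration follows; split names and flags are assumptions, not ARC Prize's published configuration.

```python
# Hypothetical declaration of dataset splits with different visibility levels.
DATASET_SPLITS = {
    "public_training":    {"visible_to_public": True,  "used_for_scoring": False},
    "public_evaluation":  {"visible_to_public": True,  "used_for_scoring": False},
    "private_evaluation": {"visible_to_public": False, "used_for_scoring": True},
}

def scoring_splits() -> list[str]:
    """Only hidden splits count toward the official leaderboard score."""
    return [name for name, cfg in DATASET_SPLITS.items() if cfg["used_for_scoring"]]

print(scoring_splits())  # ['private_evaluation']
```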
@the-decoder.com
//
OpenAI's o3 model is facing scrutiny after achieving record-breaking results on the FrontierMath benchmark, an AI math test developed by Epoch AI. It has emerged that OpenAI quietly funded the development of FrontierMath, and had prior access to the benchmark's datasets. The company's involvement was not disclosed until the announcement of o3's unprecedented performance, where it achieved a 25.2% accuracy rate, a significant jump from the 2% scores of previous models. This lack of transparency has drawn comparisons to the Theranos scandal, raising concerns about potential data manipulation and biased results. Epoch AI's associate director has admitted the lack of transparency was a mistake.
The controversy has sparked debate within the AI community, raising questions about the legitimacy of o3's performance. While OpenAI claims the data was not used for model training, concerns linger: six mathematicians who contributed to the benchmark said they were unaware of OpenAI's involvement or of the company's exclusive access, and indicated that, had they known, they might not have contributed to the project. Epoch AI has said that an "unseen-by-OpenAI hold-out set" was used to verify the model's capabilities, and it is now developing new hold-out questions to retest o3's performance while ensuring OpenAI has no prior access.
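The hold-out approach Epoch AI describes is the standard guard against contamination: score the model only on problems verified never to have been shared with the vendor. Below is a minimal sketch under that assumption; the problem fields, the was_shared_with_vendor flag, and query_model are hypothetical stand-ins, not Epoch AI's actual pipeline.

```python
# Minimal sketch of hold-out evaluation to guard against data contamination.
def query_model(question: str) -> str:
    """Stand-in for an actual model API call."""
    return "42"

problems = [
    {"id": "fm-001", "question": "(problem text)", "answer": "42",  "was_shared_with_vendor": True},
    {"id": "fm-002", "question": "(problem text)", "answer": "17",  "was_shared_with_vendor": False},
    {"id": "fm-003", "question": "(problem text)", "answer": "108", "was_shared_with_vendor": False},
]

# Only problems never shared with the model's developer count toward the
# verified score, so prior access to the rest cannot inflate the result.
holdout = [p for p in problems if not p["was_shared_with_vendor"]]
correct = sum(query_model(p["question"]) == p["answer"] for p in holdout)
print(f"hold-out accuracy: {correct}/{len(holdout)}")
```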