@www.marktechpost.com
//
Apple researchers are challenging the perceived reasoning capabilities of Large Reasoning Models (LRMs), sparking debate within the AI community. A recent Apple paper, titled "The Illusion of Thinking," argues that these models, which generate intermediate thinking steps such as Chain-of-Thought traces, break down once reasoning problems pass a certain complexity. The research also argues that current evaluations built on math and code benchmarks are insufficient, since they often suffer from data contamination and do not assess the structure or quality of the reasoning process itself.
To address these shortcomings, the Apple researchers introduced controllable puzzle environments, including the Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World, which allow precise manipulation of problem complexity. These puzzles demand diverse reasoning abilities, such as constraint satisfaction and sequential planning, and are free from data contamination. The paper concludes that state-of-the-art LRMs fail to develop generalizable problem-solving capabilities, with accuracy collapsing to zero beyond certain complexity thresholds across the different environments.

The research has also drawn criticism. Professor Seok Joon Kwon argues that Apple's lack of high-performance hardware, such as a large GPU-based cluster comparable to those operated by Google or Microsoft, could be a factor in its findings. Others note that the models perform better on familiar puzzles, suggesting their success may reflect training exposure rather than genuine problem-solving skill. Alex Lawsen and "C. Opus" go further, arguing that the results do not support claims about fundamental reasoning limitations but instead highlight engineering issues with token limits and evaluation methods.
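To make "controllable complexity" concrete: for the Tower of Hanoi, a single parameter, the number of disks, fixes both the size of the state space (3^n legal configurations) and the length of the optimal solution (2^n - 1 moves). The short sketch below illustrates that scaling; it is not code from the paper.

```python
# Minimal sketch (not from the Apple paper): one knob controls Tower of Hanoi
# difficulty. With n disks there are 3**n legal states (each disk sits on one
# of three pegs) and the optimal solution needs 2**n - 1 moves, so complexity
# grows exponentially while the puzzle's rules stay identical.

def hanoi_difficulty(n_disks: int) -> dict:
    return {
        "disks": n_disks,
        "legal_states": 3 ** n_disks,
        "minimum_moves": 2 ** n_disks - 1,
    }

if __name__ == "__main__":
    for n in (3, 7, 10, 15):
        d = hanoi_difficulty(n)
        print(f"{d['disks']:>2} disks: {d['legal_states']:>10,} states, "
              f"{d['minimum_moves']:>6,} moves minimum")
```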
nftjedi@chatgptiseatingtheworld.com
//
Apple researchers recently published a study titled "The Illusion of Thinking," suggesting that large language models (LLMs) struggle with true reasoning and rely instead on pattern matching. The study based its findings on tasks such as the Tower of Hanoi puzzle, where models reportedly failed as complexity increased, leading to the conclusion that they have limited problem-solving abilities. Those conclusions are now under scrutiny, with critics arguing that the experiments were not fairly designed.
Alex Lawsen of Open Philanthropy has published a counter-study challenging the foundations of Apple's claims. Lawsen argues that models like Claude, Gemini, and OpenAI's latest systems weren't failing because of cognitive limits, but because the evaluation did not account for key technical constraints. One issue is that models were often cut off before they could give full answers as they approached their maximum token limit, a built-in cap on output length, which Apple's evaluation counted as a reasoning failure rather than a practical limitation. Another point of contention involves the River Crossing test, where some problem setups were unsolvable; when models correctly identified those tasks as impossible and declined to attempt them, they were still marked wrong.

The evaluation also judged outputs strictly against exhaustive solutions, giving no credit for partially correct answers, pattern recognition, or strategic shortcuts. To illustrate the gap, Lawsen showed that when models were instead asked to write a program that solves the Hanoi puzzle, they produced accurate, scalable solutions even for 15 disks, contradicting Apple's claim of a hard limit.
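Lawsen's program-writing point is easy to reproduce in spirit. The recursive solver below (a generic illustration, not the models' actual output) covers 15 disks in a dozen lines, even though the resulting move list, 2^15 - 1 = 32,767 moves, would overwhelm a typical output-token budget if written out step by step.

```python
# Generic illustration of Lawsen's argument (not the models' actual output):
# the full Tower of Hanoi solution is a tiny recursive program, even though
# enumerating every move for 15 disks (2**15 - 1 = 32,767 moves) would blow
# past a typical output-token limit.

def solve_hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, source, spare, target, moves)   # clear the way
    moves.append((n, source, target))                  # move the largest disk
    solve_hanoi(n - 1, spare, target, source, moves)   # restack on top of it
    return moves

if __name__ == "__main__":
    moves = solve_hanoi(15)
    print(len(moves))    # 32767 == 2**15 - 1
    print(moves[:3])     # first few moves; the compact program is the point
```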
Mark Gurman@Bloomberg Technology
//
Apple is facing delays in the release of its AI-powered Siri upgrade, now reportedly slated for Spring 2026 with the iOS 26.4 update. This news follows the recent WWDC 2025 event, where AI features were showcased across various Apple operating systems, but the highly anticipated Siri overhaul was notably absent. Sources indicate that the delay stems from challenges in integrating older Siri systems with newer platforms, forcing engineers to rebuild the assistant from scratch. Craig Federighi, Apple’s head of software engineering, explained that the previous V1 architecture was insufficient for achieving the desired quality, prompting a shift to a "deeper end-to-end architecture" known as V2.
The delay has also reportedly caused internal tension at Apple, with the AI and marketing teams allegedly blaming each other for overpromising and missing timelines. No exact date has been set for the iOS 26.4 release, but insiders point to spring, in line with Apple's typical schedule for ".4" updates. The upgraded Siri is expected to offer smarter responses, improved app control, and on-screen awareness, letting it draw on users' personal context and act on what is displayed on their devices.

Separately, Apple researchers have reported structural failures in large reasoning models (LRMs) through puzzle-based evaluations. A recently released Apple research paper found that contemporary LLMs and LRMs fail to make sound judgments as the complexity of problems in controlled puzzle environments increases, revealing fundamental limitations and challenging the belief that these models can think like humans. The work, conducted with puzzles such as the Tower of Hanoi and River Crossing, aimed to assess true reasoning capability by analyzing performance on unfamiliar, contamination-free tasks. Professor Seok Joon Kwon of Sungkyunkwan University, however, believes Apple lacks the high-performance hardware needed to test what high-end LRMs and LLMs are truly capable of.
@machinelearning.apple.com
//
Apple researchers have released a new study questioning the capabilities of Large Reasoning Models (LRMs), casting doubt on the industry's pursuit of Artificial General Intelligence (AGI). The research paper, titled "The Illusion of Thinking," reveals that these models, including those from OpenAI, Google DeepMind, Anthropic, and DeepSeek, experience a 'complete accuracy collapse' when faced with complex problems. Unlike existing evaluations primarily focused on mathematical and coding benchmarks, this study evaluates the reasoning traces of these models, offering insights into how LRMs "think".
The researchers tested various models, including OpenAI's o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet, on puzzles such as the Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. These environments allow complexity to be scaled while the logical structure stays fixed, and the reasoning traces themselves can be evaluated rather than only final answers (a minimal sketch of that kind of move-by-move check appears below). The team identified three distinct performance regimes: low-complexity tasks, where standard language models surprisingly outperform LRMs; medium-complexity tasks, where LRMs show an advantage; and high-complexity tasks, where all models collapse. The study suggests that the so-called reasoning of LRMs may be closer to sophisticated pattern matching, fragile and prone to failure once complexity rises significantly.

Separately, Apple has begun integrating generative AI into its own apps and experiences: the new Foundation Models framework gives app developers access to the on-device foundation language model.
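Apple's evaluation harness is not reproduced here, but grading a trace in these environments amounts to simulating each proposed move against the puzzle's rules rather than only checking the final answer. A minimal sketch of that idea for the Tower of Hanoi, under that assumption:

```python
# Minimal sketch (the paper's actual harness is not reproduced here): grading
# a Tower of Hanoi "reasoning trace" by simulating each proposed move against
# the rules, rather than only checking the final answer.

def check_hanoi_trace(n_disks, moves):
    """moves: list of (from_peg, to_peg) with pegs in {'A', 'B', 'C'}."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # tops at the end
    for step, (src, dst) in enumerate(moves, start=1):
        if not pegs[src]:
            return False, f"step {step}: peg {src} is empty"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, f"step {step}: disk {disk} placed on a smaller disk"
        pegs[dst].append(pegs[src].pop())
    solved = len(pegs["C"]) == n_disks
    return solved, "solved" if solved else "valid moves but goal not reached"

if __name__ == "__main__":
    # A correct 2-disk trace: (True, 'solved')
    print(check_hanoi_trace(2, [("A", "B"), ("A", "C"), ("B", "C")]))
```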
Alexey Shabanov@TestingCatalog
//
OpenAI has recently unveiled its latest reasoning models, o3 and o4-mini, representing state-of-the-art advancements in AI capabilities. These models are designed with a focus on tool use and efficiency, leveraging reinforcement learning to intelligently utilize tools like web search, code interpreter, and memory. OpenAI's o3 demonstrates agentic capabilities, enabling it to function as a streamlined "Deep Research-Lite," capable of delivering rapid responses to complex queries within seconds or minutes, significantly faster than the existing Deep Research model.
The o3 model excels on benchmarks such as the Aider polyglot coding benchmark, where it set a new state-of-the-art score of 79.6%, but its cost is a concern: an estimated $150 per million output tokens, roughly 15 times the price of GPT-4o. o4-mini is a cheaper alternative to o3, scoring 72% on the same benchmark, though it still costs about three times as much as Gemini 2.5. Pairing o3 as a planner with GPT-4.1 for implementation reaches an even higher 83% at about 65% of the cost of o3 alone, though this remains an expensive setup. Despite the pricing, o3's agentic design helps it overcome the limitations of single-shot LLM search: by planning and using tools iteratively, it performs multiple web searches automatically and returns coherent, complete, up-to-date answers.

OpenAI is also experimenting with a "Deep Research Mini" tool for free ChatGPT users, powered by a version of o4-mini, aiming to broaden access to advanced reasoning capabilities. In related news, The Washington Post has partnered with OpenAI to integrate its journalism into ChatGPT's search experience, so users receive summaries, quotes, and direct links to the publication's reporting.
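The "plan, call a tool, read the result, repeat" behavior described above can be sketched generically. The names below (call_model, TOOLS, web_search, run_python) are hypothetical stand-ins rather than OpenAI's actual API; the loop structure is the point.

```python
# Hedged sketch of the agentic "tools in a loop" pattern described above.
# call_model(), web_search(), and run_python() are hypothetical stand-ins,
# not OpenAI's API; the plan -> call tool -> observe -> answer loop is the point.

TOOLS = {
    "web_search": lambda query: f"<search results for {query!r}>",  # placeholder
    "run_python": lambda code: "<stdout of the executed code>",     # placeholder
}

SCRIPTED_REPLIES = iter([                       # toy model: replays a fixed plan
    {"tool": "web_search", "input": "Aider polyglot benchmark scores"},
    {"answer": "o3 scored 79.6% on the Aider polyglot benchmark."},
])

def call_model(messages):
    """Toy stand-in for a model call; a real agent would hit a model endpoint."""
    return next(SCRIPTED_REPLIES)

def agent_loop(question, max_steps=10):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_model(messages)                      # model plans next action
        if reply.get("tool") in TOOLS:                    # model requested a tool
            observation = TOOLS[reply["tool"]](reply["input"])
            messages.append({"role": "tool", "content": observation})
            continue                                      # feed the result back in
        return reply["answer"]                            # no tool call: final answer
    return "stopped: step budget exhausted"

print(agent_loop("What did o3 score on the Aider polyglot benchmark?"))
```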
@www.analyticsvidhya.com
//
OpenAI recently unveiled its groundbreaking o3 and o4-mini AI models, representing a significant leap in visual problem-solving and tool-using artificial intelligence. These models can manipulate and reason with images, integrating them directly into their problem-solving process. This unlocks a new class of problem-solving that blends visual and textual reasoning, allowing the AI to not just see an image, but to "think with it." The models can also autonomously utilize various tools within ChatGPT, such as web search, code execution, file analysis, and image generation, all within a single task flow.
Alongside the reasoning models, OpenAI's GPT-4.1 series, comprising GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, targets coding. GPT-4.1 delivers better performance at lower prices, scoring 54.6% on SWE-bench Verified, a 21.4 percentage point gain over GPT-4o and a substantial improvement in practical software engineering capability. Most notably, GPT-4.1 accepts up to one million tokens of input context, versus 128k for GPT-4o, making it suitable for large codebases and extensive documentation. GPT-4.1 mini and nano offer smaller performance boosts at reduced latency and cost. The new models are available to ChatGPT Plus, Pro, and Team users, with Enterprise and education users gaining access soon. Reasoning alone isn't a silver bullet, but it reliably improves accuracy and problem-solving on challenging tasks, and with Deep Research products and o3/o4-mini, AI-assisted search-based research has become genuinely useful.
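For a sense of scale, a common rough heuristic of about four characters per token for English text and code puts the two context windows side by side; the figures below are approximations under that assumption, not tokenizer-exact numbers.

```python
# Back-of-the-envelope comparison of the two context windows mentioned above.
# Both constants are rough assumptions, not model specifications; use a real
# tokenizer for accurate counts.

CHARS_PER_TOKEN = 4        # rough heuristic; varies by tokenizer and language
CHARS_PER_PAGE = 3_000     # rough assumption for a dense page of text or code

for window, label in [(128_000, "128k-token window (GPT-4o)"),
                      (1_000_000, "1M-token window (GPT-4.1)")]:
    chars = window * CHARS_PER_TOKEN
    print(f"{label}: ~{chars:,} characters, ~{chars // CHARS_PER_PAGE:,} pages")
```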
@www.analyticsvidhya.com
//
OpenAI has recently launched its o3 and o4-mini models, marking a shift towards AI agents with enhanced tool-use capabilities. These models are specifically designed to excel in areas such as web search, code interpretation, and memory utilization, leveraging reinforcement learning to optimize their performance. The focus is on creating AI that can intelligently use tools in a loop, behaving more like a streamlined and rapid-response system for complex tasks. The development underscores a growing industry trend of major AI labs delivering inference-optimized models ready for immediate deployment.
The o3 model stands out for quick answers, often within 30 seconds to three minutes, a marked improvement over the longer response times of earlier reasoning models, and its integrated tool use makes it suitable for real-world applications that need fast, actionable output. Another advantage is its ability to manipulate image inputs with code, cropping and zooming to pick out key features, as demonstrated on tasks like the "GeoGuessr" game (a generic illustration of that operation follows below).

Results are not uniform, however: comparisons against Gemini 2.5 and even the smaller o4-mini show variation across benchmarks. o3 leads on most of them and set a new state of the art of 79.6% on the Aider polyglot coding benchmark, but at a much higher cost; using o3 as a planner with GPT-4.1 handling implementation scored a new SOTA of 83% at about 65% of the cost of o3 alone, though that combination is still expensive. One analysis also notes that context awareness while iterating on code matters, and Gemini 2.5 seems to handle it better than o3 and o4-mini. Overall, the models represent OpenAI's continued push toward more efficient, agentic AI systems.
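The crop-and-zoom step described above can be illustrated with ordinary image tooling. The Pillow sketch below is a generic example of the operation, not o3's internal code; the file path and crop box are placeholders.

```python
# Generic illustration of the crop-and-zoom step described above, using Pillow.
# This is not o3's internal code; "photo.jpg" and the crop box are placeholders.

from PIL import Image

def crop_and_zoom(path, box, zoom=4):
    """Crop a (left, top, right, bottom) region and enlarge it for inspection."""
    image = Image.open(path)
    region = image.crop(box)                              # isolate the detail
    width, height = region.size
    return region.resize((width * zoom, height * zoom))   # zoom in on it

# Example: enlarge a 200x120 patch, e.g. a distant road sign in a street photo.
detail = crop_and_zoom("photo.jpg", box=(400, 250, 600, 370), zoom=4)
detail.save("photo_detail.png")
```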
@www.analyticsvidhya.com
//
OpenAI's latest AI models, o3 and o4-mini, have been released with enhanced problem-solving capabilities and improved tool use, promising a step change in the ability of language models to tackle complex tasks. These reasoning models, now available to ChatGPT Plus, Pro, and Team users, demonstrate stronger proficiency in mathematical solutions, programming work, and even image interpretation. One notable feature is o3's native support for tool use, allowing it to organically utilize code execution, file retrieval, and web search during its reasoning process, a crucial aspect for modern Large Language Model (LLM) applications and agentic systems.
Despite these advances, however, o3 and o4-mini are drawing criticism for higher hallucination rates than older models: they tend to state invented facts as if they were real, a persistent issue OpenAI is actively working to address. Internal tests show o3 answering questions about people incorrectly 33% of the time, nearly double the hallucination rate of earlier models, and in one test o3 claimed it had run code on a MacBook laptop outside of ChatGPT, an example of the model inventing steps to appear more capable. The increase raises concerns about reliability in serious professional settings: lawyers could receive fabricated details in legal documents, doctors incorrect medical guidance, and students wrong answers in homework help.

Although OpenAI treats reducing hallucinations as a core goal, the exact cause and fix remain elusive. One proposed mitigation is to connect the model to the internet for fact-checking, much as GPT-4o achieves higher accuracy with web access, though this raises privacy concerns about sharing users' questions with search engines.
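The web-grounding mitigation mentioned above can be sketched as a verify-then-answer pipeline. web_search and call_model below are hypothetical stand-ins for a search API and a model endpoint, and the sketch also makes the privacy trade-off visible: every extracted claim is sent to the search engine.

```python
# Hedged sketch of the web-grounded fact-checking idea described above.
# web_search() and call_model() are hypothetical stand-ins for a search API and
# a model endpoint; each extracted claim is sent to the search engine, which is
# exactly the privacy trade-off the text mentions.

def web_search(query: str) -> str:
    raise NotImplementedError("wire this to a search API")

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to a model endpoint")

def grounded_answer(question: str) -> str:
    draft = call_model(f"Answer concisely: {question}")
    claims = call_model(f"List the checkable factual claims in:\n{draft}").splitlines()
    evidence = [web_search(claim) for claim in claims if claim.strip()]
    return call_model(
        "Revise the draft so every claim is supported by the evidence below, "
        "and flag anything unsupported.\n"
        f"Draft:\n{draft}\nEvidence:\n" + "\n".join(evidence)
    )
```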