News from the AI & ML world

DeeperML - #llms

nftjedi@chatgptiseatingtheworld.com //
Apple researchers recently published a study titled "The Illusion of Thinking," suggesting that large language models (LLMs) struggle with true reasoning and rely instead on pattern matching. The study based its findings on tasks like the Tower of Hanoi puzzle, where models purportedly failed as complexity increased, leading to the conclusion that these models possess limited problem-solving abilities. However, those conclusions are now under scrutiny, with critics arguing the experiments were not fairly designed.

Alex Lawsen of Open Philanthropy has published a counter-study challenging the foundations of Apple's claims. Lawsen argues that models like Claude, Gemini, and OpenAI's latest systems weren't failing due to cognitive limits, but rather because the evaluation methods didn't account for key technical constraints. One issue raised was that models were often cut off from providing full answers because they neared their maximum token limit, a built-in cap on output text, which Apple's evaluation counted as a reasoning failure rather than a practical limitation.
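Lawsen's token-budget point is easy to check with arithmetic: an exhaustive Tower of Hanoi solution for n disks must list 2**n - 1 moves, so printed solutions outgrow any fixed output cap exponentially. The sketch below illustrates this; the tokens-per-move estimate and the cap are illustrative assumptions, not figures from either study.

```python
def moves_required(n_disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n disks."""
    return 2 ** n_disks - 1

def fits_in_budget(n_disks: int, tokens_per_move: int = 10,
                   token_cap: int = 64_000) -> bool:
    """Would a fully enumerated solution fit under the output-token cap?

    Both defaults are rough illustrative guesses, not values from the study.
    """
    return moves_required(n_disks) * tokens_per_move <= token_cap

print(moves_required(10))   # 1023 moves: easily fits
print(fits_in_budget(10))   # True
print(fits_in_budget(15))   # 32767 moves at ~10 tokens each -> False
```

Under these assumptions, a model that reasons perfectly would still be scored as failing at 15 disks simply because the enumerated answer cannot physically fit in the output window.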

Another point of contention involved the River Crossing test, where models faced unsolvable problem setups. When the models correctly identified the tasks as impossible and refused to attempt them, they were still marked wrong. Furthermore, the evaluation system strictly judged outputs against exhaustive solutions, failing to credit models for partial but correct answers, pattern recognition, or strategic shortcuts. To illustrate, Lawsen demonstrated that when models were instructed to write a program to solve the Hanoi puzzle, they delivered accurate, scalable solutions even with 15 disks, contradicting Apple's assertion of limitations.
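Lawsen's program-writing variant is straightforward to reproduce: the classic recursive solver below (a generic sketch, not the models' actual output) emits a provably correct move sequence for any disk count, including the 15-disk case.

```python
def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C"):
    """Yield the optimal move sequence for n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)  # park n-1 disks on the spare peg
    yield (src, dst)                        # move the largest disk
    yield from hanoi(n - 1, aux, src, dst)  # stack the n-1 disks on top

moves = list(hanoi(15))
print(len(moves))  # 32767, i.e. 2**15 - 1
```

A compact program like this scales to any number of disks, which is the crux of Lawsen's objection: asking for code tests the reasoning, while asking for every move in plain text mostly tests the output budget.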

Recommended read:
References :
  • chatgptiseatingtheworld.com: Research: Did Apple researchers overstate “The Illusion of Thinking” in reasoning models? Opus, Lawsen think so.
  • Digital Information World: Apple’s AI Critique Faces Pushback Over Flawed Testing Methods
  • NextBigFuture.com: Apple Researcher Claims Illusion of AI Thinking Versus OpenAI Solving Ten Disk Puzzle
  • Bernard Marr: Beyond The Hype: What Apple's AI Warning Means For Business Leaders

Emilia David@AI News | VentureBeat //
Google's Gemini 2.5 Pro is making waves in the AI landscape, with claims of superior coding performance compared to leading models like DeepSeek R1 and Grok 3 Beta. The updated Gemini 2.5 Pro, currently in preview, is touted to deliver faster and more creative responses, particularly in coding and reasoning tasks. Google highlighted improvements across key benchmarks such as AIDER Polyglot, GPQA, and HLE, noting a significant Elo score jump since the previous version. This newest iteration, referred to as Gemini 2.5 Pro Preview 06-05, builds upon the I/O edition released earlier in May, promising even better performance and enterprise-scale capabilities.

Google is also planning several enhancements to the Gemini platform. These include upgrades to Canvas, Gemini’s workspace for organizing and presenting ideas, adding the ability to auto-generate infographics, timelines, mindmaps, full presentations, and web pages. There are also plans to integrate Imagen 4, which enhances image generation capabilities, image-to-video functionality, and an Enterprise mode, which offers a dedicated toggle to separate professional and personal workflows. This Enterprise mode aims to provide business users with clearer boundaries and improved data governance within the platform.

In addition to its coding prowess, Gemini 2.5 Pro boasts native audio capabilities, enabling developers to build richer and more interactive applications. Google emphasizes its proactive approach to safety and responsibility, embedding SynthID watermarking technology in all audio outputs to ensure transparency and identifiability of AI-generated audio. Developers can explore these native audio features through the Gemini API in Google AI Studio or Vertex AI, experimenting with audio dialog and controllable speech generation. Google DeepMind is also exploring ways for AI to take over mundane email chores, with CEO Demis Hassabis envisioning an AI assistant capable of sorting, organizing, and responding to emails in a user's own voice and style.

Recommended read:
References :
  • AI News | VentureBeat: Google claims Gemini 2.5 Pro preview beats DeepSeek R1 and Grok 3 Beta in coding performance
  • learn.aisingapore.org: Gemini 2.5’s native audio capabilities
  • Kyle Wiggers: Google says its updated Gemini 2.5 Pro AI model is better at coding
  • www.techradar.com: Google upgrades Gemini 2.5 Pro's already formidable coding abilities
  • SiliconANGLE: Google revamps Gemini 2.5 Pro again, claiming superiority in coding and math
  • siliconangle.com: SiliconAngle reports on Google's release of an updated Gemini 2.5 Pro model, highlighting its claimed superiority in coding and math.

Tulsee Doshi@The Official Google Blog //
Google has launched an upgraded preview of Gemini 2.5 Pro, touting it as its most intelligent model yet. Building upon the version revealed in May, this update demonstrates significant improvements in coding capabilities. One striking informal example, from Simon Willison's SVG-drawing benchmark, is the model producing a "pretty solid pelican riding a bicycle."

The model's behavior under ethical stress-testing has also drawn attention. When Simon Willison ran SnitchBench, a benchmark designed to probe the ethical boundaries of AI models, against Gemini 2.5 Pro, the model notably "tipped off both the feds and the WSJ and NYTimes." The episode highlights the kind of autonomous reporting behavior such tests are designed to surface in new models.

The rapid development and release of Gemini 2.5 Pro reflect Google's increasing confidence in its AI technology. The company emphasizes that this iteration offers substantial improvements over its predecessors, solidifying its position as a leading AI model. Developers and enthusiasts alike are encouraged to try the latest Gemini 2.5 Pro before its general release to experience its improved capabilities firsthand.

Recommended read:
References :
  • Kyle Wiggers: Google says its updated Gemini 2.5 Pro AI model is better at coding
  • The Official Google Blog: We’re introducing an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be…
  • THE DECODER: Google has rolled out another update to its flagship AI model, Gemini 2.5 Pro. The latest version brings modest improvements across a range of benchmarks and maintains top positions on tests like LMArena and WebDevArena.
  • www.zdnet.com: The flagship model's rapid evolution reflects Google's growing confidence in its AI offerings.
  • bsky.app: New Gemini 2.5 Pro is out - gemini-2.5-pro-preview-06-05 It made me a pretty solid pelican riding a bicycle, AND it tipped off both the feds and the WSJ and NYTimes when I tried running SnitchBench against it https://simonwillison.net/2025/Jun/5/gemini-25-pro-preview-06-05/
  • Simon Willison: New Gemini 2.5 Pro is out - gemini-2.5-pro-preview-06-05 It made me a pretty solid pelican riding a bicycle, AND it tipped off both the feds and the WSJ and NYTimes when I tried running SnitchBench against it
  • AI News | VentureBeat: Google claims Gemini 2.5 Pro preview beats DeepSeek R1 and Grok 3 Beta in coding performance
  • www.techradar.com: Google upgrades Gemini 2.5 Pro's already formidable coding abilities
  • SiliconANGLE: Google revamps Gemini 2.5 Pro again, claiming superiority in coding and math
  • the-decoder.com: Google Rolls Out Modest Improvements to Gemini 2.5 Pro
  • www.marktechpost.com: Google Introduces Open-Source Full-Stack AI Agent Stack Using Gemini 2.5 and LangGraph for Multi-Step Web Search, Reflection, and Synthesis
  • Maginative: How Google quietly upgraded Gemini 2.5 Pro.
  • Stack Overflow Blog: Ryan and Ben welcome Tulsee Doshi and Logan Kilpatrick from Google's DeepMind to discuss the advanced capabilities of the new Gemini 2.5, the importance of feedback loops for model improvement and reducing hallucinations, the necessity of great data for advancements, and enhancing developer experience through tool integration.

@www.linkedin.com //
Nvidia's Blackwell GPUs have achieved top rankings in the latest MLPerf Training v5.0 benchmarks, demonstrating breakthrough performance across various AI workloads. The NVIDIA AI platform delivered the highest performance at scale on every benchmark, including the most challenging large language model (LLM) test, Llama 3.1 405B pretraining. Nvidia was the only vendor to submit results on all MLPerf Training v5.0 benchmarks, highlighting the versatility of the NVIDIA platform across a wide array of AI workloads, including LLMs, recommendation systems, multimodal LLMs, object detection, and graph neural networks.

The at-scale submissions used two AI supercomputers powered by the NVIDIA Blackwell platform: Tyche, built using NVIDIA GB200 NVL72 rack-scale systems, and Nyx, based on NVIDIA DGX B200 systems. Nvidia collaborated with CoreWeave and IBM to submit GB200 NVL72 results using a total of 2,496 Blackwell GPUs and 1,248 NVIDIA Grace CPUs. The GB200 NVL72 systems achieved 90% scaling efficiency up to 2,496 GPUs, improving time-to-convergence by up to 2.6x compared to Hopper-generation H100.
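The 90% scaling-efficiency figure has a simple definition: achieved speedup divided by the ideal linear speedup for the added GPUs. The sketch below shows the calculation; the baseline GPU count and timings are made-up numbers chosen to illustrate the formula, not MLPerf results.

```python
def scaling_efficiency(base_gpus: int, base_time: float,
                       scaled_gpus: int, scaled_time: float) -> float:
    """Strong-scaling efficiency: achieved speedup over ideal linear speedup."""
    ideal_speedup = scaled_gpus / base_gpus
    actual_speedup = base_time / scaled_time
    return actual_speedup / ideal_speedup

# Hypothetical timings chosen so the result comes out to ~90%:
# scaling from an assumed 512-GPU baseline up to 2,496 GPUs (4.875x hardware).
eff = scaling_efficiency(512, 100.0, 2496, 100.0 / (4.875 * 0.9))
print(round(eff, 2))  # 0.9
```

Efficiency below 1.0 reflects communication and synchronization overhead; holding 90% at nearly 2,500 GPUs is the notable part of NVIDIA's claim, since that overhead normally grows with scale.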

The new MLPerf Training v5.0 benchmark suite introduces a pretraining benchmark based on the Llama 3.1 405B generative AI system, the largest model to be introduced in the training benchmark suite. On this benchmark, Blackwell delivered 2.2x greater performance compared with the previous-generation architecture at the same scale. Furthermore, on the Llama 2 70B LoRA fine-tuning benchmark, NVIDIA DGX B200 systems, powered by eight Blackwell GPUs, delivered 2.5x more performance compared with a submission using the same number of GPUs in the prior round. These performance gains highlight advancements in the Blackwell architecture and software stack, including high-density liquid-cooled racks, fifth-generation NVLink and NVLink Switch interconnect technologies, and NVIDIA Quantum-2 InfiniBand networking.

Recommended read:
References :
  • NVIDIA Newsroom: NVIDIA Blackwell Delivers Breakthrough Performance in Latest MLPerf Training Results
  • NVIDIA Technical Blog: NVIDIA Blackwell Delivers up to 2.6x Higher Performance in MLPerf Training v5.0
  • IEEE Spectrum: Nvidia’s Blackwell Conquers Largest LLM Training Benchmark
  • NVIDIA Technical Blog: Reproducing NVIDIA MLPerf v5.0 Training Scores for LLM Benchmarks
  • AI News | VentureBeat: Nvidia says its Blackwell chips lead benchmarks in training AI LLMs
  • MLCommons: New MLCommons MLPerf Training v5.0 Benchmark Results Reflect Rapid Growth and Evolution of the Field of AI
  • www.aiwire.net: MLPerf Training v5.0 results show Nvidia’s Blackwell GB200 accelerators sprinting through record time-to-train scores.
  • blogs.nvidia.com: NVIDIA is working with companies worldwide to build out AI factories — speeding the training and deployment of next-generation AI applications that use the latest advancements in training and inference. The NVIDIA Blackwell architecture is built to meet the heightened performance requirements of these new applications.
  • mlcommons.org: New MLCommons MLPerf Training v5.0 Benchmark Results Reflect Rapid Growth and Evolution of the Field of AI
  • NVIDIA Newsroom: NVIDIA RTX Blackwell GPUs Accelerate Professional-Grade Video Editing
  • ServeTheHome: The new MLPerf Training v5.0 are dominated by NVIDIA Blackwell and Hopper results, but we also get AMD Instinct MI325X on a benchmark as well
  • AIwire: Blackwell GPUs lift Nvidia to the top of MLPerf Training rankings
  • www.servethehome.com: MLPerf Training v5.0 is Out

@www.quantamagazine.org //
Researchers are making strides in AI reasoning and efficiency, tackling both complex problem-solving and the energy consumption of these systems. One promising area involves reversible computing, where programs can run backward as easily as forward, theoretically saving energy by avoiding data deletion. Michael Frank, a researcher interested in the physical limits of computation, found that reversible computing could sustain computational progress even as conventional computing slows against those limits. Christof Teuscher at Portland State University emphasized the potential for significant power savings with this approach.

An evolution of the LLM-as-a-Judge paradigm is emerging. Meta AI has introduced the J1 framework which shifts the paradigm of LLMs from passive generators to active, deliberative evaluators through self-evaluation. This approach, detailed in "J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning," addresses the growing need for rigorous and scalable evaluation as AI systems become more capable and widely deployed. By reframing judgment as a structured reasoning task trained through reinforcement learning, J1 aims to create models that perform consistent, interpretable, and high-fidelity evaluations.

Soheil Feizi, an associate professor at the University of Maryland, has received a $1 million federal grant to advance foundational research in reasoning AI models. This funding, stemming from a Presidential Early Career Award for Scientists and Engineers (PECASE), will support his work in defending large language models (LLMs) against attacks, identifying weaknesses in how these models learn, encouraging transparent, step-by-step logic, and understanding the "reasoning tokens" that drive decision-making. Feizi plans to explore innovative approaches like live activation probing and novel reinforcement-learning designs, aiming to transform theoretical advancements into practical applications and real-world usages.

Recommended read:
References :

@www.marktechpost.com //
Large Language Models (LLMs) are facing significant challenges in handling real-world conversations, particularly those involving multiple turns and underspecified tasks. Researchers from Microsoft and Salesforce have recently revealed a substantial performance drop of 39% in LLMs when confronted with such conversational scenarios. This decline highlights the difficulty these models have in maintaining contextual coherence and delivering accurate outcomes as conversations evolve and new information is incrementally introduced. Instead of flexibly adjusting to changing user inputs, LLMs often make premature assumptions, leading to errors that persist throughout the dialogue.

These findings underscore a critical gap in how LLMs are currently evaluated. Traditional benchmarks often rely on single-turn, fully-specified prompts, which fail to capture the complexities of real-world interactions where information is fragmented and context must be actively constructed from multiple exchanges. This discrepancy between evaluation methods and actual conversational demands contributes to the challenges LLMs face in integrating underspecified inputs and adapting to evolving user needs. The research emphasizes the need for new evaluation frameworks that better reflect the dynamic and iterative nature of real-world conversations.
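The gap the researchers describe can be pictured as the same specification delivered two ways. The toy example below (the task wording is hypothetical, not taken from the paper) contrasts a fully-specified single-turn prompt with the kind of sharded multi-turn delivery the study evaluates.

```python
# One fully-specified request, as traditional benchmarks pose it:
full_prompt = (
    "Write a Python function that parses a CSV file, "
    "skips the header row, and returns rows as dicts."
)

# The same task revealed incrementally, as real conversations tend to go:
sharded_turns = [
    "Write a Python function that parses a CSV file.",
    "Actually, it should skip the header row.",
    "Oh, and return the rows as dicts.",
]

def rebuild_context(turns: list[str]) -> str:
    """A multi-turn evaluator must reassemble the full spec across turns."""
    return " ".join(turns)

# Same information either way; the reported 39% drop is the cost of the
# second delivery style, where premature assumptions from early turns persist.
print(len(sharded_turns), "turns vs 1 turn")
```

The point of the contrast is that the information content is identical; only the delivery differs, which is why the authors argue single-turn benchmarks overstate real-world reliability.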

In contrast to these challenges, Google's DeepMind has developed AlphaEvolve, an AI agent designed to optimize code and reclaim computational resources. AlphaEvolve autonomously rewrites critical code, resulting in a 0.7% reduction in Google's overall compute usage. This system not only pays for itself but also demonstrates the potential for AI agents to significantly improve efficiency in complex computational environments. AlphaEvolve's architecture, featuring a controller, fast-draft models, deep-thinking models, automated evaluators, and versioned memory, represents a production-grade approach to agent engineering. This allows for continuous improvement at scale.

Recommended read:
References :
  • AI News | VentureBeat: Google’s AlphaEvolve: The AI agent that reclaimed 0.7% of Google’s compute – and how to copy it.
  • MarkTechPost: LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks.

Kevin Okemwa@windowscentral.com //
OpenAI has announced the release of GPT-4.1 and GPT-4.1 mini, the latest iterations of their large language models, now accessible within ChatGPT. This move marks the first time GPT-4.1 is available outside of the API, opening up its capabilities to a broader user base. GPT-4.1 is designed as a specialized model that excels at coding tasks and instruction following, making it a valuable tool for developers and users with coding needs. OpenAI is making the models accessible via the “more models” dropdown selection in the top corner of the chat window within ChatGPT, giving users the flexibility to choose between GPT-4.1, GPT-4.1 mini, and other models.

The GPT-4.1 model is being rolled out to paying subscribers of ChatGPT Plus, Pro, and Team, with Enterprise and Education users expected to gain access in the coming weeks. For free users, OpenAI is introducing GPT-4.1 mini, which replaces GPT-4o mini as the default model once the daily GPT-4o limit is reached. The "mini" version is a smaller, less powerful model that maintains similar safety standards. OpenAI's decision to add GPT-4.1 to ChatGPT was driven by popular demand, despite initial plans to keep it exclusive to the API.

GPT-4.1 was built with developer needs and production use cases in mind. The company claims GPT-4.1 delivers a 21.4-point improvement over GPT-4o on the SWE-bench Verified software engineering benchmark, and a 10.5-point gain on instruction-following tasks in Scale's MultiChallenge benchmark. In addition, it reduces verbosity by 50% compared to other models, a trait enterprise users praised during early testing. The model supports standard ChatGPT context windows, ranging from 8,000 tokens for free users to 128,000 tokens for Pro users.

Recommended read:
References :
  • THE DECODER: OpenAI is rolling out its GPT-4.1 model to ChatGPT, making it available outside the API for the first time.
  • AI News | VentureBeat: OpenAI is rolling out GPT-4.1, its new non-reasoning large language model (LLM) that balances high performance with lower cost, to users of ChatGPT.
  • www.techradar.com: ChatGPT 4.1 and 4.1 mini are now available, bringing improvements to coding and the ability to follow tasks.
  • Simon Willison's Weblog: By popular request, GPT-4.1 will be available directly in ChatGPT starting today. GPT-4.1 is a specialized model that excels at coding tasks & instruction following. Because it’s faster, it’s a great alternative to OpenAI o3 & o4-mini for everyday coding needs.
  • gHacks Technology News: OpenAI has announced that ChatGPT users can now access GPT-4.1 and GPT-4.1 mini AI models. The good news is that GPT-4.1 mini is available for free users.
  • Maginative: OpenAI Brings GPT-4.1 to ChatGPT
  • www.windowscentral.com: OpenAI is bringing GPT-4.1 and GPT-4.1 mini to ChatGPT, and the new AI models excel in web development and coding tasks compared to OpenAI o3 & o4-mini.

Matthias Bastian@THE DECODER //
OpenAI has announced the integration of GPT-4.1 and GPT-4.1 mini models into ChatGPT, aimed at enhancing coding and web development capabilities. The GPT-4.1 model, designed as a specialized model excelling at coding tasks and instruction following, is now available to ChatGPT Plus, Pro, and Team users. According to OpenAI, GPT-4.1 is faster and a great alternative to OpenAI o3 & o4-mini for everyday coding needs, providing more help to developers creating applications.

OpenAI is also rolling out GPT-4.1 mini, which will be available to all ChatGPT users, including those on the free tier, replacing the previous GPT-4o mini model. This model serves as the fallback option once GPT-4o usage limits are reached. The release notes confirm that GPT-4.1 mini offers various improvements over GPT-4o mini, including instruction-following, coding, and overall intelligence. This initiative is part of OpenAI's effort to make advanced AI tools more accessible and useful for a broader audience, particularly those engaged in programming and web development.

Johannes Heidecke, Head of Systems at OpenAI, has emphasized that the new models build upon the safety measures established for GPT-4o, ensuring parity in safety performance. According to Heidecke, no new safety risks have been introduced, as GPT-4.1 doesn’t introduce new modalities or ways of interacting with the AI, and that it doesn’t surpass o3 in intelligence. The rollout marks another step in OpenAI's increasingly rapid model release cadence, significantly expanding access to specialized capabilities in web development and coding.

Recommended read:
References :
  • twitter.com: GPT-4.1 is a specialized model that excels at coding tasks & instruction following. Because it’s faster, it’s a great alternative to OpenAI o3 & o4-mini for everyday coding needs.
  • www.computerworld.com: OpenAI adds GPT-4.1 models to ChatGPT
  • gHacks Technology News: OpenAI releases GPT-4.1 and GPT-4.1 mini AI models for ChatGPT
  • Maginative: OpenAI Brings GPT-4.1 to ChatGPT
  • www.windowscentral.com: “Am I crazy or is GPT-4.1 the best model for coding?” ChatGPT gets new models with exemplary web development capabilities — but OpenAI is under fire for allegedly skimming through safety processes
  • the-decoder.com: OpenAI brings its new GPT-4.1 model to ChatGPT users
  • AI News | VentureBeat: OpenAI is rolling out GPT-4.1, its new non-reasoning large language model (LLM) that balances high performance with lower cost, to users of ChatGPT.
  • www.techradar.com: OpenAI just gave ChatGPT users a huge free upgrade – 4.1 mini is available today
  • www.marktechpost.com: OpenAI has introduced Codex, a cloud-native software engineering agent integrated into ChatGPT, signaling a new era in AI-assisted software development.

Kevin Okemwa@windowscentral.com //
OpenAI has launched GPT-4.1 and GPT-4.1 mini, the latest iterations of its language models, now integrated into ChatGPT. This upgrade aims to provide users with enhanced coding and instruction-following capabilities. GPT-4.1, available to paid ChatGPT subscribers including Plus, Pro, and Team users, excels at programming tasks and provides a smarter, faster, and more useful experience, especially for coders. Additionally, Enterprise and Edu users are expected to gain access in the coming weeks.

GPT-4.1 mini, on the other hand, is being introduced to all ChatGPT users, including those on the free tier, replacing the previous GPT-4o mini model. It serves as a fallback option when GPT-4o usage limits are reached. OpenAI says GPT-4.1 mini is a "fast, capable, and efficient small model". This approach democratizes access to improved AI, ensuring that even free users benefit from advancements in language model technology.

Both GPT-4.1 and GPT-4.1 mini demonstrate OpenAI's commitment to rapidly advancing its AI model offerings. Initial plans were to release GPT-4.1 via API only for developers, but strong user feedback changed that. The company claims GPT-4.1 excels at following specific instructions, is less "chatty", and is more thorough than older versions of GPT-4o. OpenAI also notes that GPT-4.1's safety performance is at parity with GPT-4o, showing improvements can be delivered without new safety risks.

Recommended read:
References :
  • Maginative: OpenAI has integrated its GPT-4.1 model into ChatGPT, providing enhanced coding and instruction-following capabilities to paid users, while also introducing GPT-4.1 mini for all users.
  • pub.towardsai.net: AI Passes Physician-Level Responses in OpenAI’s HealthBench
  • THE DECODER: OpenAI brings its new GPT-4.1 model to ChatGPT users
  • AI News | VentureBeat: OpenAI brings GPT-4.1 and 4.1 mini to ChatGPT — what enterprises should know
  • www.zdnet.com: OpenAI's HealthBench shows AI's medical advice is improving - but who will listen?
  • www.techradar.com: OpenAI just gave ChatGPT users a huge free upgrade – 4.1 mini is available today
  • Simon Willison's Weblog: GPT-4.1 will be available directly in ChatGPT starting today. GPT-4.1 is a specialized model that excels at coding tasks & instruction following.
  • www.windowscentral.com: OpenAI is bringing GPT-4.1 and GPT-4.1 mini to ChatGPT, and the new AI models excel in web development and coding tasks compared to OpenAI o3 & o4-mini.
  • www.zdnet.com: GPT-4.1 makes ChatGPT smarter, faster, and more useful for paying users, especially coders
  • www.computerworld.com: OpenAI adds GPT-4.1 models to ChatGPT
  • gHacks Technology News: OpenAI releases GPT-4.1 and GPT-4.1 mini AI models for ChatGPT
  • twitter.com: By popular request, GPT-4.1 will be available directly in ChatGPT starting today. GPT-4.1 is a specialized model that excels at coding tasks & instruction following. Because it’s faster, it’s a great alternative to OpenAI o3 & o4-mini for everyday coding needs.
  • www.ghacks.net: Reports on GPT-4.1 and GPT-4.1 mini AI models in ChatGPT, noting their accessibility to both paid and free users.
  • eWEEK: OpenAI rolls out GPT-4.1 and GPT-4.1 mini to ChatGPT, offering smarter coding and instruction-following tools for free and paid users.

@learn.aisingapore.org //
Anthropic's Claude 3.7 model is making waves in the AI community due to its enhanced reasoning capabilities, specifically through a "deep thinking" approach. This method utilizes chain-of-thought (CoT) techniques, enabling Claude 3.7 to tackle complex problems more effectively. This development represents a significant advancement in Large Language Model (LLM) technology, promising improved performance in a variety of demanding applications.
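Chain-of-thought prompting of the sort described here is conceptually simple: ask the model for intermediate steps, then parse out only the final answer. A minimal sketch, with prompt wording that is illustrative rather than Anthropic's:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question in a simple chain-of-thought instruction."""
    return (
        f"{question}\n"
        "Think through the problem step by step, then state the final "
        "answer on a line starting with 'Answer:'."
    )

def extract_answer(response: str) -> str:
    """Pull out the final answer, ignoring the intermediate reasoning."""
    for line in reversed(response.splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return response.strip()  # fall back to the raw response

# A plausible (hand-written) model response, for illustration:
sample = "Step 1: 17 * 3 = 51\nStep 2: 51 + 4 = 55\nAnswer: 55"
print(extract_answer(sample))  # 55
```

Claude 3.7's "deep thinking" mode builds this elicitation into the model itself rather than relying on prompt wording, but the basic division between visible intermediate steps and a final parsed answer is the same.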

The implications of this enhanced reasoning are already being seen across different sectors. FloQast, for example, is leveraging Anthropic's Claude 3 on Amazon Bedrock to develop an AI-powered accounting transformation solution. The integration of Claude’s capabilities is assisting companies in streamlining their accounting operations, automating reconciliations, and gaining real-time visibility into financial operations. The model’s ability to handle the complexities of large-scale accounting transactions highlights its potential for real-world applications.

Furthermore, recent reports highlight the competitive landscape where models like Mistral AI's Medium 3 are being compared to Claude Sonnet 3.7. These comparisons focus on balancing performance, cost-effectiveness, and ease of deployment. Simultaneously, Anthropic is also enhancing Claude's functionality by allowing users to connect more applications, expanding its utility across various domains. These advancements underscore the ongoing research and development efforts aimed at maximizing the potential of LLMs and addressing potential security vulnerabilities.

Recommended read:
References :
  • learn.aisingapore.org: This article describes how FloQast utilizes Anthropic’s Claude 3 on Amazon Bedrock for its accounting transformation solution.
  • Last Week in AI: LWiAI Podcast #208 - Claude Integrations, ChatGPT Sycophancy, Leaderboard Cheats
  • techcrunch.com: Anthropic lets users connect more apps to Claude
  • Towards AI: The New AI Model Paradox: When “Upgrades” Feel Like Downgrades (Claude 3.7)
  • Towards AI: How to Achieve Structured Output in Claude 3.7: Three Practical Approaches

@the-decoder.com //
OpenAI is making strides in AI customization and application development with the release of Reinforcement Fine-Tuning (RFT) on its o4-mini reasoning model and the appointment of Fidji Simo as the CEO of Applications. The RFT release allows organizations to tailor their versions of the o4-mini model to specific tasks using custom objectives and reward functions, marking a significant advancement in model optimization. This approach utilizes reinforcement learning principles, where developers provide a task-specific grader that evaluates and scores model outputs based on custom criteria, enabling the model to optimize against a reward signal and align with desired behaviors.

Reinforcement Fine-Tuning is particularly valuable for complex or subjective tasks where ground truth is difficult to define. By using RFT on o4-mini, a compact reasoning model optimized for text and image inputs, developers can fine-tune for high-stakes, domain-specific reasoning tasks while maintaining computational efficiency. Early adopters have demonstrated the practical potential of RFT. This capability allows developers to tweak the model to better fit their needs using OpenAI's platform dashboard, deploy it through OpenAI's API, and connect it to internal systems.
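The grader-and-reward loop described above can be sketched generically. The function names and scoring scheme below are hypothetical stand-ins, not OpenAI's actual RFT API; they only illustrate how a task-specific grader turns model outputs into reward signals.

```python
def grade(output: str, reference: dict) -> float:
    """Hypothetical task-specific grader: score an output in [0, 1].

    Here the custom criteria are citing a required term and staying concise;
    a real grader would encode whatever the organization cares about.
    """
    score = 0.0
    if reference["required_term"] in output:
        score += 0.5
    if len(output) <= reference["max_chars"]:
        score += 0.5
    return score

def rft_step(samples: list[str], reference: dict) -> tuple[float, str]:
    """One conceptual RFT step: grade sampled outputs, prefer high scorers."""
    scored = [(grade(s, reference), s) for s in samples]
    scored.sort(reverse=True)
    return scored[0]  # the best sample gets the largest reward weight

best = rft_step(
    ["The claim is supported by clause 4.2.", "Maybe? Hard to say..."],
    {"required_term": "clause 4.2", "max_chars": 80},
)
print(best)
```

The key design point is that the grader, not a labeled dataset, defines success, which is why RFT suits subjective or hard-to-specify tasks where exact ground truth is unavailable.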

In a move to scale its AI products, OpenAI has appointed Fidji Simo, formerly CEO of Instacart, as the CEO of Applications. Simo will oversee the scaling of AI products, leveraging her extensive experience in consumer tech to drive revenue generation from OpenAI's research and development efforts. Previously serving on OpenAI's board of directors, Simo's background in leading development at Facebook suggests a focus on end-users rather than businesses, potentially paving the way for new subscription services and products aimed at a broader audience. OpenAI is also rolling out a new GitHub connector for ChatGPT's deep research agent, allowing users with Plus, Pro, or Team subscriptions to connect their repositories and ask questions about their code.

Recommended read:
References :
  • AI News | VentureBeat: You can now fine-tune your enterprise’s own version of OpenAI’s o4-mini reasoning model with reinforcement learning
  • www.computerworld.com: OpenAI was founded a decade ago with a focus on research, but it has since expanded into products and infrastructure. Now it is looking to again broaden its presence into user-facing apps. The company announced this week that Fidji Simo will join as CEO of applications, a newly-created position. Simo is the current CEO and chair at grocery delivery company Instacart. She will begin her new role at OpenAI later this year, reporting directly to Sam Altman, who will remain overall CEO and oversee research, compute, and applications.
  • the-decoder.com: OpenAI has appointed Fidji Simo as CEO of its new Applications division, reporting directly to OpenAI CEO Sam Altman.
  • www.marktechpost.com: OpenAI Releases Reinforcement Fine-Tuning (RFT) on o4-mini: A Step Forward in Custom Model Optimization
  • the-decoder.com: OpenAI is expanding its fine-tuning program for o4-mini, introducing Reinforcement Fine-Tuning (RFT) for organizations. The method is designed to help tailor models like o4-mini to highly specific tasks with the help of a programmable grading system.
  • Maginative: OpenAI brings reinforcement fine-tuning and GPT-4.1 Nano Fine-Tuning in the API
  • MarkTechPost: OpenAI Releases Reinforcement Fine-Tuning (RFT) on o4-mini: A Step Forward in Custom Model Optimization
  • Techzine Global: OpenAI opens the door to reinforcement fine-tuning for o4-mini
  • THE DECODER: OpenAI is expanding its fine-tuning program for o4-mini, introducing Reinforcement Fine-Tuning (RFT) for organizations. The method is designed to help tailor models like o4-mini to highly specific tasks with the help of a programmable grading system.
  • AI News | VentureBeat: Last night, OpenAI published a blog post on its official website authored by CEO and co-founder Sam Altman announcing a major new hire: Fidji Simo, currently CEO and Chair at grocery delivery company Instacart, will join OpenAI as CEO of Applications, a newly created executive position. Simo will …
  • techxplore.com: OpenAI offers to help countries build AI systems
  • The Register - Software: OpenAI drafts Instacart boss as CEO of Apps to lure in the normies

@the-decoder.com //
OpenAI is expanding its global reach through strategic partnerships with governments and the introduction of advanced model customization tools. The organization has launched the "OpenAI for Countries" program, an initiative designed to collaborate with governments worldwide on building robust AI infrastructure. This program aims to assist nations in setting up data centers and adapting OpenAI's products to meet local language and specific needs. OpenAI envisions this initiative as part of a broader global strategy to foster cooperation and advance AI capabilities on an international scale.

This expansion also includes technological advancements, with OpenAI releasing Reinforcement Fine-Tuning (RFT) for its o4-mini reasoning model. RFT enables enterprises to fine-tune their own versions of the model using reinforcement learning, tailoring it to their unique data and operational requirements. Developers can customize the model through OpenAI's platform dashboard, tweaking it for internal terminology, goals, processes, and more. Once deployed, if an employee or leader at the company wants to use it through a custom internal chatbot or custom OpenAI GPT to pull up private, proprietary company knowledge, answer specific questions about company products and policies, or generate new communications and collateral in the company's voice, they can do so more easily with their RFT version of the model.
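As a rough sketch of what submitting such a job looks like, the request body below follows the shape of OpenAI's published fine-tuning API with a reinforcement method and a programmable grader; treat the exact grader schema and field names as assumptions to verify against the current API reference:

```python
# Sketch of a Reinforcement Fine-Tuning (RFT) job request body for o4-mini.
# The overall shape follows OpenAI's fine-tuning API; treat the grader
# schema below as an assumption and check the current documentation.

rft_job = {
    "model": "o4-mini",
    "training_file": "file-abc123",  # JSONL of prompts plus reference answers
    "method": {
        "type": "reinforcement",
        "reinforcement": {
            # A programmable grader scores each sampled answer; here, a
            # simple string check against a reference field in the data.
            "grader": {
                "type": "string_check",
                "name": "exact_match",
                "input": "{{sample.output_text}}",
                "reference": "{{item.correct_answer}}",
                "operation": "eq",
            },
            "hyperparameters": {"n_epochs": 1},
        },
    },
}

# With the official client this would be submitted roughly as:
#   client.fine_tuning.jobs.create(**rft_job)
```

The grading system is what distinguishes RFT from supervised fine-tuning: instead of imitating labeled outputs, the model is rewarded according to the grader's score on each sampled answer.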

The "OpenAI for Countries" program is slated to begin with ten international projects, supported by funding from both OpenAI and participating governments. Chris Lehane, OpenAI's vice president of global policy, indicated that the program was inspired by the AI Action Summit in Paris, where several countries expressed interest in establishing their own "Stargate"-style projects. Moreover, the release of RFT on o4-mini signifies a major step forward in custom model optimization, offering developers a powerful new technique for tailoring foundation models to specialized tasks. This allows for fine-grained control over how models improve, by defining custom objectives and reward functions.

Recommended read:
References :
  • the-decoder.com: OpenAI launches a program to partner with governments on global AI infrastructure
  • AI News | VentureBeat: You can now fine-tune your enterprise’s own version of OpenAI’s o4-mini reasoning model with reinforcement learning
  • www.marktechpost.com: OpenAI releases Reinforcement Fine-Tuning (RFT) on o4-mini: A Step Forward in Custom Model Optimization
  • MarkTechPost: OpenAI Releases Reinforcement Fine-Tuning (RFT) on o4-mini: A Step Forward in Custom Model Optimization
  • AI News | VentureBeat: OpenAI names Instacart leader Fidji Simo as new CEO of Applications
  • techxplore.com: OpenAI offers to help countries build AI systems
  • THE DECODER: OpenAI adds new fine-tuning options for o4-mini and GPT-4.1
  • the-decoder.com: OpenAI is expanding its fine-tuning program for o4-mini, introducing Reinforcement Fine-Tuning (RFT) for organizations.
  • Techzine Global: OpenAI opens the door to reinforcement fine-tuning for o4-mini
  • Maginative: OpenAI Brings Reinforcement Fine-Tuning and GPT-4.1 Nano Fine-Tuning in the API

@www.marktechpost.com //
Meta is making significant strides in the AI landscape, highlighted by the release of Llama Prompt Ops, a Python package aimed at streamlining prompt adaptation for Llama models. This open-source tool helps developers enhance prompt effectiveness by transforming inputs to better suit Llama-based LLMs, addressing the challenge of inconsistent performance across different AI models. Llama Prompt Ops facilitates smoother cross-model prompt migration and improves performance and reliability, featuring a transformation pipeline for systematic prompt optimization.

Meanwhile, Meta is expanding its AI strategy with the launch of a standalone Meta AI app, powered by Llama 4, to compete with rivals like Microsoft’s Copilot and ChatGPT. This app is designed to function as a general-purpose chatbot and a replacement for the “Meta View” app used with Meta Ray-Ban glasses, integrating a social component with a public feed showcasing user interactions with the AI. Meta also previewed its Llama API, designed to simplify the integration of its Llama models into third-party products, attracting AI developers with an open-weight model that supports modular, specialized applications.

However, Meta's AI advancements are facing legal challenges, as a US judge is questioning the company's claim that training AI on copyrighted books constitutes fair use. The case, focusing on Meta's Llama model, involves training data including works by Sarah Silverman. The judge raised concerns that using copyrighted material to create a product capable of producing an infinite number of competing products could undermine the market for original works, potentially obligating Meta to pay licenses to copyright holders.

Recommended read:
References :
  • the-decoder.com: US judge questions Meta's claim that training AI on copyrighted books is fair use
  • Ken Yeung: IN THIS ISSUE: Meta hosts its first-ever event around its Llama model, launching a standalone app to take on Microsoft’s Copilot and ChatGPT.
  • MarkTechPost: Meta AI has released Llama Prompt Ops, a Python package designed to streamline the process of adapting prompts for Llama models.
  • Towards AI: Meta AI has unveiled Llama 4, the latest iteration of its open large language models, marking a substantial breakthrough with native multimodality at its core.

@docs.llamaindex.ai //
LlamaIndex is advancing agentic systems design by focusing on the optimal blend of autonomy and structure, particularly through its innovative Workflows system. Workflows provide an event-based mechanism for orchestrating agent execution, connecting individual steps implemented as vanilla functions. This approach enables developers to create chains, branches, loops, and collections within their agentic systems, aligning with established design patterns for effective agents. The system, available in both Python and TypeScript frameworks, is fundamentally simple yet powerful, allowing for complex orchestration of agentic tasks.

LlamaIndex Workflows support hybrid systems by allowing decisions about control flow to be made by LLMs, traditional imperative programming, or a combination of both. This flexibility is crucial for building robust and adaptable AI solutions. Furthermore, Workflows not only facilitate the implementation of agents but also enable the use of sub-agents within each step. This hierarchical agent design can be leveraged to decompose complex tasks into smaller, more manageable units, enhancing the overall efficiency and effectiveness of the system.
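The event-based pattern that Workflows implement can be illustrated with a toy orchestrator. This is a deliberately simplified stand-in, not LlamaIndex's actual API (the real framework provides `@step` decorators, `StartEvent`/`StopEvent` types, and async execution):

```python
# Toy event-driven orchestrator illustrating the Workflows pattern:
# steps are plain functions that consume one event type and emit another,
# and the runner dispatches events until a stop event is produced.
# A simplified stand-in for illustration, not the actual LlamaIndex API.

class Event:
    def __init__(self, payload=None):
        self.payload = payload

class StartEvent(Event): pass
class RetrievedEvent(Event): pass
class StopEvent(Event): pass

def retrieve(ev: StartEvent) -> RetrievedEvent:
    # In a real system this step might call a retriever, a tool, or an LLM.
    return RetrievedEvent(payload=f"docs for: {ev.payload}")

def synthesize(ev: RetrievedEvent) -> StopEvent:
    return StopEvent(payload=f"answer from {ev.payload}")

# Map each event type to the step that handles it; branches and loops
# fall out naturally from steps choosing which event type to emit next.
STEPS = {StartEvent: retrieve, RetrievedEvent: synthesize}

def run(start: StartEvent) -> str:
    ev = start
    while not isinstance(ev, StopEvent):
        ev = STEPS[type(ev)](ev)
    return ev.payload

result = run(StartEvent(payload="what is RAG?"))
```

Because control flow lives in which event a step emits, a step could equally ask an LLM to choose the next event, which is how the hybrid LLM-plus-imperative orchestration described above arises.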

The introduction of Workflows underscores LlamaIndex's commitment to providing developers with the tools they need to build sophisticated knowledge assistants and agentic applications. By offering a system that balances autonomy with structured execution, LlamaIndex is addressing the need for design principles when building agents. The company draws from its experience with LlamaCloud and its collaboration with enterprise customers to offer a system that integrates agents, sub-agents, and flexible decision-making capabilities.

Recommended read:
References :
  • Blog on LlamaIndex: LlamaIndex Blog post Bending without breaking: optimal design patterns for effective agents
  • docs.llamaindex.ai: Bending without breaking: optimal design patterns for effective agents

@the-decoder.com //
References: composio.dev, THE DECODER
OpenAI is actively benchmarking its language models, including o3 and o4-mini, against competitors like Gemini 2.5 Pro to evaluate their performance in reasoning and tool-use efficiency. Benchmarks like the Aider polyglot coding test show that o3 leads in some areas, achieving a new state-of-the-art score of 79.60% compared to Gemini 2.5's 72.90%. However, this performance comes at a higher cost, with o3 being significantly more expensive. o4-mini offers a slightly more balanced price-performance ratio, costing less than o3 while still surpassing Gemini 2.5 on certain tasks. Testing reveals Gemini 2.5 excels in context awareness and iterating on code, making it preferable for real-world use cases, while o4-mini surprisingly excelled in competitive programming.

OpenAI has also launched its GPT-Image-1 model for image generation to developers via API; previously, this model was only accessible through ChatGPT. The model is versatile: it can create images in diverse styles, follow custom guidelines, draw on world knowledge, and accurately render text. The company's blog post said this unlocks countless practical applications across multiple domains.

Several enterprises and startups are already incorporating the model for creative projects, products, and experiences. Image processing with GPT-Image-1 is billed by tokens. Text input tokens, or the prompt text, will cost $5 per 1 million tokens. Image input tokens will be $10 per million tokens, while image output tokens, or the generated image, will be a whopping $40 per million tokens. Depending on the selected image quality, costs typically range from $0.02 to $0.19 per image.
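At those per-million-token rates, the cost of a single generation can be estimated directly. The token counts in the example below are illustrative assumptions; actual counts depend on prompt length and the chosen output quality and resolution:

```python
# Rough GPT-Image-1 cost estimate from the published per-million-token
# rates. The token counts used below are illustrative assumptions.

RATE_PER_M = {"text_in": 5.00, "image_in": 10.00, "image_out": 40.00}

def image_cost(text_in_tokens, image_in_tokens, image_out_tokens):
    """Return the dollar cost of one generation."""
    return (
        text_in_tokens / 1e6 * RATE_PER_M["text_in"]
        + image_in_tokens / 1e6 * RATE_PER_M["image_in"]
        + image_out_tokens / 1e6 * RATE_PER_M["image_out"]
    )

# e.g. a 100-token prompt and roughly 4,000 output tokens for one image:
cost = image_cost(text_in_tokens=100, image_in_tokens=0, image_out_tokens=4000)
# about $0.16, inside the quoted $0.02-$0.19 per-image range
```

Output tokens dominate the bill, which is why image quality (and hence output token count) is the main cost lever.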


Michael Nuñez@AI News | VentureBeat //
Amazon Web Services (AWS) has announced significant advancements in its AI coding and Large Language Model (LLM) infrastructure. A key highlight is the introduction of SWE-PolyBench, a comprehensive multi-language benchmark designed to evaluate the performance of AI coding assistants. This benchmark addresses the limitations of existing evaluation frameworks by assessing AI agents across a diverse range of programming languages like Python, JavaScript, TypeScript, and Java, using real-world scenarios derived from over 2,000 curated coding challenges from GitHub issues. The aim is to provide researchers and developers with a more accurate understanding of how well these tools can navigate complex codebases and solve intricate programming tasks involving multiple files.

The latest Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4, further enhances LLM capabilities. This version supports a wider array of open-source models, including Meta’s Llama 4 models and Google’s Gemma 3, providing users with more flexibility in model selection. LMI v15 delivers significant performance improvements through an async mode and support for the vLLM V1 engine, resulting in higher throughput and reduced CPU overhead. This enables seamless deployment and serving of large language models at scale, with expanded API schema support and multimodal capabilities for vision-language models.
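LMI deployments are typically configured through a `serving.properties` file shipped alongside the model. The fragment below is a minimal sketch for a vLLM-backed deployment; the model ID and the property names beyond `option.model_id` are assumptions to verify against the SageMaker LMI v15 documentation:

```properties
# serving.properties -- minimal sketch for an LMI container with the vLLM
# backend. Property names are assumptions; check the LMI v15 docs.
engine=Python
option.model_id=meta-llama/Llama-4-Scout-17B-16E-Instruct
option.rolling_batch=vllm
option.async_mode=true
option.tensor_parallel_degree=8
```

The container reads this file at startup, pulls the model weights, and exposes an inference endpoint, so switching models is usually just a change to `option.model_id`.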

AWS is also launching new Amazon EC2 Graviton4-based instances with NVMe SSD storage. These compute optimized (C8gd), general purpose (M8gd), and memory optimized (R8gd) instances offer up to 30% better compute performance and 40% higher performance for I/O intensive database workloads compared to Graviton3-based instances. They also include larger instance sizes with up to 3x more vCPUs, memory, and local storage. These instances are ideal for storage-intensive Linux-based workloads, including containerized and microservices-based applications built using Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Container Registry (Amazon ECR), Kubernetes, and Docker, as well as applications written in popular programming languages such as C/C++, Rust, Go, Java, Python, .NET Core, Node.js, Ruby, and PHP.

Recommended read:
References :
  • venturebeat.com: Amazon’s SWE-PolyBench just exposed the dirty secret about your AI coding assistant
  • www.marktechpost.com: AWS Introduces SWE-PolyBench: A New Open-Source Multilingual Benchmark for Evaluating AI Coding Agents

@www.microsoft.com //
Microsoft Research is delving into the transformative potential of AI as "Tools for Thought," aiming to redefine AI's role in supporting human cognition. At the upcoming CHI 2025 conference, researchers will present four new research papers and co-host a workshop exploring this intersection of AI and human thinking. The research includes a study on how AI is changing the way we think and work, along with three prototype systems designed to support different cognitive tasks, with the goal of reimagining AI's role in human thinking.

As AI tools become increasingly capable, Microsoft has unveiled new AI agents designed to enhance productivity in various domains, helping people with everyday tasks like research and cybersecurity. The "Researcher" agent can tackle complex research tasks by analyzing work data, emails, meetings, files, chats, and web information to deliver expertise on demand. Meanwhile, the "Analyst" agent functions as a virtual data scientist, capable of processing raw data from multiple spreadsheets to forecast demand or visualize customer purchasing patterns.

Johnson & Johnson has reportedly found that only a small percentage, between 10% and 15%, of AI use cases deliver the vast majority (80%) of the value. After encouraging employees to experiment with AI and tracking the results of nearly 900 use cases over about three years, the company is now focusing resources on the highest-value projects. These high-value applications include a generative AI copilot for sales representatives and an internal chatbot answering employee questions. Other AI tools being developed include one for drug discovery and another for identifying and mitigating supply chain risks.


@www.searchenginejournal.com //
Recent advancements show that large language models (LLMs) are expanding past basic writing and are now being used to generate functional code. These models can produce full scripts, browser extensions, and web applications from natural language prompts, opening up opportunities for those without coding skills. Marketers and other professionals can now automate repetitive tasks, build custom tools, and experiment with technical solutions more easily than ever before. This unlocks a new level of efficiency, allowing individuals to create one-off tools for tasks that previously seemed too time-consuming to justify automation.

Advances in AI are also focusing on improving the accuracy of code generated by LLMs. Researchers at MIT have developed a new approach that guides LLMs to generate code that adheres to the rules of the specific programming language. This method allows the LLM to prioritize outputs that are likely to be valid and accurate, improving computational efficiency. This new architecture has enabled smaller LLMs to outperform larger models in generating accurate outputs in fields like molecular biology and robotics. The goal is to allow non-experts to control AI-generated content by ensuring that the outputs are both useful and correct, potentially improving programming assistants, AI-powered data analysis, and scientific discovery tools.
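The core idea of steering generation toward strings that can still become valid can be illustrated with a toy constrained sampler: at each step, any next token that would make the partial output invalid under a small grammar is masked out. This is a deliberately simplified stand-in for the MIT group's probabilistic method, not a reproduction of it:

```python
# Toy grammar-constrained generation: at each step, only tokens that keep
# the partial string a valid *prefix* of some grammatical sentence survive.
# A greatly simplified stand-in for the probabilistic steering in the paper.

GRAMMAR = {"ab", "ac", "abc"}  # the tiny language of valid outputs
VOCAB = ["a", "b", "c", "x"]

def is_valid_prefix(s):
    return any(word.startswith(s) for word in GRAMMAR)

def generate(scores):
    """scores: dict token -> unnormalized model preference."""
    out = ""
    while out not in GRAMMAR:
        # Mask tokens that would rule out every grammatical completion.
        allowed = [t for t in VOCAB if is_valid_prefix(out + t)]
        # Greedily pick the model's favorite among the allowed tokens.
        out += max(allowed, key=lambda t: scores[t])
    return out

# A model that strongly prefers "x" is still forced onto a valid string:
result = generate({"a": 1.0, "b": 3.0, "c": 2.0, "x": 9.0})
```

The same masking idea is what lets a small model with a grammar beat a larger unconstrained one on validity: invalid outputs are simply unreachable.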

New tools are emerging to aid developers, such as Amazon Q Developer and OpenAI Codex CLI. Amazon Q Developer is an AI-powered coding assistant that integrates into IDEs like Visual Studio Code, providing context-aware code recommendations, snippets, and unit test suggestions. The service uses advanced generative AI to understand the context of a project and offers features like intelligent code generation, integrated testing and debugging, seamless documentation and effective code review and refactoring. Similarly, OpenAI Codex CLI is a terminal-based AI assistant that allows developers to interact with OpenAI models using natural language to read, modify, and run code. These tools aim to boost coding productivity by assisting with tasks like bug fixing, refactoring, and prototyping.

Recommended read:
References :
  • hackernoon.com: Amazon Q Developer: The Future of AI-Enhanced Coding Productivity
  • Search Engine Journal: LLMs That Code: Why Marketers Should Care (Even If You’ve Never Touched An IDE)
  • www.marktechpost.com: Anthropic Releases a Comprehensive Guide to Building Coding Agents with Claude Code

@www.quantamagazine.org //
Recent developments in the field of large language models (LLMs) are focusing on enhancing reasoning capabilities through reinforcement learning. This approach aims to improve model accuracy and problem-solving, particularly in challenging tasks. While some of the latest LLMs, such as GPT-4.5 and Llama 4, were not explicitly trained using reinforcement learning for reasoning, the release of OpenAI's o3 model shows that strategically investing in compute and tailored reinforcement learning methods can yield significant improvements.

Competitors like xAI and Anthropic have also been incorporating more reasoning features into their models, such as the "thinking" or "extended thinking" button in xAI Grok and Anthropic Claude. The somewhat muted response to GPT-4.5 and Llama 4, which lack explicit reasoning training, suggests that simply scaling model size and data may be reaching its limits. The field is now exploring ways to make language models work better, including the use of reinforcement learning.

One way researchers are making language models work better is to sidestep the requirement for language as an intermediary step. Language isn't always necessary, the argument goes, and having to turn ideas into language can slow down the thought process. LLMs process information in mathematical spaces within deep neural networks, yet they must often leave this latent space for the much more constrained one of individual words. Recent papers suggest that deep neural networks can allow language models to continue thinking in these mathematical spaces before producing any text.
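This "thinking before words" idea can be sketched as a recurrence in the model's hidden space: instead of decoding a token and re-embedding it at every step, the last hidden state is fed straight back in for a few silent steps before any text is produced. The NumPy toy below illustrates only the pattern, not any specific model's implementation:

```python
import numpy as np

# Toy latent-space "thinking": iterate the hidden state through the network
# a few times without decoding to a word, then decode once at the end.
# Illustrative only -- real systems feed transformer hidden states back
# as input embeddings rather than applying a single weight matrix.

rng = np.random.default_rng(0)
d = 16                                       # hidden dimension
W = rng.normal(size=(d, d)) / np.sqrt(d)     # stand-in for the network
vocab = rng.normal(size=(5, d))              # 5 word embeddings

def step(h):
    return np.tanh(W @ h)                    # one pass through the "model"

h = rng.normal(size=d)                       # state after reading the prompt
for _ in range(4):                           # four silent latent steps
    h = step(h)

token = int(np.argmax(vocab @ h))            # decode to a word only at the end
```

The point of the pattern is that the four intermediate states never have to be snapped to the nearest word, so no information is lost to the vocabulary bottleneck until the final decode.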

Recommended read:
References :
  • pub.towardsai.net: The article discusses the application of reinforcement learning to improve the reasoning abilities of LLMs.
  • Sebastian Raschka, PhD: This blog post delves into the current state of reinforcement learning in enhancing LLM reasoning capabilities, highlighting recent advancements and future expectations.
  • Quanta Magazine: This article explores the use of reinforcement learning to make Language Models work better, especially in challenging reasoning tasks.

Chris McKay@Maginative //
OpenAI has unveiled its latest advancements in AI technology with the launch of the GPT-4.1 family of models. This new suite includes GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, all accessible via API, and represents a significant leap forward in coding capabilities, instruction following, and context processing. Notably, these models feature an expanded context window of up to 1 million tokens, enabling them to handle larger codebases and extensive documents. The GPT-4.1 family aims to cater to a wide range of developer needs by offering different performance and cost profiles, with the goal of creating more advanced and efficient AI applications.

These models demonstrate superior results on various benchmarks compared to their predecessors, GPT-4o and GPT-4o mini. Specifically, GPT-4.1 scores 54.6% on the SWE-bench Verified coding test and 38.3% on Scale's MultiChallenge instruction-following benchmark, substantial improvements over GPT-4o. Each model is designed with a specific purpose in mind: GPT-4.1 excels in high-level cognitive tasks like software development and research, GPT-4.1 mini offers balanced performance with reduced latency and cost, while GPT-4.1 nano provides the quickest and most affordable option for tasks such as classification. All three models have knowledge updated through June 2024.
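For developers, adopting the new family is essentially a model-name change in the Chat Completions API. A minimal request body might look like the following; the model identifiers match OpenAI's announcement, but check the current API reference for parameter details:

```python
# Minimal Chat Completions request body targeting the GPT-4.1 family.
# With the official client this would be submitted roughly as:
#   client.chat.completions.create(**request)

request = {
    "model": "gpt-4.1-mini",   # or "gpt-4.1", "gpt-4.1-nano"
    "messages": [
        {"role": "system", "content": "You are a concise code reviewer."},
        {"role": "user", "content": "Review this diff for bugs: ..."},
    ],
    # The 4.1 family accepts up to 1M context tokens; cap the reply length.
    "max_tokens": 512,
}
```

Choosing among the three variants is then a pure cost/latency trade-off, since the request shape is identical across them.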

The introduction of the GPT-4.1 family also brings about changes in OpenAI's existing model offerings. The GPT-4.5 Preview model in the API is set to be deprecated on July 14, 2025, due to GPT-4.1 offering comparable or better utility at a lower cost. In terms of pricing, GPT-4.1 is 26% less expensive than GPT-4o for median queries, along with increased prompt caching discounts. Early testers have already noted positive outcomes, with improvements in code review suggestions and data retrieval from large documents. OpenAI emphasizes that many underlying improvements are being integrated into the current GPT-4o version within ChatGPT.

Recommended read:
References :
  • TestingCatalog: OpenAI debuts GPT-4.1 family offering 1M token context window
  • venturebeat.com: OpenAI slashes prices for GPT-4.1, igniting AI price war among tech giants
  • Interconnects: OpenAI's latest models optimizing on intelligence per dollar.
  • THE DECODER: OpenAI launches GPT-4.1: New model family to improve agents, long contexts and coding
  • Simon Willison's Weblog: OpenAI released three new models this morning: GPT-4.1, GPT-4.1 mini and GPT-4.1 nano. These are API-only models right now, not available through the ChatGPT interface (though you can try them out in OpenAI's API playground).
  • Analytics Vidhya: All About OpenAI’s Latest GPT 4.1 Family
  • pub.towardsai.net: TAI #148: New API Models from OpenAI (4.1) & xAI (grok-3); Exploring Deep Research’s Scaling Laws
  • Towards AI: The GPT-4.1 models, accessible via API, provide a significant advancement in AI capabilities and offer an intriguing alternative for developers looking for high performance at lower cost.
  • Towards AI: TAI #148: New API Models from OpenAI (4.1) & xAI (grok-3); Exploring Deep Research’s Scaling Laws
  • venturebeat.com: OpenAI’s new GPT-4.1 models can process a million tokens and solve coding problems better than ever
  • techstrong.ai: Just days after announcing its plans to retire GPT-4 in ChatGPT, OpenAI on Monday launched a new set of flagship models named GPT-4.1. The release, which The Verge anticipated in an article last week, included the standard version GPT-4.1 model, along with two smaller models — GPT-4.1 mini, and GPT-4.1 nano which OpenAI touts as […]
  • the-decoder.com: OpenAI launches GPT-4.1: New model family to improve agents, long contexts and coding
  • www.tomsguide.com: OpenAI's latest model is here but it isn't GPT-5, it's 4.1, a model all about coding
  • shellypalmer.com: Shelly Palmer discusses the launch of GPT-4.1 and its improved capabilities.
  • felloai.com: OpenAI Quietly Launched GPT‑4.1 – A GPT-4o Successor That’s Crushing Benchmarks
  • thezvi.wordpress.com: The Zvi discusses the mini upgrade from GPT-4.1.
  • bdtechtalks.com: GPT-4.1: OpenAI’s most confusing model
  • Fello AI: OpenAI Quietly Launched GPT‑4.1 – A GPT-4o Successor That’s Crushing Benchmarks
  • www.eweek.com: eWeek reports on the pros and cons of OpenAI's new GPT-4.1 model.
  • Last Week in AI: Last Week in AI discusses the new GPT 4.1 model release by OpenAI
  • Fello AI: OpenAI’s language models have become part of everyday life for millions of people—whether you’re using ChatGPT to get quick answers, brainstorm ideas, or even generate code. With each new version, the models get faster, smarter, and more capable.
  • thezvi.wordpress.com: Yesterday’s news alert, nevertheless: The verdict is in. GPT-4.1-Mini in particular is an excellent practical model, offering strong performance at a good price. The full GPT-4.1 is an upgrade to OpenAI’s more expensive API offerings, it is modestly better but …
  • composio.dev: GPT-4.1 vs. Deepseek v3 vs. Sonnet 3.7 vs. GPT-4.5
  • hackernoon.com: OpenAI announced GPT-4.1, featuring a staggering 1M-token context window and perfect needle-in-a-haystack accuracy.
  • Shelly Palmer: OpenAI has launched GPT-4.1, along with GPT-4.1 Mini and GPT-4.1 Nano. These models are for developers and will not show up in your ChatGPT model picker.
  • eWEEK: OpenAI is releasing new language models, GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.