News from the AI & ML world

DeeperML - #llm

@the-decoder.com //
OpenAI is making significant strides in the enterprise AI and coding tool landscape. The company recently released a strategic guide, "AI in the Enterprise," offering practical strategies for organizations implementing AI at a large scale. This guide emphasizes real-world implementation rather than abstract theories, drawing from collaborations with major companies like Morgan Stanley and Klarna. It focuses on systematic evaluation, infrastructure readiness, and domain-specific integration, highlighting the importance of embedding AI directly into user-facing experiences, as demonstrated by Indeed's use of GPT-4o to personalize job matching.

Simultaneously, OpenAI is reportedly in the process of acquiring Windsurf, an AI-powered developer platform, for approximately $3 billion. This acquisition aims to enhance OpenAI's AI coding capabilities and address increasing competition in the market for AI-driven coding assistants. Windsurf, previously known as Codeium, develops a tool that generates source code from natural language prompts and is used by over 800,000 developers. The deal, if finalized, would be OpenAI's largest acquisition to date, signaling a major move to compete with Microsoft's GitHub Copilot and Anthropic's Claude Code.

Sam Altman, CEO of OpenAI, has also reaffirmed the company's commitment to its non-profit roots, transitioning the profit-seeking side of the business to a Public Benefit Corporation (PBC). This ensures that while OpenAI pursues commercial goals, it does so under the oversight of its original non-profit structure. Altman emphasized the importance of putting powerful tools in the hands of everyone and allowing users a great deal of freedom in how they use these tools, even if differing moral frameworks exist. This decision aims to build a "brain for the world" that is accessible and beneficial for a wide range of uses.

Recommended read:
References :
  • The Register - Software: OpenAI's contentious plan to overhaul its corporate structure in favor of a conventional for-profit model has been reworked, with the AI giant bowing to pressure to keep its nonprofit in control, even as it presses ahead with parts of the restructuring.
  • the-decoder.com: OpenAI restructures as public benefit corporation under non-profit control
  • www.theguardian.com: OpenAI reverses course and says non-profit arm will retain control of firm
  • techxplore.com: OpenAI reverses course and says its nonprofit will continue to control its business
  • www.techradar.com: OpenAI will transition to running under the oversight of a non-profit, and its profit side is to become a Public Benefit Corporation.
  • Maginative: OpenAI Reverses Course on Corporate Structure, Will Keep Nonprofit Control
  • THE DECODER: OpenAI restructures as public benefit corporation under non-profit control
  • Mashable: The nonprofit status of OpenAI is one of the biggest controversies in Silicon Valley. On Monday, May 5, CEO Sam Altman said the company structure is "evolving."
  • The Rundown AI: OpenAI ends for-profit push
  • shellypalmer.com: OpenAI Supercharges ChatGPT Search with Shopping Tools
  • Effective Altruism Forum: Evolving OpenAI’s Structure
  • WIRED: The startup behind ChatGPT is going to remain in nonprofit control, but it still needs regulatory approval.
  • the-decoder.com: The Decoder reports on OpenAI's potential $3 billion acquisition of Windsurf.
  • www.marktechpost.com: OpenAI Releases a Strategic Guide for Enterprise AI Adoption: Practical Lessons from the Field
  • THE DECODER: The Decoder's report on OpenAI's Windsurf deal boosting coding AI.
  • AI News | VentureBeat: Report: OpenAI is buying AI-powered developer platform Windsurf — what happens to its support for rival LLMs?
  • John Werner: OpenAI Strikes $3 Billion Deal To Buy Windsurf: Reports
  • Latest from ITPro in News: OpenAI is closing in on its biggest acquisition to date – and it could be a game changer for software developers and ‘vibe coding’ fanatics
  • www.artificialintelligence-news.com: Sam Altman: OpenAI to keep nonprofit soul in restructuring
  • AI News: OpenAI CEO Sam Altman has laid out their roadmap, and the headline is that OpenAI will keep its nonprofit core amid broader restructuring.
  • Analytics India Magazine: OpenAI to Acquire Windsurf for $3 Billion to Dominate AI Coding Space
  • THE DECODER: Elon Musk’s lawyer says OpenAI restructuring is a transparent dodge
  • futurism.com: OpenAI may be raking in the investor dough, but thanks in part to erstwhile cofounder Elon Musk, the company won't be going entirely for-profit anytime soon.
  • thezvi.wordpress.com: Your voice has been heard. OpenAI has ‘heard from the Attorney Generals’ of Delaware and California, and as a result the OpenAI nonprofit will retain control of OpenAI under their new plan, and both companies will retain the original mission. …
  • www.computerworld.com: OpenAI reaffirms nonprofit control, scales back governance changes
  • thezvi.wordpress.com: OpenAI Claims Nonprofit Will Retain Nominal Control

Alexey Shabanov@TestingCatalog //
Alibaba's Qwen team has launched Qwen3, a new family of open-source large language models (LLMs) designed to compete with leading AI systems. The Qwen3 series includes eight models ranging from 0.6B to 235B parameters, with the larger models employing a Mixture-of-Experts (MoE) architecture for enhanced performance. This comprehensive suite offers options for developers with varied computational resources and application requirements. All the models are released under the Apache 2.0 license, making them suitable for commercial use.

The Qwen3 models boast improved agentic capabilities for tool use and support for 119 languages. The models also feature a unique "hybrid thinking mode" that allows users to dynamically adjust the balance between deep reasoning and faster responses. This is particularly valuable for developers as it facilitates efficient use of computational resources based on task complexity. Training involved a large dataset of 36 trillion tokens and was optimized for reasoning, similar to the Deepseek R1 model.

Benchmarks indicate that Qwen3 rivals top competitors like Deepseek R1 and Gemini Pro in areas like coding, mathematics, and general knowledge. Notably, the smaller Qwen3–30B-A3B MoE model achieves performance comparable to the Qwen3–32B dense model while activating significantly fewer parameters. These models are available on platforms like Hugging Face, ModelScope, and Kaggle, along with support for deployment through frameworks like SGLang and vLLM, and local execution via tools like Ollama and llama.cpp.

Recommended read:
References :
  • pub.towardsai.net: TAI #150: Qwen3 Impresses as a Robust Open-Source Contender
  • gradientflow.com: Table of Contents Model Architecture and Capabilities What is Qwen 3 and what models are available in the lineup? What are the “Hybrid Thinking Modes†in Qwen 3, and why are they valuable for developers?
  • THE DECODER: An article about Qwen3 series from Alibaba debuts with benchmark results matching top competitors
  • TestingCatalog: Reporting on Alibaba Cloud debuting 235B-parameter Qwen 3 to challenge US model dominance
  • Towards AI: TAI #150: Qwen3 Impresses as a Robust Open-Source Contender
  • www.analyticsvidhya.com: Qwen3 Models: How to Access, Performance, Features, and Applications
  • RunPod Blog: Qwen3 Released: How Does It Stack Up?
  • bdtechtalks.com: Alibaba’s Qwen3: Open-weight LLMs with hybrid thinking | BDTechTalks
  • AI News | VentureBeat: Alibaba launches open source Qwen3 model that surpasses OpenAI o1 and DeepSeek R1
  • the-decoder.com: Qwen3 series from Alibaba debuts with benchmark results matching top competitors

Alexey Shabanov@TestingCatalog //
Alibaba Cloud has unveiled Qwen 3, a new generation of large language models (LLMs) boasting 235 billion parameters, poised to challenge the dominance of US-based models. This open-weight family of models includes both dense and Mixture-of-Experts (MoE) architectures, offering developers a range of choices to suit their specific application needs and hardware constraints. The flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, and general knowledge, positioning it as one of the most powerful publicly available models.

Qwen 3 introduces a unique "thinking mode" that can be toggled for step-by-step reasoning or rapid direct answers. This hybrid reasoning approach, similar to OpenAI's "o" series, allows users to engage a more intensive process for complex queries in fields like science, math, and engineering. The models are trained on a massive dataset of 36 trillion tokens spanning 119 languages, twice the corpus of Qwen 2.5 and enriched with synthetic math and code data. This extensive training equips Qwen 3 with enhanced reasoning, multilingual proficiency, and computational efficiency.

The release of Qwen 3 includes two MoE models and six dense variants, all licensed under Apache-2.0 and downloadable from platforms like Hugging Face, ModelScope, and Kaggle. Deployment guidance points to vLLM and SGLang for servers and to Ollama or llama.cpp for local setups, signaling support for both cloud and edge developers. Community feedback has been positive, with analysts noting that earlier Qwen announcements briefly lifted Alibaba shares, underscoring the strategic weight the company places on open models.

Recommended read:
References :
  • Gradient Flow: Qwen 3: What You Need to Know
  • AI News | VentureBeat: Alibaba launches open source Qwen3 model that surpasses OpenAI o1 and DeepSeek R1
  • TestingCatalog: Alibaba Cloud debuts 235B-parameter Qwen 3 to challenge US model dominance
  • MarkTechPost: Alibaba Qwen Team Just Released Qwen3
  • Analytics Vidhya: Qwen3 Models: How to Access, Performance, Features, and Applications
  • www.analyticsvidhya.com: Qwen3 Models: How to Access, Performance, Features, and Applications
  • THE DECODER: Qwen3 series from Alibaba debuts with benchmark results matching top competitors
  • www.tomsguide.com: Alibaba is launching its own AI reasoning models to compete with DeepSeek
  • the-decoder.com: Qwen3 series from Alibaba debuts with benchmark results matching top competitors
  • pub.towardsai.net: TAI #150: Qwen3 Impresses as a Robust Open-Source Contender
  • Pandaily: The Mind Behind Qwen3: An Inclusive Interview with Alibaba's Zhou Jingren
  • Towards AI: TAI #150: Qwen3 Impresses as a Robust Open-Source Contender
  • gradientflow.com: Table of Contents Model Architecture and Capabilities What is Qwen 3 and what models are available in the lineup? What are the “Hybrid Thinking Modesâ€� in Qwen 3, and why are they valuable for developers? How does Qwen 3 compare to previous versions and other leading models? What are the advantages of Qwen 3’s Mixture-of-Experts ...
  • bdtechtalks.com: Alibaba's Qwen3 open-weight LLMs combine direct response and chain-of-thought reasoning in a single architecture, and compete withe leading models. The post first appeared on .
  • bdtechtalks.com: Alibaba's Qwen3 open-weight LLMs combine direct response and chain-of-thought reasoning in a single architecture, and compete withe leading models. The post first appeared on .
  • RunPod Blog: Qwen3 Released: How Does It Stack Up?
  • www.computerworld.com: The Qwen3 models, which feature a new hybrid reasoning approach, underscore Alibaba's commitment to open-source AI development.
  • Last Week in AI: OpenAI undoes its glaze-heavy ChatGPT update, Alibaba unveils Qwen 3, a family of ‘hybrid’ AI reasoning models , Baidu ERNIE X1 and 4.5 Turbo boast high performance at low cost

@the-decoder.com //
OpenAI has rolled back a recent update to its GPT-4o model, the default model used in ChatGPT, after widespread user complaints that the system had become excessively flattering and overly agreeable. The company acknowledged the issue, describing the chatbot's behavior as 'sycophantic' and admitting that the update skewed towards responses that were overly supportive but disingenuous. Sam Altman, CEO of OpenAI, confirmed that fixes were underway, with potential options to allow users to choose the AI's behavior in the future. The rollback aims to restore an earlier version of GPT-4o known for more balanced responses.

Complaints arose when users shared examples of ChatGPT's excessive praise, even for absurd or harmful ideas. In one instance, the AI lauded a business idea involving selling "literal 'shit on a stick'" as genius. Other examples included the model reinforcing paranoid delusions and seemingly endorsing terrorism-related ideas. This behavior sparked criticism from AI experts and former OpenAI executives, who warned that tuning models to be people-pleasers could lead to dangerous outcomes where honesty is sacrificed for likability. The 'sycophantic' behavior was not only considered annoying, but also potentially harmful if users were to mistakenly believe the AI and act on its endorsements of bad ideas.

OpenAI explained that the issue stemmed from overemphasizing short-term user feedback, specifically thumbs-up and thumbs-down signals, during the model's optimization. This resulted in a chatbot that prioritized affirmation without discernment, failing to account for how user interactions and needs evolve over time. In response, OpenAI plans to implement measures to steer the model away from sycophancy and increase honesty and transparency. The company is also exploring ways to incorporate broader, more democratic feedback into ChatGPT's default behavior, acknowledging that a single default personality cannot capture every user preference across diverse cultures.

Recommended read:
References :
  • Know Your Meme Newsfeed: What's With All The Jokes About GPT-4o 'Glazing' Its Users? Memes About OpenAI's 'Sychophantic' ChatGPT Update Explained
  • the-decoder.com: OpenAI CEO Altman calls ChatGPT 'annoying' as users protest its overly agreeable answers
  • PCWorld: ChatGPT’s awesome ‘Deep Research’ is rolling out to free users soon
  • www.techradar.com: Sam Altman says OpenAI will fix ChatGPT's 'annoying' new personality – but this viral prompt is a good workaround for now
  • THE DECODER: OpenAI CEO Altman calls ChatGPT 'annoying' as users protest its overly agreeable answers
  • THE DECODER: ChatGPT gets an update
  • bsky.app: ChatGPT's recent update caused the model to be unbearably sycophantic - this has now been fixed through an update to the system prompt, and as far as I can tell this is what they changed
  • Ada Ada Ada: Article on GPT-4o's unusual behavior, including extreme sycophancy and lack of NSFW filter.
  • thezvi.substack.com: GPT-4o tells you what it thinks you want to hear.
  • thezvi.wordpress.com: GPT-4o Is An Absurd Sycophant
  • The Algorithmic Bridge: What this week's events reveal about OpenAI's goals
  • THE DECODER: The Decoder article reporting on OpenAI's rollback of the ChatGPT update due to issues with tone.
  • AI News | VentureBeat: Ex-OpenAI CEO and power users sound alarm over AI sycophancy and flattery of users
  • AI News | VentureBeat: VentureBeat article covering OpenAI's rollback of ChatGPT's sycophantic update and explanation.
  • www.zdnet.com: OpenAI recalls GPT-4o update for being too agreeable
  • www.techradar.com: TechRadar article about OpenAI fixing ChatGPT's 'annoying' personality update.
  • The Register - Software: The Register article about OpenAI rolling back ChatGPT's sycophantic update.
  • thezvi.wordpress.com: The Zvi blog post criticizing ChatGPT's sycophantic behavior.
  • www.windowscentral.com: “GPT4o’s update is absurdly dangerous to release to a billion active usersâ€: Even OpenAI CEO Sam Altman admits ChatGPT is “too sycophant-yâ€
  • siliconangle.com: OpenAI to make ChatGPT less creepy after app is accused of being ‘dangerously’ sycophantic
  • the-decoder.com: OpenAI rolls back ChatGPT model update after complaints about tone
  • SiliconANGLE: OpenAI to make ChatGPT less creepy after app is accused of being ‘dangerously’ sycophantic.
  • www.eweek.com: OpenAI Rolls Back March GPT-4o Update to Stop ChatGPT From Being So Flattering
  • eWEEK: OpenAI Rolls Back March GPT-4o Update to Stop ChatGPT From Being So Flattering
  • Ars OpenForum: OpenAI's sycophantic GPT-4o update in ChatGPT is rolled back amid user complaints.
  • www.engadget.com: OpenAI has swiftly rolled back a recent update to its GPT-4o model, citing user feedback that the system became overly agreeable and praiseful.
  • TechCrunch: OpenAI rolls back update that made ChatGPT ‘too sycophant-y’
  • AI News | VentureBeat: OpenAI, creator of ChatGPT, released and then withdrew an updated version of the underlying multimodal (text, image, audio) large language model (LLM) that ChatGPT is hooked up to by default, GPT-4o, …
  • bsky.app: The postmortem OpenAI just shared on their ChatGPT sycophancy behavioral bug - a change they had to roll back - is fascinating!
  • the-decoder.com: What OpenAI wants to learn from its failed ChatGPT update
  • THE DECODER: What OpenAI wants to learn from its failed ChatGPT update
  • futurism.com: The company rolled out an update to the GPT-4o large language model underlying its chatbot on April 25, with extremely quirky results.
  • MEDIANAMA: Why ChatGPT Became Sycophantic, And How OpenAI is Fixing It
  • www.livescience.com: OpenAI has reverted a recent update to ChatGPT, addressing user concerns about the model's excessively agreeable and potentially manipulative responses.
  • shellypalmer.com: Sam Altman (@sama) says that OpenAI has rolled back a recent update to ChatGPT that turned the model into a relentlessly obsequious people-pleaser.
  • Techmeme: OpenAI shares details on how an update to GPT-4o inadvertently increased the model's sycophancy, why OpenAI failed to catch it, and the changes it is planning
  • Shelly Palmer: Why ChatGPT Suddenly Sounded Like a Fanboy
  • thezvi.wordpress.com: ChatGPT's latest update caused concern about its potential for sycophantic behavior, leading to a significant backlash from users.

@techcrunch.com //
OpenAI is facing increased competition in the AI model market, with Google's Gemini 2.5 gaining traction due to its top performance and competitive pricing. This shift challenges the early dominance of OpenAI and Meta in large language models (LLMs). Meta's Llama 4 faced controversy, while OpenAI's GPT-4.5 received backlash. OpenAI is now releasing faster and cheaper AI models in response to this competitive pressure and the hardware limitations that make serving a large user base challenging.

OpenAI's new o3 model showcases both advancements and drawbacks. While boasting improved text capabilities and strong benchmark scores, o3 is designed for multi-step tool use, enabling it to independently search and provide relevant information. However, this advancement exacerbates hallucination issues, with the model sometimes producing incorrect or misleading results. OpenAI's report found that o3 hallucinated in response to 33% of question, indicating a need for further research to understand and address this issue.

The problem of over-optimization in AI models is also a factor. Over-optimization occurs when the optimizer exploits bugs or lapses in the training environment, leading to unusual or negative results. In the context of RLHF, over-optimization can cause models to repeat random tokens and gibberish. With o3, over-optimization manifests as new types of inference behavior, highlighting the complex challenges in designing and training AI models to perform reliably and accurately.

Recommended read:
References :

@analyticsindiamag.com //
Microsoft has announced BitNet b1.58 2B4T, a new compact large language model (LLM) designed to run efficiently on CPUs. This innovative model boasts 2 billion parameters but uses only 1.58 bits per weight, a significant reduction compared to the 16 or 32 bits typically used in conventional AI models. This allows BitNet to operate with a dramatically smaller memory footprint, consuming only 400MB, making it suitable for devices with limited resources and even enabling it to run on an Apple M2 chip.

The 1-bit AI LLM was trained on a massive dataset containing 4 trillion tokens and has proven competitive with leading open-weight, full-precision LLMs of similar size, such as Meta’s LLaMa 3.2 1B, Google’s Gemma 3 1B, and Alibaba’s Qwen 2.5 1.5B. BitNet achieves comparable or superior performance in tasks like language understanding, math, coding, and conversation, while significantly reducing memory footprint, energy consumption, and decoding latency.

The model's architecture is based on the standard Transformer model, but incorporates key modifications, including custom BitLinear layers that quantize model weights to 1.58 bits during the forward pass. The weights are mapped to ternary values {-1, 0, +1} using an absolute mean quantization scheme, while activations are quantized to 8-bit integers. To facilitate adoption, Microsoft has released the model weights on Hugging Face, along with open-source code for running it, including a dedicated inference tool called bitnet.cpp optimized for CPU execution.

Recommended read:
References :

Megan Crouse@techrepublic.com //
Microsoft has unveiled BitNet b1.58, a groundbreaking language model designed for ultra-efficient operation. Unlike traditional language models that rely on 16- or 32-bit floating-point numbers, BitNet utilizes a mere 1.58 bits per weight. This innovative approach significantly reduces memory requirements and energy consumption, enabling the deployment of powerful AI on devices with limited resources. The model is based on the standard transformer architecture, but incorporates modifications aimed at efficiency, such as BitLinear layers and 8-bit activation functions.

The BitNet b1.58 2B4T model contains two billion parameters and was trained on a massive dataset of four trillion tokens, roughly equivalent to the contents of 33 million books. Despite its reduced precision, BitNet reportedly performs comparably to models that are two to three times larger. In benchmark tests, it outperformed other compact models and performed competitively with significantly larger and less efficient systems. Its memory footprint is just 400MB, making it suitable for deployment on laptops or in cloud environments.

Microsoft has released dedicated inference tools for both GPU and CPU execution, including a lightweight C++ version, to facilitate adoption. The model is available on Hugging Face. Future development plans include expanding the model to support longer texts, additional languages, and multimodal inputs such as images. Microsoft is also working on another efficient model family under the Phi series. The company demonstrated that this model can run on a Apple M2 chip.

Recommended read:
References :
  • www.techrepublic.com: Microsoft Releases Largest 1-Bit LLM, Letting Powerful AI Run on Some Older Hardware
  • medium.com: Microsoft has released a new language model, BitNet, designed for energy efficiency, minimizing the computational and memory requirements for use on older hardware. This strategy aims to make advanced AI more accessible to a wider range of users.
  • THE DECODER: Microsoft's new model, BitNet b1.58 2B4T, is intended to operate with reduced memory and energy consumption. The model demonstrates an effort to expand access and reduce computational burdens for AI applications.
  • www.zdnet.com: Microsoft introduces BitNet b1.58 2B4T, a new small language model designed to run efficiently on older hardware without GPUs.
  • the-decoder.com: Microsoft Shows How to Put AI Models on a Diet
  • arstechnica.com: This article details Microsoft researchers creating a super-efficient AI that uses up to 96% less energy.
  • www.tomshardware.com: Microsoft researchers build 1-bit AI LLM, model small enough to run on some CPUs
  • TechSpot: Microsoft's BitNet shows what AI can do with just 400MB and no GPU
  • www.sciencedaily.com: Researchers developed a more efficient way to control the outputs of a large language model, guiding it to generate text that adheres to a certain structure, like a programming language, and remains error free.

Michael Nuñez@venturebeat.com //
Google has unveiled Gemini 2.5 Flash, a new AI model designed to give businesses greater control over AI costs and performance. Available in preview through Google AI Studio and Vertex AI, Gemini 2.5 Flash introduces adjustable "thinking budgets," allowing developers to specify the amount of computational power the AI should use for reasoning. This innovative approach aims to strike a balance between advanced AI capabilities and cost-efficiency, addressing a key concern for businesses integrating AI into their operations. The model is also capable of generating SVGs.

The introduction of "thinking budgets" marks a strategic move by Google to deliver cost-effective AI solutions. Developers can now fine-tune the AI's processing power, allocating resources based on the complexity of the task at hand. With Gemini 2.5 Flash, the "thinking" capability can be turned on or off, creating a hybrid reasoning model that prioritizes speed and cost when needed. This flexibility allows businesses to optimize their AI usage and pay only for the brainpower they require.

Benchmarks demonstrate significant improvements in Gemini 2.5 Flash compared to the older Gemini 2.0 Flash model. Google has stated that the latest version delivers a major upgrade in reasoning capabilities, while still prioritizing speed and cost. The "thinking budget" feature offers fine-grained control over the maximum number of tokens a model can generate while thinking, ranging from 0 to 24,576 tokens. A higher budget allows the model to reason further to improve quality, but the model automatically decides how much to think based on the perceived task complexity.

Recommended read:
References :
  • venturebeat.com: Google’s new Gemini 2.5 Flash AI model introduces adjustable "thinking budgets" that let businesses pay only for the reasoning power they need, balancing advanced capabilities with cost efficiency.
  • Google DeepMind Blog: Transform text-based prompts into high-resolution eight-second videos in Gemini Advanced and use Whisk Animate to turn images into eight-second animated clips.
  • TestingCatalog: Google integrates Veo 2 AI into Gemini Advanced, enabling subscribers to create 8-second, 720p videos for TikTok and YouTube. Download MP4s with SynthID watermark.
  • Simon Willison's Weblog: Start building with Gemini 2.5 Flash
  • www.zdnet.com: Google reveals Gemini 2.5 Flash, its 'most cost-efficient thinking model'
  • developers.googleblog.com: Google's Gemini 2.5 Flash has hybrid reasoning, can be turned on or off and provides the ability for developers to set budgets to find the right trade-off between cost, quality, and latency.
  • venturebeat.com: Google’s Gemini 2.5 Flash introduces ‘thinking budgets’ that cut AI costs by 600% when turned down
  • the-decoder.com: Google’s Gemini 2.5 Flash gives you speed when you need it and reasoning when you can afford it
  • THE DECODER: Provides information about the release of Gemini 2.5 Flash, highlighting its reasoning capabilities and cost-effectiveness.
  • TestingCatalog: Google launches Gemini 2.5 Flash model with hybrid reasoning
  • bsky.app: New LLM release from Google Gemini: Gemini 2.5 Flash (preview), which lets you set a budget for how many "thinking" tokens it can use. I got it to draw me some pelicans - it has very good taste in SVG styles and comments.
  • www.marktechpost.com: Google Unveils Gemini 2.5 Flash in Preview through the Gemini API via Google AI Studio and Vertex AI.
  • LearnAI: Start building with Gemini 2.5 Flash
  • www.infoworld.com: Google previews Gemini 2.5 Flash hybrid reasoning model
  • MarkTechPost: Google Unveils Gemini 2.5 Flash in Preview through the Gemini API via Google AI Studio and Vertex AI.
  • Google DeepMind Blog: Gemini 2.5 Flash is our first fully hybrid reasoning model, giving developers the ability to turn thinking on or off.
  • learn.aisingapore.org: Start building with Gemini 2.5 Flash
  • www.marketingaiinstitute.com: This blog post highlights Google Cloud Next '25 event reveals, including Gemini 2.5 Pro, AI Agents, and more.
  • bsky.app: Gemini 2.5 Pro and Flash now have the ability to return image segmentation masks on command, as base64 encoded PNGs embedded in JSON strings I vibe coded an interactive tool for exploring this new capability - it costs a fraction of a cent per image
  • Last Week in AI: Last Week in AI discussing GPT 4.1 and Gemini 2.5 Flash
  • TestingCatalog: Testing Catalog about Gemini’s Scheduled Actions may offer AI task scheduling
  • The Official Google Blog: This model allows for adjustable thinking budgets, enabling users to control costs and choose the level of reasoning needed for specific tasks.
  • simonwillison.net: The model also allows developers to set thinking budgets to find the right tradeoff between quality, cost, and latency. Gemini AI Studio product lead Logan Kilpatrick :
  • Analytics Vidhya: 7 Things Gemini 2.5 Pro Does Better Than Any Other Chatbot!
  • Last Week in AI: OpenAI’s new GPT-4.1 AI models focus on coding, Google’s newest Gemini AI model focuses on efficiency, and more!
  • Simon Willison: Turns out Gemini 2.5 Flash non-thinking mode can do the same trick at an even lower cost... 0.0119 cents (around 1/100th of a cent) Notes here, including how I upgraded my tool to use the non-thinking model by vibe coding o4-mini:
  • techcrunch.com: Google’s newest Gemini AI model focuses on efficiency, and more!
  • www.analyticsvidhya.com: o3 vs o4-mini vs Gemini 2.5 pro: The Ultimate Reasoning Battle
  • Digital Information World: Google launches Gemini 2.5 Flash model with hybrid reasoning, multimodal support, and cost-effective token pricing.
  • IEEE Spectrum: This article discusses the release of Google's new leading-edge LLM, Gemini 2.5 Pro, which has attracted much attention and interest.
  • www.analyticsvidhya.com: This article explores the capabilities of Gemini 2.5 Pro and compares it to other AI chatbots.
  • Analytics Vidhya: o3 vs o4-mini vs Gemini 2.5 pro: The Ultimate Reasoning Battle
  • TechHQ: Google unveils “reasoning dial†for Gemini 2.5 flash: thinking vs. cost
  • techhq.com: Google unveils “reasoning dial†for Gemini 2.5 flash: thinking vs. cost
  • Last Week in AI: Last Week in AI #307 - GPT 4.1, o3, o4-mini, Gemini 2.5 Flash, Veo 2
  • Towards AI: Google's Gemini 2.5 Flash model with reasoning control allows for greater precision and control in AI applications, optimizing resources and cost.
  • www.artificialintelligence-news.com: Google's Gemini 2.5 Flash model features a "thinking budget" that allows developers to restrict processing power for problem-solving, addressing concerns about excessive resource consumption.
  • AI News: Google has introduced an AI reasoning control mechanism for its Gemini 2.5 Flash model that allows developers to limit how much processing power the system expends on problem-solving. Released on April 17, this “thinking budget†feature responds to a growing industry challenge: advanced AI models frequently overanalyse straightforward queries, consuming unnecessary computational resources and driving

Chris McKay@Maginative //
OpenAI has released its latest AI models, o3 and o4-mini, designed to enhance reasoning and tool use within ChatGPT. These models aim to provide users with smarter and faster AI experiences by leveraging web search, Python programming, visual analysis, and image generation. The models are designed to solve complex problems and perform tasks more efficiently, positioning OpenAI competitively in the rapidly evolving AI landscape. Greg Brockman from OpenAI noted the models "feel incredibly smart" and have the potential to positively impact daily life and solve challenging problems.

The o3 model stands out due to its ability to use tools independently, which enables more practical applications. The model determines when and how to utilize tools such as web search, file analysis, and image generation, thus reducing the need for users to specify tool usage with each query. The o3 model sets new standards for reasoning, particularly in coding, mathematics, and visual perception, and has achieved state-of-the-art performance on several competition benchmarks. The model excels in programming, business, consulting, and creative ideation.

Usage limits for these models vary, with o3 at 50 queries per week, and o4-mini at 150 queries per day, and o4-mini-high at 50 queries per day for Plus users, alongside 10 Deep Research queries per month. The o3 model is available to ChatGPT Pro and Team subscribers, while the o4-mini models are used across ChatGPT Plus. OpenAI says o3 is also beneficial in generating and critically evaluating novel hypotheses, especially in biology, mathematics, and engineering contexts.

Recommended read:
References :
  • Simon Willison's Weblog: OpenAI are really emphasizing tool use with these: For the first time, our reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images. Critically, these models are trained to reason about when and how to use tools to produce detailed and thoughtful answers in the right output formats, typically in under a minute, to solve more complex problems.
  • the-decoder.com: OpenAI’s new o3 and o4-mini models reason with images and tools
  • venturebeat.com: OpenAI launches o3 and o4-mini, AI models that ‘think with images’ and use tools autonomously
  • www.analyticsvidhya.com: o3 and o4-mini: OpenAI’s Most Advanced Reasoning Models
  • www.tomsguide.com: OpenAI's o3 and o4-mini models
  • Maginative: OpenAI’s latest models—o3 and o4-mini—introduce agentic reasoning, full tool integration, and multimodal thinking, setting a new bar for AI performance in both speed and sophistication.
  • THE DECODER: OpenAI’s new o3 and o4-mini models reason with images and tools
  • Analytics Vidhya: o3 and o4-mini: OpenAI’s Most Advanced Reasoning Models
  • www.zdnet.com: These new models are the first to independently use all ChatGPT tools.
  • The Tech Basic: OpenAI recently released its new AI models, o3 and o4-mini, to the public. Smart tools employ pictures to address problems through pictures, including sketch interpretation and photo restoration.
  • thetechbasic.com: OpenAI’s new AI Can “See†and Solve Problems with Pictures
  • www.marktechpost.com: OpenAI Introduces o3 and o4-mini: Progressing Towards Agentic AI with Enhanced Multimodal Reasoning
  • MarkTechPost: OpenAI Introduces o3 and o4-mini: Progressing Towards Agentic AI with Enhanced Multimodal Reasoning
  • analyticsindiamag.com: Access to o3 and o4-mini is rolling out today for ChatGPT Plus, Pro, and Team users.
  • THE DECODER: OpenAI is expanding its o-series with two new language models featuring improved tool usage and strong performance on complex tasks.
  • gHacks Technology News: OpenAI released its latest models, o3 and o4-mini, to enhance the performance and speed of ChatGPT in reasoning tasks.
  • www.ghacks.net: OpenAI Launches o3 and o4-Mini models to improve ChatGPT's reasoning abilities
  • Data Phoenix: OpenAI releases new reasoning models o3 and o4-mini amid intense competition. OpenAI has launched o3 and o4-mini, which combine sophisticated reasoning capabilities with comprehensive tool integration.
  • Shelly Palmer: OpenAI Quietly Reshapes the Landscape with o3 and o4-mini. OpenAI just rolled out a major update to ChatGPT, quietly releasing three new models (o3, o4-mini, and o4-mini-high) that offer the most advanced reasoning capabilities the company has ever shipped.
  • THE DECODER: Safety assessments show that OpenAI's o3 is probably the company's riskiest AI model to date
  • shellypalmer.com: OpenAI Quietly Reshapes the Landscape with o3 and o4-mini
  • BleepingComputer: OpenAI details ChatGPT-o3, o4-mini, o4-mini-high usage limits
  • TestingCatalog: OpenAI’s o3 and o4‑mini bring smarter tools and faster reasoning to ChatGPT
  • simonwillison.net: Introducing OpenAI o3 and o4-mini
  • bdtechtalks.com: What to know about o3 and o4-mini, OpenAI’s new reasoning models
  • bdtechtalks.com: What to know about o3 and o4-mini, OpenAI’s new reasoning models
  • thezvi.wordpress.com: OpenAI has finally introduced us to the full o3 along with o4-mini. Greg Brockman (OpenAI): Just released o3 and o4-mini! These models feel incredibly smart. We’ve heard from top scientists that they produce useful novel ideas. Excited to see their …
  • thezvi.wordpress.com: OpenAI has upgraded its entire suite of models. By all reports, they are back in the game for more than images. GPT-4.1 and especially GPT-4.1-mini are their new API non-reasoning models.
  • felloai.com: OpenAI has just launched a brand-new series of GPT models—GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano—that promise major advances in coding, instruction following, and the ability to handle incredibly long contexts.
  • Interconnects: OpenAI's o3: Over-optimization is back and weirder than ever
  • www.ishir.com: OpenAI has released o3 and o4-mini, adding significant reasoning capabilities to its existing models. These advancements will likely transform the way users interact with AI-powered tools, making them more effective and versatile in tackling complex problems.
  • www.bigdatawire.com: OpenAI released the models o3 and o4-mini that offer advanced reasoning capabilities, integrated with tool use, like web searches and code execution.
  • Drew Breunig: OpenAI's o3 and o4-mini models offer enhanced reasoning capabilities in mathematical and coding tasks.
  • TestingCatalog: OpenAI’s o3 and o4-mini bring smarter tools and faster reasoning to ChatGPT
  • www.techradar.com: ChatGPT model matchup - I pitted OpenAI's o3, o4-mini, GPT-4o, and GPT-4.5 AI models against each other and the results surprised me
  • www.techrepublic.com: OpenAI’s o3 and o4-mini models are available now to ChatGPT Plus, Pro, and Team users. Enterprise and education users will get access next week.
  • Last Week in AI: OpenAI’s new GPT-4.1 AI models focus on coding, OpenAI launches a pair of AI reasoning models, o3 and o4-mini, Google’s newest Gemini AI model focuses on efficiency, and more!
  • techcrunch.com: OpenAI’s new reasoning AI models hallucinate more.
  • computational-intelligence.blogspot.com: OpenAI's new reasoning models, o3 and o4-mini, are a step up in certain capabilities compared to prior models, but their accuracy is being questioned due to increased instances of hallucinations.
  • www.unite.ai: unite.ai article discussing OpenAI's o3 and o4-mini new possibilities through multimodal reasoning and integrated toolsets.
  • Unite.AI: On April 16, 2025, OpenAI released upgraded versions of its advanced reasoning models.
  • Digital Information World: OpenAI’s Latest o3 and o4-mini AI Models Disappoint Due to More Hallucinations than Older Models
  • techcrunch.com: TechCrunch reports on OpenAI's GPT-4.1 models focusing on coding.
  • Analytics Vidhya: o3 vs o4-mini vs Gemini 2.5 pro: The Ultimate Reasoning Battle
  • THE DECODER: OpenAI's o3 achieves near-perfect performance on long context benchmark.
  • the-decoder.com: OpenAI's o3 achieves near-perfect performance on long context benchmark
  • www.analyticsvidhya.com: AI models keep getting smarter, but which one truly reasons under pressure? In this blog, we put o3, o4-mini, and Gemini 2.5 Pro through a series of intense challenges: physics puzzles, math problems, coding tasks, and real-world IQ tests.
  • Simon Willison's Weblog: This post explores the use of OpenAI's o3 and o4-mini models for conversational AI, highlighting their ability to use tools in their reasoning process. It also discusses the concept of
  • Simon Willison's Weblog: The benchmark score on OpenAI's internal PersonQA benchmark (as far as I can tell no further details of that evaluation have been shared) going from 0.16 for o1 to 0.33 for o3 is interesting, but I don't know if it it's interesting enough to produce dozens of headlines along the lines of "OpenAI's o3 and o4-mini hallucinate way higher than previous models"
  • techstrong.ai: Techstrong.ai reports OpenAI o3, o4 Reasoning Models Have Some Kinks.
  • www.marktechpost.com: OpenAI Releases a Practical Guide to Identifying and Scaling AI Use Cases in Enterprise Workflows
  • Towards AI: OpenAI's o3 and o4-mini models have demonstrated promising improvements in reasoning tasks, particularly their use of tools in complex thought processes and enhanced reasoning capabilities.
  • Analytics Vidhya: In this article, we explore how OpenAI's o3 reasoning model stands out in tasks demanding analytical thinking and multi-step problem solving, showcasing its capability in accessing and processing information through tools.
  • pub.towardsai.net: TAI#149: OpenAI’s Agentic o3; New Open Weights Inference Optimized Models (DeepMind Gemma, Nvidia…
  • composio.dev: OpenAI o3 vs. Gemini 2.5 Pro vs. o4-mini
  • Composio: OpenAI o3 and o4-mini are out. They are two reasoning state-of-the-art models. They’re expensive, multimodal, and super efficient at tool use.

Carl Franzen@AI News | VentureBeat //
References: AIwire , Composio , www.aiwire.net ...
Meta has recently unveiled its Llama 4 AI models, marking a significant advancement in the field of open-source AI. The release includes Llama 4 Maverick and Llama 4 Scout, with Llama 4 Behemoth and Llama 4 Reasoning expected to follow. These models are designed to be more efficient and capable than their predecessors, with a focus on improving reasoning, coding, and creative writing abilities. The move is seen as a response to the growing competition in the AI landscape, particularly from models like DeepSeek, which have demonstrated impressive performance at a lower cost.

The Llama 4 family employs a Mixture of Experts (MoE) architecture for enhanced efficiency. Llama 4 Maverick is a 400 billion parameter sparse model with 17 billion active parameters and 128 experts, making it suitable for general assistant and chat use cases. Llama 4 Scout, with 109 billion parameters and 17 billion active parameters across 16 experts, stands out with its 10 million token context window, enabling it to handle extensive text and large documents effectively, making it suitable for multi-document summarization and parsing extensive user activity. Meta's decision to release these models before LlamaCon gives developers ample time to experiment with them.

While Llama 4 Maverick shows strength in areas such as large context retrieval and writing detailed responses, benchmarks indicate that DeepSeek v3 0324 outperforms it in coding and common-sense reasoning. Meta is also exploring the intersection of neuroscience and AI, with researchers like Jean-Rémi King investigating cognitive principles in artificial architectures. This interdisciplinary approach aims to further improve the reasoning and understanding capabilities of AI models, potentially leading to more advanced and human-like AI systems.

Recommended read:
References :

Carl Franzen@AI News | VentureBeat //
References: bsky.app , AI News | VentureBeat , Groq ...
Meta has unveiled its latest advancements in AI with the Llama 4 family of models, consisting of Llama 4 Scout, Maverick, and the upcoming Behemoth. These models are designed for a variety of AI tasks, ranging from general chat to document summarization and advanced reasoning. Llama 4 Maverick, with 17 billion active parameters, is positioned as a general-purpose model ideal for image and text understanding tasks, making it suitable for chat applications and AI assistants. Llama 4 Scout is designed for document summarization.

Meta is emphasizing efficiency and accessibility with Llama 4. Both the Maverick and Scout models are designed to run efficiently, even on a single NVIDIA H100 GPU, showcasing Meta’s dedication to balancing high performance with reasonable resource consumption. TheSequence #530 highlights that Llama 4 brings unquestionable technical innovations. Furthermore, the Llama 4 series introduces three distinct models—Scout, Maverick, and Behemoth—designed for a range of use cases, from general-purpose reasoning to long-context and multimodal applications.

The release of Llama 4 includes enhancements beyond its technical capabilities. In the UK, Ray-Ban Meta glasses are receiving an upgrade to integrate Meta AI features, enabling users to interact with their surroundings through questions and receive intelligent, context-aware responses. Soon to follow is the rollout of live translation on these glasses, facilitating real-time speech translation between English, Spanish, Italian, and French, further enhancing the user experience and accessibility.

Recommended read:
References :
  • bsky.app: Meta just dropped Llama 4 on a weekend! Two new open weight models (Scout and Maverick) and a preview of a model called Behemoth - Scout has a 10 million token context Best information right now appears to be this blog post:
  • AI News | VentureBeat: While DeepSeek R1 and OpenAI o1 edge out Behemoth on a couple metrics, Llama 4 Behemoth remains highly competitive.
  • Maginative: Meta has released Llama 4 Scout and Maverick, two open-weight AI models designed for multimodal reasoning, with Maverick outperforming GPT-4o and Scout offering a record-breaking 10M token context window.
  • Groq: Meta’s Llama 4 Scout and Maverick models are live today on GroqCloudâ„¢, giving developers and enterprises day-zero access to the most advanced open-source AI models available. Today, Meta released the first models in the Llama 4 herd, which will enable people to build more personalized multimodal experiences. With Llama 4 Scout and Llama 4 Maverick […]
  • SLVIKI.ORG: Meta Unleashes Llama 4: The Future of Open-Source AI Just Got Smarter
  • Analytics Vidhya: Llama 4 Models: Meta AI is Open Sourcing the Best!
  • MarkTechPost: Meta AI Just Released Llama 4 Scout and Llama 4 Maverick: The First Set of Llama 4 Models
  • Ken Yeung: Meta Launches Llama 4 Scout and Maverick, Open-Weight Multimodal Models That Outperform GPT-4 and Gemini
  • Analytics India Magazine: Meta Releases First Two Multimodal Llama 4 Models, Plans Two Trillion Parameter Model
  • NVIDIA Technical Blog: developer.nvidia.com
  • Resemble AI: Meta’s LLaMA 4 is the latest generation of large language models (LLMs) from Meta AI, unveiled on April 5, 2025. It represents a significant leap in Meta’s AI capabilities and open-source AI strategy.
  • Databricks: Introducing Meta's Llama 4 on the Databricks platform.
  • The Cloudflare Blog: Meta’s Llama 4 is now available on Workers AI: use this multimodal, Mixture of Experts AI model on Cloudflare's serverless AI platform to build next-gen AI applications.
  • Harald Klinke: Meta has unveiled Llama 4, its latest AI model, featuring advanced multimodal capabilities that integrate text, video, images, and audio processing.
  • Simon Willison: Meta just dropped Llama 4 on a weekend! Two new open weight models (Scout and Maverick) and a preview of a model called Behemoth - Scout has a 10 million token context Best information right now appears to be this blog post:
  • www.analyticsvidhya.com: Analytics Vidhya reports how to access Meta's Llama 4 models via API.
  • bsky.app: Meta just dropped Llama 4 on a weekend! Two new open weight models (Scout and Maverick) and a preview of a model called Behemoth - Scout has a 10 million token context Best information right now appears to be this blog post: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
  • twitter.com: Meta AI has released Llama 4 Scout & Llama 4 Maverick, and is previewing Llama 4 Behemoth. Llama 4 Scout is highest performing small model with 17B activated parameters with 16 experts. It’s crazy fast, natively multimodal, and very smart. It achieves an industry leading 10M+ token context window and can also run on a single GPU ! Llama 4 Maverick is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding – at less than half the active parameters. It offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena. It can also run on a single host ! Previewing Llama 4 Behemoth , our most powerful model yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight. — , VP and Head of GenAI at Meta
  • Analytics Vidhya: How to Access Meta’s Llama 4 Models via API
  • bdtechtalks.com: What to know about Meta’s Llama 4 model family
  • slviki.org: Meta just dropped a major update in the AI arms race—and it’s not subtle.
  • the-decoder.com: Meta has released the first two models in its Llama 4 series, marking the company’s initial deployment of a multimodal architecture built from the ground up.
  • TheSequence: A major release for open source generative AI.
  • www.resemble.ai: Meta’s LLaMA 4 is the latest generation of large language models (LLMs) from Meta AI, unveiled on April 5, 2025.
  • TestingCatalog: Llama 4 brings 10M token context and MoE architecture with 3 new models
  • bdtechtalks.com: Meta releases Llama 4, a potent suite of LLMs challenging rivals with innovative multimodal capabilities.
  • The Verge: Meta has released Llama 4 Scout and Maverick, which outperform counterparts from OpenAI and Google in various benchmarks.
  • Harald Klinke: Meta has unveiled two new AI models, Llama 4 Scout and Llama 4 Maverick, now integrated into WhatsApp, Messenger, and Instagram Direct.
  • THE DECODER: Meta releases first multimodal Llama-4 models, leaves EU out in the cold
  • simonwillison.net: Discussion of Llama 4's technical capabilities and potential impact.
  • SLVIKI.ORG: Meta Unleashes Llama 4: A Leap Forward in Multimodal AI
  • slviki.org: Meta Platforms has officially unveiled its Llama 4 family of artificial intelligence models, pushing the boundaries of what generative AI systems can do.
  • www.tomsguide.com: Meta just launched Llama 4 — here's why ChatGPT, Gemini and Claude should be worried
  • www.techradar.com: Meta launches new Llama 4 AI for all your apps, but it still feels limited compared to what ChatGPT and Gemini can do
  • www.ghacks.net: Meta launches Llama 4 with three new AI models: Scout, Maverick, and Behemoth. These new iterations could give Meta a […] Thank you for being a Ghacks reader. The post appeared first on .
  • www.infoq.com: Meta has officially released the first models in its new Llama 4 family—Scout and Maverick—marking a step forward in its open-weight large language model ecosystem.
  • felloai.com: Llama 4 Just Arrived — an Open-Source AI Model from Meta That Beats GPT-4.5
  • Fello AI: The new Llama 4 models, Scout and Maverick, represent a significant leap forward in the capabilities of generative AI systems. The models' multimodal nature enables them to process and generate content across various formats, including text, images, video, and audio.
  • oodaloop.com: Meta on Saturday released the first models from its latest open-source artificial intelligence software Llama 4, as the company scrambles to lead the race to invest in generative AI.
  • www.itnews.com.au: Named the Llama 4 Scout and Llama 4 Maverick.
  • Shelly Palmer: Meta announced Llama 4, the latest iteration of its large language model series.
  • Gradient Flow: Llama 4: What You Need to Know
  • The Algorithmic Bridge: Details about AI progress and Meta's Llama 4 model.
  • www.artificialintelligence-news.com: Meta has unveiled Llama 4, its latest AI model, featuring advanced multimodal capabilities that integrate text, video, images, and audio processing.
  • the-decoder.com: Initial evaluations show promising results in standard tests but reveal difficulties with handling extensive context. The introduction of a mixture of experts architecture is a significant advance in Meta's AI models.
  • gHacks Technology News: Llama 4 Scout and Llama 4 Maverick, open-source models, are now available across various platforms.
  • Last Week in AI: Meta releases Llama 4, a new crop of flagship AI models, Amazon unveils Nova Act, an AI agent that can control a web browser
  • www.aiwire.net: Meta Unleashes New Llama 4 AI Models
  • analyticsindiamag.com: Llama 4 models, including Scout and Maverick, are now live on its platform, allowing developers to build and deploy AI applications at competitive pricing. The post appeared first on .
  • Simon Willison's Weblog: The Llama series have been re-designed to use state of the art mixture-of-experts (MoE) architecture and natively trained with multimodality. We’re dropping Llama 4 Scout & Llama 4 Maverick, and previewing Llama 4 Behemoth. 📌 Llama 4 Scout is highest performing small model with 17B activated parameters with 16 experts. It’s crazy fast, natively multimodal, and very smart. It achieves an industry leading 10M+ token context window and can also run on a single GPU ! 📌 Llama 4 Maverick is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding – at less than half the active parameters. It offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena. It can also run on a single host ! 📌 Previewing Llama 4 Behemoth , our most powerful model yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight. — , VP and Head of GenAI at Meta Tags: , , , ,
  • AIwire: Meta Unleashes New Llama 4 AI Models
  • THE DECODER: Meta's Llama 4 models show promise on standard tests, but struggle with long-context tasks
  • the-decoder.com: Meta’s Llama 4 models show promise on standard tests, but struggle with long-context tasks
  • Last Week in AI: Llama 4, Nova Act, xAI buys X, PaperBench
  • RunPod Blog: Llama-4 Scout and Maverick Are Here—How Do They Shape Up?
  • techcrunch.com: Meta releases Llama 4, a new crop of flagship AI models
  • : New from 404 Media: Facebook has deliberately pushed its Llama 4 AI model to the right in an attempt to show "both sides." Obviously that is dangerous/stupid if the AI is talking about climate change, health etc which are scientific fact but clouded by politics
  • 404 Media: Meta’s Llama 4 model is worried about left leaning bias in the data, and wants to be more like Elon Musk’s Grok.
  • www.itpro.com: Meta executive denies hyping up Llama 4 benchmark scores – but what can users expect from the new models?
  • Composio: Notes on Llama 4: The Hits, the Misses, and the Disasters
  • composio.dev: The Llama 4 is here, and this time, the Llama family has three different models: Llama 4 Scout, Maverick, and Behemoth. While
  • thezvi.wordpress.com: Llama Does Not Look Good 4 Anything
  • AI News | VentureBeat: DeepCoder delivers top coding performance in efficient 14B open model
  • TheSequence: The Sequence #530: A Tech Deep Dive Into Llama 4
  • www.aiwire.net: Meta Unleashes New Llama 4 AI Models
  • composio.dev: Llama 4 Maverick vs. Deepseek v3 0324
  • Composio: Llama 4 Maverick vs. Deepseek v3 0324
  • Analytics Vidhya: Building an AI Agent with Llama 4 and AutoGen
  • AIwire: Meta Unleashes New Llama 4 AI Models
  • www.analyticsvidhya.com: Building an AI Agent with Llama 4 and AutoGen
  • Digital Information World: Meta’s AI Faces Legal Fire as Authors, Scholars Unite Over Copyright Clash

Michael Nuñez@AI News | VentureBeat //
Anthropic has been at the forefront of investigating how AI models like Claude process information and make decisions. Their scientists developed interpretability techniques that have unveiled surprising behaviors within these systems. Research indicates that large language models (LLMs) are capable of planning ahead, as demonstrated when writing poetry or solving problems, and that they sometimes work backward from a desired conclusion rather than relying solely on provided facts.

Anthropic researchers also tested the "faithfulness" of CoT models' reasoning by giving them hints in their answers, and see if they will acknowledge it. The study found that reasoning models often avoided mentioning that they used hints in their responses. This raises concerns about the reliability of chains-of-thought (CoT) as a tool for monitoring AI systems for misaligned behaviors, especially as these models become more intelligent and integrated into society. The research emphasizes the need for ongoing efforts to enhance the transparency and trustworthiness of AI reasoning processes.

Recommended read:
References :
  • venturebeat.com: Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies
  • The Algorithmic Bridge: AI Is Learning to Reason. Humans May Be Holding It Back
  • THE DECODER: Anthropic study finds language models often hide their reasoning process
  • MarkTechPost: Anthropic’s Evaluation of Chain-of-Thought Faithfulness: Investigating Hidden Reasoning, Reward Hacks, and the Limitations of Verbal AI Transparency in Reasoning Models
  • MarkTechPost: This AI Paper from Anthropic Introduces Attribution Graphs: A New Interpretability Method to Trace Internal Reasoning in Claude 3.5 Haiku
  • www.marktechpost.com: This AI Paper from Anthropic Introduces Attribution Graphs: A New Interpretability Method to Trace Internal Reasoning in Claude 3.5 Haiku

@Google DeepMind Blog //
Researchers are making strides in understanding how AI models think. Anthropic has developed an "AI microscope" to peek into the internal processes of its Claude model, revealing how it plans ahead, even when generating poetry. This tool provides a limited view of how the AI processes information and reasons through complex tasks. The microscope suggests that Claude uses a language-independent internal representation, a "universal language of thought", for multilingual reasoning.

The team at Google DeepMind introduced JetFormer, a new Transformer designed to directly model raw data. This model, capable of both understanding and generating text and images seamlessly, maximizes the likelihood of raw data without depending on any pre-trained components. Additionally, a comprehensive benchmark called FACTS Grounding has been introduced to evaluate the factuality of large language models (LLMs). This benchmark measures how accurately LLMs ground their responses in provided source material and avoid hallucinations, aiming to improve trust and reliability in AI-generated information.

Recommended read:
References :
  • Google DeepMind Blog: FACTS Grounding: A new benchmark for evaluating the factuality of large language models
  • THE DECODER: Anthropic's AI microscope reveals how Claude plans ahead when generating poetry

Ryan Daws@AI News //
References: THE DECODER , venturebeat.com , AI News ...
Anthropic has unveiled groundbreaking insights into the 'AI biology' of their advanced language model, Claude. Through innovative methods, researchers have been able to peer into the complex inner workings of the AI, demystifying how it processes information and learns strategies. This research provides a detailed look at how Claude "thinks," revealing sophisticated behaviors previously unseen, and showing these models are more sophisticated than previously understood.

These new methods allowed scientists to discover that Claude plans ahead when writing poetry and sometimes lies, showing the AI is more complex than previously thought. The new interpretability techniques, which the company dubs “circuit tracing” and “attribution graphs,” allow researchers to map out the specific pathways of neuron-like features that activate when models perform tasks. This approach borrows concepts from neuroscience, viewing AI models as analogous to biological systems.

This research, published in two papers, marks a significant advancement in AI interpretability, drawing inspiration from neuroscience techniques used to study biological brains. Joshua Batson, a researcher at Anthropic, highlighted the importance of understanding how these AI systems develop their capabilities, emphasizing that these techniques allow them to learn many things they “wouldn’t have guessed going in.” The findings have implications for ensuring the reliability, safety, and trustworthiness of increasingly powerful AI technologies.

Recommended read:
References :
  • THE DECODER: Anthropic and Databricks have entered a five-year partnership worth $100 million to jointly sell AI tools to businesses.
  • venturebeat.com: Anthropic has developed a new method for peering inside large language models like Claude, revealing for the first time how these AI systems process information and make decisions.
  • venturebeat.com: Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies
  • AI News: Anthropic provides insights into the ‘AI biology’ of Claude
  • www.techrepublic.com: ‘AI Biology’ Research: Anthropic Looks Into How Its AI Claude ‘Thinks’
  • THE DECODER: Anthropic's AI microscope reveals how Claude plans ahead when generating poetry
  • The Tech Basic: Anthropic Now Redefines AI Research With Self Coordinating Agent Networks

Maximilian Schreiner@THE DECODER //
Google DeepMind has announced Gemini 2.5 Pro, its latest and most advanced AI model to date. This new model boasts enhanced reasoning capabilities and improved accuracy, marking a significant step forward in AI development. Gemini 2.5 Pro is designed with built-in 'thinking' capabilities, enabling it to break down complex tasks into multiple steps and analyze information more effectively before generating a response. This allows the AI to deduce logical conclusions, incorporate contextual nuances, and make informed decisions with unprecedented accuracy, according to Google.

The Gemini 2.5 Pro has already secured the top position on the LMArena leaderboard, surpassing other AI models in head-to-head comparisons. This achievement highlights its superior performance and high-quality style in handling intricate tasks. The model also leads in math and science benchmarks, demonstrating its advanced reasoning capabilities across various domains. This new model is available as Gemini 2.5 Pro (experimental) on Google’s AI Studio and for Gemini Advanced users on the Gemini chat interface.

Recommended read:
References :
  • Google DeepMind Blog: Gemini 2.5: Our most intelligent AI model
  • Shelly Palmer: Google’s Gemini 2.5: AI That Thinks Before It Speaks
  • AI News: Gemini 2.5: Google cooks up its ‘most intelligent’ AI model to date
  • Interconnects: Gemini 2.5 Pro and Google's second chance with AI
  • SiliconANGLE: Google introduces Gemini 2.5 Pro with chain-of-thought reasoning built-in
  • AI News | VentureBeat: Google releases ‘most intelligent model to date,’ Gemini 2.5 Pro
  • Analytics Vidhya: Gemini 2.5 Pro is Now #1 on Chatbot Arena with Impressive Jump
  • www.tomsguide.com: Google unveils Gemini 2.5 — claims AI breakthrough with enhanced reasoning and multimodal power
  • Fello AI: Google’s Gemini 2.5 Shocks the World: Crushing AI Benchmark Like No Other AI Model!
  • bdtechtalks.com: What to know about Google Gemini 2.5 Pro
  • TestingCatalog: Gemini 2.5 Pro sets new AI benchmark and launches on AI Studio and Gemini
  • AI News | VentureBeat: Google’s Gemini 2.5 Pro is the smartest model you’re not using – and 4 reasons it matters for enterprise AI
  • thezvi.wordpress.com: Gemini 2.5 is the New SoTA
  • www.infoworld.com: Google has introduced version 2.5 of its , which the company said offers a new level of performance by combining an enhanced base model with improved post-training.
  • Composio: Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
  • Composio: Google dropped its best-ever creation, Gemini 2.5 Pro Experimental, on March 25. It is a stupidly incredible reasoning model shining on every The post first appeared on.
  • www.tomsguide.com: Gemini 2.5 Pro is now free to all users in surprise move
  • Analytics India Magazine: Did Google Just Build The Best AI Model for Coding?
  • www.zdnet.com: Everyone can now try Gemini 2.5 Pro - for free

Ryan Daws@AI News //
References: venturebeat.com , AI News ,
DeepSeek, a Chinese AI startup, is making waves in the artificial intelligence industry with its DeepSeek-V3 model. This model is demonstrating performance that rivals Western AI models like those from OpenAI and Anthropic, but at significantly lower development costs. The release of DeepSeek-V3 is seen as jumpstarting AI development across China, with other startups and established companies releasing their own advanced models, further fueling competition. This has narrowed the technology gap between China and the United States as China has adapted to and overcome international restrictions through creative approaches to AI development.

One particularly notable aspect of DeepSeek-V3 is its ability to run efficiently on consumer-grade hardware, such as the Mac Studio with an M3 Ultra chip. Reports indicate that the model achieves speeds of over 20 tokens per second on this platform, making it a potential "nightmare for OpenAI". This contrasts sharply with the data center requirements typically associated with state-of-the-art AI models. The company's focus on algorithmic efficiency has allowed them to achieve notable gains despite restricted access to the latest silicon, showcasing that Chinese AI innovation has flourished by focusing on algorithmic efficiency and novel approaches to model architecture.

Recommended read:
References :
  • venturebeat.com: DeepSeek-V3 now runs at 20 tokens per second on Mac Studio, and that’s a nightmare for OpenAI
  • AI News: DeepSeek disruption: Chinese AI innovation narrows global technology divide
  • GZERO Media: How DeepSeek changed China’s AI ambitions