@zdnet.com
//
Salesforce is tackling the challenge of "jagged intelligence" in AI, aiming to enhance the reliability and consistency of enterprise AI agents. The company's AI Research division has introduced new benchmarks, models, and guardrails designed to make these agents more intelligent, trusted, and versatile for business applications. This initiative seeks to bridge the gap between an AI system's potential intelligence and its ability to perform consistently in unpredictable real-world enterprise environments. Salesforce is focusing on "Enterprise General Intelligence" (EGI), which prioritizes consistency alongside capability for AI agents in complex business settings.
Salesforce AI Research is addressing AI's inconsistency problem with the SIMPLE dataset, a public benchmark of 225 reasoning questions designed to measure the "jaggedness" of AI systems. It has also introduced ContextualJudgeBench, which evaluates an agent's ability to stay accurate and faithful when answering from supplied context, emphasizing factual correctness and the ability to abstain from answering when appropriate, especially in sensitive fields such as law, finance, and healthcare. These tools are intended to diagnose and mitigate the erratic behavior of AI agents across tasks of similar complexity. A recent Salesforce survey of 2,552 U.S. consumers shows growing acceptance of AI agents, with roughly half (53%) wanting AI to simplify complex information. Salesforce is also expanding its Trust Layer with new safeguards, including the SFR-Guard model family, which detects prompt injections, toxic outputs, and hallucinations in both open-domain and CRM-specific data. Overall, the survey makes it clear that AI agents are already starting to have a societal impact.
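To make the idea of "jaggedness" concrete, the sketch below shows one way consistency could be quantified on a benchmark like SIMPLE: grade each question, group results by a difficulty label, and report how much accuracy varies across groups of similar complexity. The record fields, grading function, and grouping scheme are illustrative assumptions, not the published SIMPLE methodology.

```python
# Illustrative sketch: quantifying "jaggedness" as the spread of accuracy
# across question groups of similar complexity.
# The record fields ("question", "answer", "difficulty") and the exact-match
# grader are assumptions for illustration, not the SIMPLE benchmark's format.
from collections import defaultdict
from statistics import pstdev

def grade(prediction: str, reference: str) -> bool:
    # Naive exact-match grading; a real benchmark would use a stricter rubric.
    return prediction.strip().lower() == reference.strip().lower()

def jaggedness(records, predict):
    """Return mean accuracy and its spread (std dev) across difficulty groups."""
    per_group = defaultdict(list)
    for r in records:
        correct = grade(predict(r["question"]), r["answer"])
        per_group[r["difficulty"]].append(correct)
    accuracies = [sum(v) / len(v) for v in per_group.values()]
    mean_acc = sum(accuracies) / len(accuracies)
    return mean_acc, pstdev(accuracies)  # high spread = "jagged" behavior

# Example usage with a stub model that always answers "4":
records = [
    {"question": "2 + 2?", "answer": "4", "difficulty": "easy"},
    {"question": "17 * 23?", "answer": "391", "difficulty": "medium"},
]
print(jaggedness(records, lambda q: "4"))
```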
@the-decoder.com
//
OpenAI is actively benchmarking its language models, including o3 and o4-mini, against competitors like Gemini 2.5 Pro to evaluate their reasoning performance and tool-use efficiency. Benchmarks like the Aider polyglot coding test show that o3 leads in some areas, achieving a new state-of-the-art score of 79.60% compared to Gemini 2.5's 72.90%. However, this performance comes at a higher cost, with o3 being significantly more expensive. o4-mini offers a slightly more balanced price-performance ratio, costing less than o3 while still surpassing Gemini 2.5 on certain tasks. Testing reveals that Gemini 2.5 excels in context awareness and iterating on code, making it preferable for real-world use cases, while o4-mini surprisingly excels in competitive programming.
OpenAI has just launched its GPT-Image-1 model for image generation to developers via API; previously, the model was only accessible through ChatGPT. The model can create images across diverse styles, follow custom guidelines, draw on world knowledge, and accurately render text. The company's blog post said this unlocks countless practical applications across multiple domains, and several enterprises and startups are already incorporating the model into creative projects, products, and experiences. Image processing with GPT-Image-1 is billed by tokens: text input tokens (the prompt text) cost $5 per 1 million tokens, image input tokens cost $10 per million tokens, and image output tokens (the generated image) cost a whopping $40 per million tokens. Depending on the selected image quality, costs typically range from $0.02 to $0.19 per image.
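As a quick sanity check on the quoted pricing, the snippet below estimates a per-request cost from token counts at the rates above ($5, $10, and $40 per million tokens for text input, image input, and image output, respectively). The token counts are made-up placeholders; actual counts depend on prompt length, any input images, and the selected output quality.

```python
# Back-of-the-envelope GPT-Image-1 cost estimate from the per-million-token
# prices quoted above. Token counts below are placeholders, not measurements.
TEXT_INPUT_PER_M = 5.00     # $ per 1M text input tokens
IMAGE_INPUT_PER_M = 10.00   # $ per 1M image input tokens
IMAGE_OUTPUT_PER_M = 40.00  # $ per 1M image output tokens

def estimate_cost(text_in: int, image_in: int, image_out: int) -> float:
    """Estimate the dollar cost of one GPT-Image-1 request."""
    return (text_in * TEXT_INPUT_PER_M
            + image_in * IMAGE_INPUT_PER_M
            + image_out * IMAGE_OUTPUT_PER_M) / 1_000_000

# Hypothetical request: a short prompt, no input image, ~4,000 output tokens.
print(f"${estimate_cost(text_in=150, image_in=0, image_out=4000):.4f}")
```

With these placeholder counts the estimate comes out around $0.16, which sits within the $0.02 to $0.19 per-image range quoted above.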