@zdnet.com
//
Salesforce is tackling the challenge of "jagged intelligence" in AI, aiming to enhance the reliability and consistency of enterprise AI agents. The company's AI Research division has introduced new benchmarks, models, and guardrails designed to make these agents more intelligent, trusted, and versatile for business applications. This initiative seeks to bridge the gap between an AI system's potential intelligence and its ability to perform consistently in unpredictable real-world enterprise environments. Salesforce is focusing on "Enterprise General Intelligence" (EGI), which prioritizes consistency alongside capability for AI agents in complex business settings.
Salesforce AI Research is addressing AI's inconsistency problem with the SIMPLE dataset, a public benchmark of 225 reasoning questions designed to measure the "jaggedness" of AI systems. It has also introduced ContextualJudgeBench, which evaluates an agent's ability to stay accurate and faithful when answering from supplied context, emphasizing factual correctness and the ability to abstain from answering when appropriate, especially in sensitive fields such as law, finance, and healthcare. These tools are intended to diagnose and mitigate the erratic behavior of AI agents across tasks of similar complexity. A recent Salesforce survey of 2,552 U.S. consumers shows growing acceptance of AI agents, with roughly half (53%) wanting AI to simplify complex information. Salesforce is also expanding its Trust Layer with new safeguards, including the SFR-Guard model family, which detects prompt injections, toxic outputs, and hallucinations in both open-domain and CRM-specific data. Overall, the survey makes it clear that AI agents are already starting to have a societal impact.
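To make the idea of "jaggedness" concrete, the sketch below shows one way consistency could be quantified on a benchmark like SIMPLE: grade each question, group results by a difficulty label, and report how much accuracy varies across groups of similar complexity. The record fields, grading function, and grouping scheme are illustrative assumptions, not the published SIMPLE methodology.

```python
# Illustrative sketch: quantifying "jaggedness" as the spread of accuracy
# across question groups of similar complexity.
# The record fields ("question", "answer", "difficulty") and the exact-match
# grader are assumptions for illustration, not the SIMPLE benchmark's format.
from collections import defaultdict
from statistics import pstdev

def grade(prediction: str, reference: str) -> bool:
    # Naive exact-match grading; a real benchmark would use a stricter rubric.
    return prediction.strip().lower() == reference.strip().lower()

def jaggedness(records, predict):
    """Return mean accuracy and its spread (std dev) across difficulty groups."""
    per_group = defaultdict(list)
    for r in records:
        correct = grade(predict(r["question"]), r["answer"])
        per_group[r["difficulty"]].append(correct)
    accuracies = [sum(v) / len(v) for v in per_group.values()]
    mean_acc = sum(accuracies) / len(accuracies)
    return mean_acc, pstdev(accuracies)  # high spread = "jagged" behavior

# Example usage with a stub model that always answers "4":
records = [
    {"question": "2 + 2?", "answer": "4", "difficulty": "easy"},
    {"question": "17 * 23?", "answer": "391", "difficulty": "medium"},
]
print(jaggedness(records, lambda q: "4"))
```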
@the-decoder.com
//
OpenAI is actively benchmarking its language models, including o3 and o4-mini, against competitors like Gemini 2.5 Pro to evaluate their reasoning performance and tool-use efficiency. Benchmarks like the Aider polyglot coding test show that o3 leads in some areas, achieving a new state-of-the-art score of 79.60% compared to Gemini 2.5's 72.90%. However, this performance comes at a higher cost, with o3 being significantly more expensive. o4-mini offers a slightly more balanced price-performance ratio, costing less than o3 while still surpassing Gemini 2.5 on certain tasks. Testing reveals that Gemini 2.5 excels in context awareness and iterating on code, making it preferable for real-world use cases, while o4-mini surprisingly excels in competitive programming.
OpenAI has just launched its GPT-Image-1 model for image generation to developers via API; previously, the model was only accessible through ChatGPT. The model can create images across diverse styles, follow custom guidelines, draw on world knowledge, and accurately render text. The company's blog post said this unlocks countless practical applications across multiple domains, and several enterprises and startups are already incorporating the model into creative projects, products, and experiences. Image processing with GPT-Image-1 is billed by tokens: text input tokens (the prompt text) cost $5 per 1 million tokens, image input tokens cost $10 per million tokens, and image output tokens (the generated image) cost a whopping $40 per million tokens. Depending on the selected image quality, costs typically range from $0.02 to $0.19 per image.
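As a quick sanity check on the quoted pricing, the snippet below estimates a per-request cost from token counts at the rates above ($5, $10, and $40 per million tokens for text input, image input, and image output, respectively). The token counts are made-up placeholders; actual counts depend on prompt length, any input images, and the selected output quality.

```python
# Back-of-the-envelope GPT-Image-1 cost estimate from the per-million-token
# prices quoted above. Token counts below are placeholders, not measurements.
TEXT_INPUT_PER_M = 5.00     # $ per 1M text input tokens
IMAGE_INPUT_PER_M = 10.00   # $ per 1M image input tokens
IMAGE_OUTPUT_PER_M = 40.00  # $ per 1M image output tokens

def estimate_cost(text_in: int, image_in: int, image_out: int) -> float:
    """Estimate the dollar cost of one GPT-Image-1 request."""
    return (text_in * TEXT_INPUT_PER_M
            + image_in * IMAGE_INPUT_PER_M
            + image_out * IMAGE_OUTPUT_PER_M) / 1_000_000

# Hypothetical request: a short prompt, no input image, ~4,000 output tokens.
print(f"${estimate_cost(text_in=150, image_in=0, image_out=4000):.4f}")
```

With these placeholder counts the estimate comes out around $0.16, which sits within the $0.02 to $0.19 per-image range quoted above.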