@www.marktechpost.com
OpenAI has introduced HealthBench, a new open-source benchmark designed to evaluate AI performance in realistic healthcare scenarios. Developed in collaboration with 262 physicians, HealthBench uses 5,000 multi-turn conversations and over 48,000 rubric criteria to grade AI models across seven medical domains and 49 languages. The benchmark assesses AI responses on communication quality, instruction following, accuracy, contextual understanding, and completeness. OpenAI's latest models, including o3 and GPT-4.1, have shown strong results on this benchmark.
The most provocative finding from the HealthBench evaluation is that the newest AI models perform at or beyond the level of human experts in crafting responses to medical queries. With models from September 2024, physicians could improve AI outputs by editing them, and the edited responses scored higher than those written by physicians working without AI. With the April 2025 models, such as o3 and GPT-4.1, physicians who used the AI responses as a starting point did not, on average, improve them further. This suggests that, for the specific task of writing HealthBench responses, the newest models match or exceed human experts, even when those experts start from a strong AI draft.
In related news, FaceAge, a face-reading AI tool developed by researchers at Mass General Brigham, shows promise in predicting cancer outcomes. By analyzing facial photographs, FaceAge estimates a person's biological age and can predict cancer survival with 81% accuracy. It outperforms clinicians in predicting short-term life expectancy, especially for patients receiving palliative radiotherapy. FaceAge identifies subtle facial features associated with aging and provides a quantifiable measure of biological aging that correlates with survival outcomes and health risks, giving doctors a more objective basis for survival estimates.