OpenAI Gives Agents Voice via New Audio Models

Chris McKay@Maginative //

OpenAI Gives Agents Voice via New Audio Models

OpenAI has recently unveiled new audio models based on GPT-4o, significantly enhancing its text-to-speech and speech-to-text capabilities. These new tools are intended to give AI agents a voice, enabling a range of applications, with demonstrations including the ability for an AI to read emails in character. The announcement includes the introduction of new transcription models, specifically gpt-4o-transcribe and gpt-4o-mini-transcribe, which are designed to outperform the existing Whisper model.

The text-to-speech and speech-to-text tools are based on GPT-4o. While these models show promise, some experts have noted potential vulnerabilities. Like other large language model (LLM)-driven multi-modal models, they appear susceptible to prompt-injection-adjacent issues, stemming from the mixing of instructions and data within the same token stream. OpenAI hinted it may take a similar path with video.

Original img attribution: https://www.maginative.com/content/images/size/w1200/2025/03/GPT-4o-transcribe.jpg

ImgSrc: www.maginative.

References :

AI News | VentureBeat: OpenAIâ€™s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds
Analytics Vidhya: OpenAIâ€™s Audio Models: How to Access, Features, Applications, and More
Maginative: OpenAI Unveils New Audio Models to Make AI Agents Sound More Human Than Ever
bsky.app: I published some notes on OpenAI's new text-to-speech and speech-to-text models.
Samrat Man Singh: OpenAI announced some new audio models yesterday, including new transcription models( gpt-4o-transcribe and gpt-4o-mini-transcribe ).
www.techrepublic.com: The text-to-speech and speech-to-text tools are all based on GPT-4o. OpenAI hinted it may take a similar path with video.
MarkTechPost: Reports on OpenAI introducing advanced audio models.
Simon Willison's Weblog: OpenAI announced today, for both text-to-speech and speech-to-text. They're very promising new models, but they appear to suffer from the ever-present risk of accidental (or malicious) instruction following.
THE DECODER: OpenAI has released a new generation of audio models that let developers customize how their AI assistants speak.
venturebeat.com: DeepSeek-V3 now runs at 20 tokens per second on Mac Studio, and thatâ€™s a nightmare for OpenAI
Last Week in AI: #204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

Classification:

HashTags: #OpenAI #GPT4o #AudioAI
Company: OpenAI
Target: Users
Product: GPT-4o
Feature: Audio Models
Type: AI
Severity: Informative

News from the AI & ML world

DeeperML

OpenAI Gives Agents Voice via New Audio Models

Classification: