News from the AI & ML world

DeeperML

Chris McKay @ Maginative
OpenAI has unveiled new audio models based on GPT-4o, significantly upgrading its text-to-speech and speech-to-text capabilities. The tools are intended to give AI agents a voice, enabling a range of applications; demonstrations included an AI reading emails in character. The announcement also introduces two new transcription models, gpt-4o-transcribe and gpt-4o-mini-transcribe, which are designed to outperform the existing Whisper model.

While these models show promise, some experts have noted potential vulnerabilities. Like other multimodal models driven by large language models (LLMs), they appear susceptible to prompt-injection-adjacent issues: because instructions and data are mixed in the same token stream, content in the audio itself could be treated as an instruction, whether by accident or by malicious design. OpenAI hinted it may take a similar path with video.
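The new transcription models are used through the same audio-transcription endpoint as Whisper, via the official openai Python SDK. A minimal sketch, assuming an OPENAI_API_KEY in the environment; the helper names (pick_transcribe_model, transcribe) are illustrative, not from the announcement:

```python
# Sketch of transcribing audio with the models from the announcement.
# Model names (gpt-4o-transcribe, gpt-4o-mini-transcribe) are from the release;
# the helper functions here are hypothetical conveniences.

def pick_transcribe_model(prefer_cheap: bool = False) -> str:
    """Choose between the two new transcription models."""
    return "gpt-4o-mini-transcribe" if prefer_cheap else "gpt-4o-transcribe"

def transcribe(path: str, prefer_cheap: bool = False) -> str:
    """Send an audio file to OpenAI's transcription endpoint (network call)."""
    # Imported here so the sketch can be read without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model=pick_transcribe_model(prefer_cheap),
            file=audio,
        )
    return result.text
```

Because the endpoint is shared with Whisper, existing transcription code can reportedly switch models by changing only the `model` string.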
Original img attribution: https://www.maginative.com/content/images/size/w1200/2025/03/GPT-4o-transcribe.jpg



References:
  • AI News | VentureBeat: OpenAI’s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds
  • Analytics Vidhya: OpenAI’s Audio Models: How to Access, Features, Applications, and More
  • Maginative: OpenAI Unveils New Audio Models to Make AI Agents Sound More Human Than Ever
  • bsky.app: I published some notes on OpenAI's new text-to-speech and speech-to-text models.
  • Samrat Man Singh: OpenAI announced some new audio models yesterday, including new transcription models (gpt-4o-transcribe and gpt-4o-mini-transcribe).
  • www.techrepublic.com: The text-to-speech and speech-to-text tools are all based on GPT-4o. OpenAI hinted it may take a similar path with video.
  • MarkTechPost: Reports on OpenAI introducing advanced audio models.
  • Simon Willison's Weblog: OpenAI announced today, for both text-to-speech and speech-to-text. They're very promising new models, but they appear to suffer from the ever-present risk of accidental (or malicious) instruction following.
  • THE DECODER: OpenAI has released a new generation of audio models that let developers customize how their AI assistants speak.
  • venturebeat.com: DeepSeek-V3 now runs at 20 tokens per second on Mac Studio, and that’s a nightmare for OpenAI
  • Last Week in AI: #204 - OpenAI Audio, Rubin GPUs, MCP, Zochi