News from the AI & ML world
Chris McKay@Maginative
//
OpenAI has recently unveiled new audio models based on GPT-4o, significantly enhancing its text-to-speech and speech-to-text capabilities. These new tools are intended to give AI agents a voice, enabling a range of applications, with demonstrations including the ability for an AI to read emails in character. The announcement includes the introduction of new transcription models, specifically gpt-4o-transcribe and gpt-4o-mini-transcribe, which are designed to outperform the existing Whisper model.
The text-to-speech and speech-to-text tools are based on GPT-4o. While these models show promise, some experts have noted potential vulnerabilities. Like other large language model (LLM)-driven multi-modal models, they appear susceptible to prompt-injection-adjacent issues, stemming from the mixing of instructions and data within the same token stream. OpenAI hinted it may take a similar path with video.
ImgSrc: www.maginative.
References :
- AI News | VentureBeat: OpenAI’s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds
- Analytics Vidhya: OpenAI’s Audio Models: How to Access, Features, Applications, and More
- Maginative: OpenAI Unveils New Audio Models to Make AI Agents Sound More Human Than Ever
- bsky.app: I published some notes on OpenAI's new text-to-speech and speech-to-text models.
- Samrat Man Singh: OpenAI announced some new audio models yesterday, including new transcription models( gpt-4o-transcribe and gpt-4o-mini-transcribe ).
- www.techrepublic.com: The text-to-speech and speech-to-text tools are all based on GPT-4o. OpenAI hinted it may take a similar path with video.
- MarkTechPost: Reports on OpenAI introducing advanced audio models.
- Simon Willison's Weblog: OpenAI announced today, for both text-to-speech and speech-to-text. They're very promising new models, but they appear to suffer from the ever-present risk of accidental (or malicious) instruction following.
- THE DECODER: OpenAI has released a new generation of audio models that let developers customize how their AI assistants speak.
- venturebeat.com: DeepSeek-V3 now runs at 20 tokens per second on Mac Studio, and that’s a nightmare for OpenAI
- Last Week in AI: #204 - OpenAI Audio, Rubin GPUs, MCP, Zochi
Classification: