Voice assistants have always felt a little clunky.
OpenAI just dropped three models that might finally change that.
Three Models, One Big Upgrade
On Thursday, May 7, 2026, OpenAI announced three new voice models for its Realtime API.
Each one handles a different job. Together, they’re designed to turn voice interfaces from simple question-and-answer tools into systems that can actually get work done during a live conversation.

Here’s the lineup:
GPT-Realtime-2 – A voice model with GPT-5-class reasoning. It doesn’t just listen and respond. It thinks through complex requests, calls tools, handles interruptions, and keeps the conversation flowing naturally.
GPT-Realtime-Translate – A live translation model that converts speech from over 70 input languages into 13 output languages in real time. It keeps pace with the speaker, even when they switch topics or use regional accents.
GPT-Realtime-Whisper – A streaming speech-to-text model built for ultra-low latency. It transcribes what you say while you’re still saying it.
“Together, the models we are launching move real-time audio from simple call-and-response toward voice interfaces that can actually do work,” OpenAI said.
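For developers, the entry point is the Realtime API's event-based protocol. The sketch below builds a session-configuration message for the new model; the model name comes from the announcement, while the "session.update" event shape is borrowed from the existing Realtime API and may differ in the final release.

```python
import json

def build_session_update(instructions: str) -> str:
    # Configure a Realtime API session. The model name is from the
    # announcement; the event structure follows the existing Realtime
    # API's "session.update" convention (an assumption, not confirmed).
    event = {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "modalities": ["audio", "text"],
            "instructions": instructions,
        },
    }
    return json.dumps(event)

payload = build_session_update("You are a concise support agent.")
```

In practice this JSON would be sent over the API's WebSocket connection once the session is open.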
What Makes GPT-Realtime-2 Different
Previous voice models from OpenAI could hold a conversation. But they struggled with longer, more complicated tasks. GPT-Realtime-2 changes that in several key ways.
First, the context window jumped from 32K to 128K tokens. That means the model can hold onto far more information during a conversation – crucial for customer support calls or detailed planning sessions.
Second, the model now uses “preambles.” Instead of going silent while it works on a request, it says things like “let me check that” or “one moment while I look into it.” Small detail, big difference. It makes the AI feel less robotic and more like an actual conversation partner.
Third, GPT-Realtime-2 can call multiple tools at the same time and tell the user what it’s doing — “checking your calendar” or “looking that up now.” If something goes wrong, it recovers gracefully instead of crashing the conversation.
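Parallel tool calls start with the tools a developer declares on the session. The declarations below are purely illustrative: the helper names (`check_calendar`, `search_orders`) are hypothetical, and the schema mirrors OpenAI's existing function-calling format rather than anything confirmed for these models.

```python
import json

# Hypothetical tool declarations for a voice agent. The schema mirrors
# the function-calling format of OpenAI's existing APIs; the exact
# Realtime fields are an assumption.
tools = [
    {
        "type": "function",
        "name": "check_calendar",  # hypothetical helper
        "description": "Look up the user's calendar for a given day.",
        "parameters": {
            "type": "object",
            "properties": {"date": {"type": "string"}},
            "required": ["date"],
        },
    },
    {
        "type": "function",
        "name": "search_orders",  # hypothetical helper
        "description": "Find recent orders for the current customer.",
        "parameters": {"type": "object", "properties": {}},
    },
]

session_update = json.dumps(
    {"type": "session.update", "session": {"tools": tools}}
)
```

With both tools registered, the model can invoke them concurrently mid-conversation while narrating what it is doing.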
The Benchmarks Back It Up
On Big Bench Audio, which tests reasoning capabilities in voice models, GPT-Realtime-2 scored 96.6% accuracy at the high reasoning level.
That’s a significant jump from the 81.4% scored by its predecessor, GPT-Realtime-1.5. On Audio MultiChallenge, which measures instruction-following in multi-turn conversations, the new model hit 48.5% compared to 34.7%.
Developers can also dial the reasoning level up or down – from minimal to xhigh – depending on whether speed or depth matters more for their use case.
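A minimal sketch of that dial, assuming a `reasoning_effort` field (the name is borrowed from OpenAI's text-model APIs and is an assumption; only the "minimal" and "xhigh" endpoints are named in the announcement, the intermediate levels are guesses):

```python
# Levels between "minimal" and "xhigh" are assumed for illustration.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_with_reasoning(level: str) -> dict:
    # Validate the level, then attach it to a session-update event.
    # The "reasoning_effort" key is an assumption; check the Realtime
    # API docs for the actual parameter name.
    if level not in REASONING_LEVELS:
        raise ValueError(f"unknown reasoning level: {level}")
    return {
        "type": "session.update",
        "session": {"model": "gpt-realtime-2", "reasoning_effort": level},
    }
```

A latency-sensitive voice front end might pin this to "minimal", while a planning agent would run at "xhigh".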
Live Translation That Actually Keeps Up
GPT-Realtime-Translate tackles one of the trickiest problems in voice AI: real-time multilingual conversation.
The model is designed to preserve meaning and context even when speakers talk fast, change topics mid-sentence, or use industry-specific vocabulary.
Companies like Deutsche Telekom are already exploring the model for cross-language customer interactions. For businesses serving global audiences, this could eliminate the need for separate multilingual support teams, or at least make them far more efficient.
Who’s Going to Use This?
Customer service is the obvious answer. Companies can now build voice agents that handle complex queries, troubleshoot issues, and take action – all within a single phone call.
But OpenAI sees the use cases going much wider. Education platforms could use live transcription for classroom accessibility. Media companies could add real-time captions to live broadcasts.
Event organizers could offer instant translation for international conferences. Creator platforms could use voice-driven workflows to speed up production.
Zillow, for example, has already been testing GPT-Realtime-2 for voice interactions with customers and reported improvements in call success rates.
What About Misuse?
Powerful voice AI raises obvious concerns. Could someone use these tools for spam calls? Fraud? Impersonation?
OpenAI says it has built active classifiers into the system that can detect and halt harmful content in real time. Conversations that violate content guidelines can be stopped automatically. The company also provides developers with additional safety tools to layer on their own guardrails.
Whether those protections hold up at scale remains to be seen. Voice-based scams are already a growing problem, and tools this capable could raise the stakes significantly.
What It Costs
All three models are available now through the Realtime API. Here’s the pricing breakdown:
GPT-Realtime-2 costs $32 per million audio input tokens ($0.40 per million for cached inputs) and $64 per million audio output tokens. GPT-Realtime-Translate runs $0.034 per minute. GPT-Realtime-Whisper costs $0.017 per minute.
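A back-of-envelope estimate using the published per-token prices; the token counts for a 10-minute call are assumed for illustration, since real audio token usage varies with encoding and speaking rate:

```python
# Published GPT-Realtime-2 prices (dollars per 1M audio tokens).
INPUT_PER_M = 32.00
OUTPUT_PER_M = 64.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a call given audio token counts."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Assumed usage for a ~10-minute call: 8,000 input and 4,000 output tokens.
cost = call_cost(8_000, 4_000)
# 0.008 * $32 + 0.004 * $64 = $0.256 + $0.256 = $0.512
```

For comparison, ten minutes of GPT-Realtime-Translate at $0.034 per minute would run $0.34 flat, regardless of token counts.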
Developers can test all three models in the OpenAI Playground.
Why This Matters Beyond Developers
Right now, these tools are aimed at developers building apps on OpenAI’s platform. Regular ChatGPT users won’t see these models directly – OpenAI says it’s still working on upgrading the consumer voice experience.
But the ripple effects will show up fast. Every customer service bot, translation app, and transcription tool built on these models will feel more natural, more responsive, and more capable.
Voice AI has always promised to change how we interact with technology. With GPT-Realtime-2, that promise just got a lot more concrete.

