OpenAI launches GPT-Realtime-2: brings GPT-5 reasoning into voice agents, with context up to 128K

On May 7 (U.S. time), OpenAI unveiled three new Realtime speech models at its developer conference: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. All three are available to developers through the Realtime API. In its official announcement, OpenAI said GPT-Realtime-2 is its first speech model with GPT-5–level reasoning capability: it can reason in real time during voice conversations, call tools, handle corrections, and maintain a natural conversational cadence.

GPT-Realtime-2: context up from 32K to 128K, with five adjustable levels of reasoning strength

Core upgrades for GPT-Realtime-2:

Context window: up from 32K to 128K tokens

Adjustable reasoning strength, in five levels: minimal, low, medium, high, xhigh

Big Bench Audio test: 96.6% at high reasoning strength, versus 81.4% for the prior GPT-Realtime-1.5

Audio MultiChallenge instruction following: 48.5% at xhigh reasoning strength, versus 34.7% for the prior model
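To put the two benchmark figures above on a common footing, the short sketch below computes the gain both in percentage points and relative to the prior score. It uses only the four numbers quoted in this article; the helper name `compare` is illustrative, and the comparison takes the article's pairing of reasoning level and prior model at face value.

```python
def compare(new_pct: float, old_pct: float) -> tuple[float, float]:
    """Return (gain in percentage points, gain relative to the old score in %)."""
    absolute = new_pct - old_pct
    relative = absolute / old_pct * 100
    return absolute, relative


# Scores as quoted in the article: (GPT-Realtime-2, prior model).
benchmarks = {
    "Big Bench Audio": (96.6, 81.4),
    "Audio MultiChallenge": (48.5, 34.7),
}

for name, (new, old) in benchmarks.items():
    absolute, relative = compare(new, old)
    print(f"{name}: +{absolute:.1f} points, +{relative:.1f}% relative")
```

On these figures the point gains are similar (15.2 and 13.8), but the relative gain on Audio MultiChallenge is roughly twice as large because the prior score was much lower.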

The larger context window and adjustable reasoning strength let developers trade off “cheap and fast” against “deep thinking” depending on the scenario. Simple customer service can run in minimal mode to control costs, while complex tasks can switch to xhigh for GPT-5–level reasoning quality.
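One way an application might act on this trade-off is sketched below. The five level names come from the article; everything else is an assumption made for illustration, including the `complexity` score, the even split across levels, and the function name. The sketch deliberately does not show how the chosen level is passed to the Realtime API, since the article does not give the parameter name.

```python
# The five reasoning levels named in the article, ordered cheapest to deepest.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


def pick_reasoning_level(complexity: float) -> str:
    """Map a hypothetical task-complexity score in [0, 1] to a reasoning level.

    The score and the even five-way split are illustrative assumptions,
    not part of any OpenAI API.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be between 0 and 1")
    # Scale to an index and clamp so that 1.0 maps to the last level.
    index = min(int(complexity * len(REASONING_LEVELS)), len(REASONING_LEVELS) - 1)
    return REASONING_LEVELS[index]


# Simple FAQ-style support call versus a multi-step troubleshooting task.
print(pick_reasoning_level(0.1))   # minimal
print(pick_reasoning_level(0.95))  # xhigh
```

In practice the score would come from whatever signal the application already has, such as the intent detected at the start of a call.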

Two specialized models ship alongside it: Translate for cross-language translation and Whisper for real-time transcription

This round’s three new models are split by function:

GPT-Realtime-Translate: real-time multilingual voice translation, supports 70 input languages and 13 output languages

GPT-Realtime-Whisper: low-latency streaming transcription, outputs text as people speak, suitable for live captions, meeting notes, and classroom verbatim transcripts

GPT-Realtime-2: a full dialogue agent that can reason, use tools, and carry out actions

Translate and Whisper target specific speech applications. Translation and transcription are more sensitive to latency and cost than general dialogue, so separate models can be tuned for those metrics.

Pricing: GPT-Realtime-2 costs $32 per million input tokens and $64 per million output tokens

Pricing structure for the three models:

GPT-Realtime-2: $32 per million audio input tokens, $0.40 per million cached input tokens, $64 per million output tokens

GPT-Realtime-Translate: $0.034 per minute

GPT-Realtime-Whisper: $0.017 per minute
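As a rough illustration of how these rates add up, the sketch below estimates a bill from the listed prices. It assumes the "per million" figures are per million tokens and that cached input tokens are billed at the cached rate instead of the regular input rate; the function names and the example token counts are made up for illustration.

```python
# Listed prices from the article, in U.S. dollars.
REALTIME2_INPUT_PER_M = 32.00         # per million input tokens (unit assumed)
REALTIME2_CACHED_INPUT_PER_M = 0.40   # per million cached input tokens
REALTIME2_OUTPUT_PER_M = 64.00        # per million output tokens
TRANSLATE_PER_MINUTE = 0.034
WHISPER_PER_MINUTE = 0.017


def realtime2_cost(input_tokens: int, output_tokens: int,
                   cached_input_tokens: int = 0) -> float:
    """Estimate GPT-Realtime-2 cost; cached tokens are counted separately."""
    return (
        input_tokens * REALTIME2_INPUT_PER_M
        + cached_input_tokens * REALTIME2_CACHED_INPUT_PER_M
        + output_tokens * REALTIME2_OUTPUT_PER_M
    ) / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimate cost for the per-minute models (Translate, Whisper)."""
    return minutes * rate_per_minute


# Hypothetical workload: 200K fresh input, 100K cached input, 50K output tokens.
print(f"Realtime-2: ${realtime2_cost(200_000, 50_000, cached_input_tokens=100_000):.2f}")
print(f"Translate, 60 min: ${per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")
print(f"Whisper, 60 min: ${per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")
```

For that hypothetical workload the estimate is $9.64 for GPT-Realtime-2, while an hour of audio costs $2.04 with Translate and $1.02 with Whisper.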

Developments to watch: how widely GPT-Realtime-2 voice agents are adopted in production, how much the new model cannibalizes OpenAI's existing GPT-4o speech models, and how competitors such as Anthropic and Google respond.

This article, “OpenAI brings GPT-Realtime-2: brings GPT-5 reasoning into voice agents, context upgraded to 128K,” first appeared on LianNews ABMedia.
