OpenAI restructures WebRTC voice stack: 900M weekly active users, with a Go-written relay at the core

ChainNews ABMedia

OpenAI announced on May 4 the details of the infrastructure underpinning its voice AI, which serves 900 million weekly active users. The team rebuilt its WebRTC stack, replacing the traditional "one port per session" media connection layer with a thin relay written in Go, and centralizing all WebRTC session state in a service called the "transceiver." According to OpenAI's official blog, this architecture also powers ChatGPT's voice mode, the Realtime API, and several research projects. For any team doing voice AI engineering, this is rare technical literature on how voice AI is supported at global scale.

Three technical constraints: traditional WebRTC hits a wall at OpenAI scale

In the article, OpenAI's engineering team points out three limitations that collide with one another at scale:

The traditional media termination method of “one session, one port” is unsuitable for OpenAI infrastructure—when 900 million users may open voice sessions at the same time, a design where each session occupies one port would exhaust network resources

Stateful ICE (Interactive Connectivity Establishment) and DTLS (Datagram Transport Layer Security) sessions require a stable owner—in distributed systems, if session state is shared across multiple services, fault tolerance and migration become extremely complex

Global routing must maintain low first-hop latency—the “natural feel” of voice AI depends on smooth turn-taking, and if the first hop exceeds 100 ms, it becomes noticeably choppy

OpenAI's requirements list is also explicit: global reach (covering 900 million+ users), fast session establishment (users can start talking the moment they connect), and low, stable media round-trip time (with low jitter and packet loss).

The solution: thin Go relay + centralized transceiver service

OpenAI’s architecture is divided into two layers:

Relay layer: written in Go and intentionally kept simple. A typical Go process reads packet headers from sockets, updates a small amount of traffic state, and forwards packets; it does not terminate WebRTC. This is what makes the relay horizontally scalable: because no relay holds full WebRTC state, relays are interchangeable, and the failure of a single relay will not take down a session.
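The relay loop described above can be sketched in a few dozen lines of Go. This is a generic illustration of a thin UDP forwarder under my own assumptions (function names, the forwarding-table layout), not OpenAI's implementation: the relay keeps only a source-to-destination table and never touches ICE, DTLS, or SRTP.

```go
// Minimal "thin relay" sketch: read a packet, look up where it goes,
// forward it. No WebRTC termination, no media state.
package main

import (
	"fmt"
	"net"
)

// relayOnce sends payload from a "client" socket through the relay and
// returns what the "backend" socket receives.
func relayOnce(payload string) string {
	lo := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)} // port 0: any free port
	client, err := net.ListenUDP("udp", lo)
	if err != nil {
		panic(err)
	}
	defer client.Close()
	backend, err := net.ListenUDP("udp", lo)
	if err != nil {
		panic(err)
	}
	defer backend.Close()
	relay, err := net.ListenUDP("udp", lo)
	if err != nil {
		panic(err)
	}
	defer relay.Close()

	// The relay's entire state: packet source -> forwarding destination.
	forward := map[string]*net.UDPAddr{
		client.LocalAddr().String(): backend.LocalAddr().(*net.UDPAddr),
	}

	go func() { // relay loop: read, look up, forward
		buf := make([]byte, 1500)
		n, src, err := relay.ReadFromUDP(buf)
		if err != nil {
			return
		}
		if dst, ok := forward[src.String()]; ok {
			relay.WriteToUDP(buf[:n], dst)
		}
	}()

	if _, err := client.WriteToUDP([]byte(payload), relay.LocalAddr().(*net.UDPAddr)); err != nil {
		panic(err)
	}
	buf := make([]byte, 1500)
	n, _, err := backend.ReadFromUDP(buf)
	if err != nil {
		panic(err)
	}
	return string(buf[:n])
}

func main() {
	fmt.Println(relayOnce("opus-frame"))
}
```

Because the forwarding table is the only per-session state, losing a relay loses nothing that cannot be rebuilt, which is the property the article attributes to this layer.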

Transceiver layer: the only service that owns WebRTC session state, including ICE connectivity checks, DTLS handshakes, SRTP encryption keys, and session lifecycle management. Concentrating this state in a single service makes session ownership easy to reason about, and lets backend services scale like ordinary services without each having to be a WebRTC peer.
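The ownership split can be sketched as a data model: one service holds all per-session state and drives it through its lifecycle. The field and state names below are illustrative assumptions on my part; the real transceiver's data model is not public.

```go
// Sketch: the transceiver exclusively owns per-session WebRTC state;
// relays never see any of this.
package main

import "fmt"

type SessionState int

const (
	Gathering   SessionState = iota // ICE candidate gathering and checks
	Handshaking                     // DTLS handshake in progress
	Active                          // SRTP keys derived, media flowing
	Closed
)

// Session bundles everything stateful about one WebRTC connection.
type Session struct {
	ID      string
	State   SessionState
	SRTPKey []byte // derived from the DTLS handshake
}

// advance steps the lifecycle; a real implementation would drive this
// from ICE and DTLS events rather than direct calls.
func (s *Session) advance() SessionState {
	if s.State < Closed {
		s.State++
	}
	return s.State
}

func main() {
	s := &Session{ID: "sess-1", State: Gathering}
	s.advance() // -> Handshaking
	s.advance() // -> Active
	fmt.Println(s.State == Active)
}
```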

The clever part of this design is that it strictly separates “the stateful parts” from “the stateless parts.” Relay is a stateless data plane (replicable at scale), while transceiver is a stateful control plane (few in number, but state-complete). This layering also lets OpenAI scale horizontally as needed without worrying about an explosion in WebRTC connection counts.

Why Go: the decision logic for voice AI engineering

OpenAI’s article clearly explains that relay is written in Go and intentionally kept narrow in scope. The engineering logic behind this choice:

Go’s goroutines provide native support for IO-bound tasks such as “handling tens of thousands of connections per port,” without needing a complex event loop

The standard library’s net package can directly read UDP packets, without binding to C libraries

Compilation produces a single static binary that is easy to deploy, containerize, and run across architectures (x86/ARM)

Go’s memory management is friendly to “large numbers of short-lived objects” (each packet needs parsing), with controllable GC pause times

This also explains why Go's adoption keeps rising in modern infrastructure layers (Cloudflare, Tailscale, HashiCorp, etc.): not because Go is inherently better than other languages, but because for IO-bound, horizontally scalable infrastructure, Go is the most straightforward language to write it in.

Cloudflare’s counterpart: the Realtime Voice AI battlefield

In the same period (early May), Cloudflare also published a technical blog post, "Cloudflare is the best place to build real-time voice agents," presenting its own approach alongside OpenAI's. The two approaches diverge:

OpenAI: build its own WebRTC relay/transceiver stack, not relying on third parties, bringing the media layer into its own technology stack

Cloudflare: treat the WebRTC media service as an extension of its Workers platform, giving developers “one-stop” infrastructure

For small and mid-sized AI application teams, the Cloudflare route is more practical: developers can call APIs directly without building WebRTC infrastructure. For teams at OpenAI's scale, building in-house is the necessary path, since no external service's SLA, billing structure, or geographic distribution can match its needs exactly.

Next observations: transceiver open-sourcing, Realtime API scale, competitor responses

Key focus areas for the next phase:

Whether OpenAI will open-source the transceiver/relay parts—competitors such as Anthropic, Google, and xAI are all building their own voice stacks; if OpenAI open-sources, it could become an industry standard

Realtime API pricing and scale: it is currently subsidized mainly by ChatGPT subscription revenue; if API revenue grows, will it become an independent product line?

Corresponding moves from Anthropic and Google—Claude and Gemini already support voice, but compared with OpenAI, they still lag in latency and scale; will this technical disclosure accelerate their engineering follow-through?

For Taiwan’s AI application developers, voice AI is a key battlefield in the second half of 2026—scenarios like customer service, education, in-car, and IoT all need low-latency conversation. OpenAI’s engineering disclosure this time is one of the most important references for deciding whether to build your own voice stack or use third-party APIs.

This article, OpenAI rebuilds its WebRTC voice stack: 900M weekly active users, with Go-written relay as the core, first appeared on 鏈新聞 ABMedia.
