Netflix VOD is a solved problem. Content is encoded ahead of time, cached on edge servers, and served from the closest node. You press play, it works.
Live streaming is an entirely different beast. Every video segment must be encoded, packaged, and delivered to millions of viewers within seconds. There's no "ahead of time." There's only "right now."
When Netflix introduced live streaming for events like the Tyson vs. Paul fight (65 million concurrent streams), they built a custom system called Live Origin to sit between their cloud encoding pipelines and Open Connect, their CDN. Here's how it works.
The Live Origin operates as a multi-tenant microservice on Amazon EC2 instances. The communication model is straightforward:
Encoder → Packager → [HTTP PUT] → Live Origin → [HTTP GET] → Open Connect → Viewers
Two architectural decisions shaped everything:
Redundant pipelines. Netflix runs two independent encoding pipelines simultaneously, each in a different cloud region with its own encoder, packager, and video feed. If one produces a bad segment, the other typically produces a good one.
Predictable segment templates. Instead of constantly updating a manifest to list available segments, each segment has a fixed duration of 2 seconds. The Origin can predict exactly when each segment should be published.
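Under a fixed-duration template, segment availability is pure arithmetic. A minimal Python sketch of the idea (function names are mine, not Netflix's):

```python
SEGMENT_DURATION_S = 2  # fixed segment duration from the template

def expected_publish_time(stream_start: float, segment_index: int) -> float:
    """A segment becomes available once its 2-second window has been encoded."""
    return stream_start + (segment_index + 1) * SEGMENT_DURATION_S

def live_edge_index(stream_start: float, now: float) -> int:
    """Index of the most recent segment that should already be published."""
    return int((now - stream_start) // SEGMENT_DURATION_S) - 1
```

With this, every node in the delivery path can agree on what should exist at any instant without consulting a manifest.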
Live video feeds inevitably produce defects — short segments with missing frames, missing segments entirely, or timing discontinuities. Two independent pipelines in different regions substantially reduce the chance that both produce defective segments at the same time.
When Open Connect requests a segment, the Origin:
Pipeline A (us-east-1): segment_1042 → ✓ valid
Pipeline B (us-west-2): segment_1042 → ✗ missing frames
Origin decision: serve Pipeline A's segment
In the rare case where both are defective, the defect metadata passes downstream so clients can handle the error gracefully. No silent corruption.
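The selection logic above can be sketched in Python (the `Segment` type and its fields are illustrative, not Netflix's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    data: bytes
    defective: bool = False
    defect_reason: Optional[str] = None

def select_segment(a: Optional[Segment], b: Optional[Segment]):
    """Prefer any healthy copy; if both are bad, surface defect metadata."""
    for seg in (a, b):
        if seg is not None and not seg.defective:
            return seg, None
    # Both defective or missing: pass the defects downstream rather than
    # serving silent corruption.
    reasons = [s.defect_reason if s else "missing" for s in (a, b)]
    return a or b, reasons
```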
Open Connect was built for VOD — years of tuning nginx for pre-positioned content. Live streaming didn't fit that model, so Netflix extended nginx's proxy-caching with several optimizations:
Reject invalid requests early. Open Connect nodes know the segment templates. If a request asks for a segment outside the legitimate range, it's rejected immediately without hitting the Origin.
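With the template in hand, this validation is a bounds check. A sketch (the `lookahead` allowance for slightly-early requests is my assumption):

```python
def is_valid_request(segment_index: int, live_edge: int, lookahead: int = 1) -> bool:
    """Reject indexes outside the legitimate range before they reach the Origin."""
    return 0 <= segment_index <= live_edge + lookahead
```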
Cache 404s intelligently. When a segment isn't available yet, the Origin returns a 404 with an expiration policy. Open Connect caches this 404 until just before the segment is expected, preventing repeated failed requests.
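The 404 expiry falls out of the same template arithmetic. A sketch (the safety margin is an assumed tuning knob, not a published Netflix value):

```python
def not_found_ttl(now: float, expected_publish: float, margin: float = 0.25) -> float:
    """How long to cache a 404: until just before the segment should appear."""
    return max(0.0, expected_publish - now - margin)
```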
Hold requests at the live edge. When a request arrives for the next segment that's about to be published, instead of returning a 404 that propagates back to the client, the Origin holds the request open. Once the segment is published, it responds immediately. This eliminates a full round-trip for requests that arrive slightly early.
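The actual hold behavior lives inside Netflix's nginx extension; as a rough model of the idea, here is an asyncio sketch (class and method names are mine):

```python
import asyncio

class LiveEdgeHold:
    """Park requests for an about-to-publish segment instead of returning 404."""
    def __init__(self):
        self._events: dict[int, asyncio.Event] = {}

    async def wait_for(self, index: int, timeout: float) -> bool:
        # Hold the request open until the segment is published or we time out.
        ev = self._events.setdefault(index, asyncio.Event())
        try:
            await asyncio.wait_for(ev.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False

    def publish(self, index: int) -> None:
        # Wake every request parked on this segment immediately.
        self._events.setdefault(index, asyncio.Event()).set()
```

A held request that resolves this way saves the client a full round-trip it would otherwise spend retrying.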
Millisecond-grain caching. Standard HTTP Cache-Control works at second granularity — too coarse when segments are generated every 2 seconds. Netflix added millisecond-precision caching to nginx.
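To illustrate why whole-second granularity is too coarse, here is a toy cache with millisecond-precision expiry (this models the idea, not Netflix's actual nginx patch):

```python
import time

class MsCache:
    """Cache with millisecond-precision expiry. Whole-second max-age loses up
    to a second of freshness per entry, which is huge on a 2-second cadence."""
    def __init__(self):
        self._entries = {}  # key -> (value, expiry as float epoch seconds)

    def put(self, key, value, ttl_ms: int) -> None:
        self._entries[key] = (value, time.time() + ttl_ms / 1000.0)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.time() >= expiry:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return value
```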
Netflix uses custom HTTP headers to broadcast live streaming events at scale. The encoding pipeline sends notifications to the Origin, which inserts them as headers on subsequent segments. These headers are cumulative — they persist to future segments.
When a segment arrives at an Open Connect node, notifications are extracted from response headers and stored in memory. When serving to clients, the latest notification data is attached, regardless of where the viewer is in the stream.
This efficiently communicates ad breaks, content warnings, or live event updates to millions of devices, independent of playback position. No separate notification channel needed.
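The cumulative-header mechanics can be modeled like this (the `X-Live-Event-*` prefix is a made-up stand-in for whatever header names Netflix actually uses):

```python
class NotificationCache:
    """Latest event notifications extracted from segment response headers."""
    def __init__(self):
        self._latest: dict[str, str] = {}

    def ingest(self, headers: dict[str, str]) -> None:
        # Cumulative: newer notification headers overwrite older values,
        # everything else persists.
        for k, v in headers.items():
            if k.lower().startswith("x-live-event-"):
                self._latest[k.lower()] = v

    def attach(self, headers: dict[str, str]) -> dict[str, str]:
        # Merge the latest notifications into an outgoing response,
        # regardless of where in the stream the viewer is.
        return {**headers, **self._latest}
```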
Netflix initially used AWS S3, similar to their VOD infrastructure. It worked for low-traffic events. Then they discovered live streaming's unique requirements.
S3 met its uptime guarantees, but the strict timing budget meant any delay was catastrophic. The requirements were closer to a global, low-latency, highly available database than object storage.
Netflix built on their Key-Value Storage Abstraction using Apache Cassandra.
The results: median latency dropped from 113ms to 25ms. P99 latency improved from 267ms to 129ms.
But there was still the Origin Storm problem — dozens of top-tier caches simultaneously requesting large segments. Worst-case read throughput could reach 100+ Gbps. Concurrent reads degraded write performance unacceptably.
The solution: write-through caching with EVCache (Netflix's distributed caching layer based on Memcached). Almost all reads are served from cache at 200+ Gbps throughput without affecting the write path. Only cache misses hit Cassandra.
Write path: Packager → Origin → KeyValue → Cassandra (+ EVCache write-through)
Read path: Open Connect → Origin → EVCache (hit) → response
→ Cassandra (miss) → EVCache → response
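The write-through pattern above, reduced to its essentials (plain dicts stand in for Cassandra and EVCache; this is a model of the pattern, not Netflix's code):

```python
class WriteThroughOrigin:
    """Writes land in both the datastore and the cache; reads hit the cache first."""
    def __init__(self, store: dict, cache: dict):
        self.store, self.cache = store, cache

    def put(self, key, value) -> None:
        self.store[key] = value   # durable write (stand-in for Cassandra)
        self.cache[key] = value   # write-through (stand-in for EVCache)

    def get(self, key):
        if key in self.cache:     # almost all reads stop here
            return self.cache[key]
        value = self.store.get(key)  # cache miss falls back to the datastore
        if value is not None:
            self.cache[key] = value  # repopulate so the next read is a hit
        return value
```

Because writes populate the cache up front, the read storm never contends with the write path in the datastore.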
Not all requests have the same importance: live-edge requests from viewers at the head of the stream matter more than DVR requests for older segments. Netflix implemented complete path isolation between the two classes.
During stress, priority-based rate limiting kicks in. Live edge traffic gets prioritized over DVR traffic. Detection uses the predictable segment template cached in memory — no datastore access needed, which is especially valuable during datastore stress.
When low-priority traffic must be shed during a surge, the Origin returns HTTP 503 with max-age=5, telling Open Connect to cache the rejection for 5 seconds. This dampens traffic storms without manual intervention.
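Both behaviors, template-based classification and the cacheable 503, fit in a few lines. A sketch (the classification threshold and the `dvr`/`live-edge` labels are mine):

```python
def classify(segment_index: int, live_edge: int) -> str:
    """Classify a request using only the in-memory segment template."""
    return "live-edge" if segment_index >= live_edge - 1 else "dvr"

def handle(segment_index: int, live_edge: int, under_stress: bool):
    if under_stress and classify(segment_index, live_edge) == "dvr":
        # Cacheable rejection: Open Connect holds the 503 for 5 seconds,
        # so retries die at the edge instead of hammering the Origin.
        return 503, {"Cache-Control": "max-age=5"}
    return 200, {}
```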
Traffic storms from requests for non-existent segments are handled through hierarchical metadata. Event- and rendition-level metadata is served from the Origin's in-memory cache; during 404 storms, the control-plane cache hit ratio exceeds 90%, so the datastore sees no added stress.
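The hierarchical short-circuit can be sketched as follows (the data shapes are illustrative; bogus events and renditions die in the in-memory control-plane cache without touching the datastore):

```python
def lookup(event_cache: dict, segment_store: dict,
           event_id: str, rendition: str, index: int):
    """Resolve event -> rendition -> segment, cheapest checks first."""
    event = event_cache.get(event_id)           # in-memory control-plane data
    if event is None or rendition not in event["renditions"]:
        return 404                              # unknown event or rendition
    if index > event["live_edge"]:
        return 404                              # beyond the template's range
    return segment_store.get((event_id, rendition, index), 404)
```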
Netflix's Live Origin succeeds because it respects the fundamental constraint of live streaming: time is the budget. Every design decision — redundant pipelines, predictive templates, held requests, write-through caching, priority rate limiting — exists to protect the 2-second window between encoding and delivery.
At 65 million concurrent streams during the Tyson vs. Paul fight, the system held. Not because it was clever, but because it was deliberately designed around its failure modes.
Design for the 2-second budget. Everything else follows.
— blanho