System Design: Video Streaming (YouTube/Netflix)


This is the final lesson in the series — and the most infrastructure-heavy system we’ve designed. Video streaming introduces problems you haven’t seen yet: encoding pipelines, adaptive bitrate streaming, and content delivery at a scale that makes everything else look small.


Requirements

Functional

  1. User uploads a video
  2. Video is processed and made available for streaming
  3. Users can stream smoothly on any device
  4. Support multiple quality levels — 360p, 720p, 1080p, 4K
  5. Video resumes from where the user left off
  6. Search for videos
  7. Recommendations

Non-Functional

  1. High availability — videos must always be watchable
  2. Low latency start — video begins playing within 2 seconds
  3. Smooth playback — no buffering
  4. Scale — 500 hours of video uploaded every minute (YouTube scale)
  5. Global reach — users everywhere get the same quality experience
  6. Storage efficiency — petabytes of video stored cost-effectively

Scale Estimation

Videos uploaded per minute: 500 hours
= 500 × 60 = 30,000 minutes of video per minute

Storage per minute of video:
1 minute of raw video ≈ 1 GB (uncompressed)
After encoding ≈ 100 MB per minute for all quality levels combined

Per day:
500 hours/min × 60 min × 24 hours = 720,000 hours uploaded/day
720,000 hours × 60 min/hour × 100 MB/min ≈ 4.3 petabytes of encoded video/day
(Raw uploads, at ~1 GB/min, are ~43 petabytes/day before encoding.)

Video views (YouTube scale):
5 billion views/day = ~58,000 streams/second

What this tells you:

→ Storage and bandwidth dominate the cost; compute for serving requests is secondary
→ 30,000 minutes of new video arriving every minute demands massively parallel processing
→ ~58,000 streams/second cannot be served from origin; delivery has to happen at the edge

The Upload and Processing Pipeline

This is the most distinctive part of video streaming. In every other system we've designed, you store data and retrieve it as-is; video must be transformed before it can be streamed.

Why Raw Video Cannot Be Streamed Directly

User uploads from phone:
→ Shot in 4K at 60fps
→ File size: 4 GB for 10 minutes
→ Format: MOV / MP4 / AVI / MKV
→ Codec: H.265 or various others

Problems:
→ 4 GB file → mobile user waits forever to buffer
→ MOV format → not supported on all browsers
→ One quality level → terrible on slow connections
→ No chapters, thumbnails, or preview sprites

Every uploaded video must go through a processing pipeline before it’s watchable.

The Processing Pipeline

Step 1 — Upload to Object Storage

Client → pre-signed S3 URL → uploads directly to S3
App server never touches the video bytes
S3 handles petabytes of raw uploads
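
Concretely, issuing the pre-signed URL is a few lines on the app server. A minimal sketch using boto3, assuming an AWS S3 bucket; the bucket name, key layout, and expiry are illustrative:

```python
import boto3

s3 = boto3.client("s3")

def create_upload_url(video_id: str) -> str:
    # Time-limited URL the client PUTs the file to directly,
    # so the app server never handles the video bytes.
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "raw-video-uploads", "Key": f"raw/{video_id}/original.mp4"},
        ExpiresIn=3600,  # illustrative: valid for one hour
    )
```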

Step 2 — Trigger Processing

S3 upload complete → S3 event → Kafka "video.uploaded"
→ Video Processing Service picks up event
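
A sketch of the processing service's event loop, assuming the kafka-python client; the event schema and the process_video entry point are hypothetical:

```python
import json
from kafka import KafkaConsumer  # assumes the kafka-python package

def process_video(video_id: str) -> None:
    """Hypothetical entry point: validate, transcode, thumbnail, publish."""
    ...

consumer = KafkaConsumer(
    "video.uploaded",
    bootstrap_servers="localhost:9092",
    group_id="video-processing",  # consumer groups let many workers share the topic
    value_deserializer=lambda v: json.loads(v),
)

for event in consumer:
    # Event schema is an assumption: {"video_id": "...", "s3_key": "..."}
    process_video(event.value["video_id"])
```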

Step 3 — Validation

→ Confirm the upload is a real, playable video (not corrupt or truncated)
→ Check the container format and codec are ones the pipeline supports
→ Enforce size and duration limits

Step 4 — Transcoding ← most important step

Convert one raw video into multiple formats and qualities:
→ 360p  — slow mobile connections
→ 480p  — average mobile
→ 720p  — standard HD
→ 1080p — full HD
→ 4K    — premium users on fast connections
→ Each quality in multiple formats: MP4, WebM
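
Each rung of the ladder is one encode pass. A sketch that shells out to ffmpeg (assuming it is installed); the bitrates and output paths are illustrative, not production encoder settings:

```python
import subprocess

# An illustrative ladder: (label, height, target video bitrate)
QUALITY_LADDER = [
    ("360p", 360, "1M"),
    ("480p", 480, "2.5M"),
    ("720p", 720, "4M"),
    ("1080p", 1080, "8M"),
]

def transcode_all(src: str, video_id: str) -> None:
    for label, height, bitrate in QUALITY_LADDER:
        subprocess.run(
            [
                "ffmpeg", "-i", src,
                "-vf", f"scale=-2:{height}",  # scale to target height, keep aspect ratio
                "-c:v", "libx264", "-b:v", bitrate,
                "-c:a", "aac",
                f"/tmp/{video_id}_{label}.mp4",
            ],
            check=True,  # raise on encoder failure so the job can be retried
        )
```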

Step 5 — Thumbnail Generation

→ Extract frames at regular intervals
→ Generate thumbnail images
→ Generate preview sprite (the tiny previews when hovering the timeline)

Step 6 — Store and Distribute

→ All transcoded files → S3
→ Thumbnails → S3
→ Update video metadata in database → status: "available"
→ CDN pulls processed files from S3
→ Distributes to edge nodes globally

Transcoding at Scale

500 hours uploaded per minute. Each video needs transcoding into 5 quality levels. That’s massive parallel computation.

Transcoding is CPU intensive:
1 minute of video → 5 quality levels → 5-10 minutes of CPU time

500 hours/min uploaded:
= 30,000 minutes of video/min
= 30,000 × 5 qualities
= 150,000 transcoding jobs per minute

Solution — parallel chunk transcoding:

Video uploaded

Kafka "video.uploaded"

Job Scheduler splits video into chunks:
→ Video split into 10-second segments
→ Each segment transcoded independently in parallel
→ Segments reassembled after transcoding

1 hour video = 360 ten-second segments
360 workers transcode simultaneously
→ 1 hour video ready in ~2 minutes instead of hours

This is the same MapReduce principle — split, process in parallel, reassemble.
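
A sketch of split-process-reassemble on a single machine, using Python's process pool; in production the "pool" is a fleet of worker machines behind the job scheduler, and reassembly (e.g. ffmpeg's concat) follows:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

SEGMENT_SECONDS = 10

def transcode_segment(src: str, index: int) -> str:
    # Cut and transcode one 10-second slice, independently of all others
    out = f"/tmp/seg_{index:04d}.mp4"
    subprocess.run(
        [
            "ffmpeg",
            "-ss", str(index * SEGMENT_SECONDS), "-t", str(SEGMENT_SECONDS),
            "-i", src,
            "-c:v", "libx264", "-c:a", "aac",
            out,
        ],
        check=True,
    )
    return out

def transcode_in_parallel(src: str, duration_seconds: int) -> list[str]:
    segments = range(duration_seconds // SEGMENT_SECONDS)
    with ProcessPoolExecutor() as pool:  # one worker per CPU core by default
        return list(pool.map(transcode_segment, [src] * len(segments), segments))
```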


Adaptive Bitrate Streaming (ABR)

This is the technology that makes Netflix and YouTube feel smooth even on variable connections.

The Problem

User on WiFi → 1080p playing perfectly
User switches to mobile data → connection slows
1080p requires 8 Mbps → user only has 2 Mbps
→ Video buffers → terrible experience ❌

The Solution — HLS (HTTP Live Streaming)

Instead of one video file — serve a playlist of small chunks:

master.m3u8 (master playlist):
→ Links to quality-specific playlists

1080p.m3u8:
→ segment001_1080p.ts  (10 seconds)
→ segment002_1080p.ts  (10 seconds)
→ segment003_1080p.ts  (10 seconds)
...

720p.m3u8:
→ segment001_720p.ts   (10 seconds)
→ segment002_720p.ts   (10 seconds)
...

360p.m3u8:
→ segment001_360p.ts   (10 seconds)
...
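
For reference, these playlists are plain text files in the HLS format. A minimal master playlist might look like this (BANDWIDTH values are illustrative bits-per-second hints the player uses to pick a rung):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=1920x1080
1080p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1280x720
720p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=640x360
360p.m3u8
```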

How the player uses this:

Every 10 seconds the player:
→ Measures current download speed
→ Decides which quality to request next

Download speed > 8 Mbps  → request 1080p next segment
Download speed 4–8 Mbps  → request 720p next segment
Download speed < 2 Mbps  → request 360p next segment

User never notices the switch
Player switches seamlessly between qualities
Buffer never empties → smooth playback ✅
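
The selection logic itself is tiny. A sketch mirroring the thresholds above; the 2–4 Mbps band is assumed to map to 480p, and real players also weigh buffer occupancy, not just measured speed:

```python
def pick_quality(measured_mbps: float) -> str:
    # Thresholds mirror the table above; 2-4 Mbps -> 480p is an assumption
    if measured_mbps > 8:
        return "1080p"
    if measured_mbps >= 4:
        return "720p"
    if measured_mbps >= 2:
        return "480p"
    return "360p"
```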

This is why YouTube quality changes smoothly — it’s not one file, it’s thousands of small chunks served adaptively.


Content Delivery Architecture

This is where CDN becomes the entire architecture — not just an add-on.

Without CDN

58,000 concurrent streams (a conservative floor: each view lasts minutes, so true concurrency is far higher)
Each stream at 720p = 4 Mbps
Total bandwidth: 58,000 × 4 Mbps = 232 Gbps
All from your origin servers

→ Impossible to serve from one location
→ Terrible latency for users far away ❌

With CDN

58,000 streams distributed across hundreds of CDN edge nodes
→ User in Chennai served from Chennai edge node
→ User in London served from London edge node
→ Origin servers serve CDN nodes, not individual users
→ Origin bandwidth: fraction of total
→ Latency: minimal everywhere ✅

CDN Caching Strategy for Video

Popular videos (top 10% get 90% of views):
→ Cached at every CDN edge node globally
→ TTL: weeks or months
→ CDN hit rate: ~95%

Long tail videos (rarely watched):
→ Cached only at regional CDN nodes
→ Fetched from origin on first regional request
→ TTL: days

Very old / rarely watched:
→ Not cached at CDN
→ Served directly from S3 on demand
→ Cost optimised — no point caching what nobody watches

Data Model

Video Metadata — Cassandra

videos:
  video_id          UUID
  uploader_id       UUID
  title             text
  description       text
  status            enum (uploading/processing/available/removed)
  duration_seconds  int
  view_count        counter
  like_count        counter
  created_at        timestamp
  tags              list<text>
  category          text

  storage_paths:
    raw             s3://raw/videoId/original.mp4
    360p            s3://processed/videoId/360p.m3u8
    720p            s3://processed/videoId/720p.m3u8
    1080p           s3://processed/videoId/1080p.m3u8
    thumbnail       s3://thumbs/videoId/thumb.jpg

Why Cassandra: massive scale, simple video_id lookups, write-heavy (view counts updating constantly).
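
As actual CQL this needs one adjustment: Cassandra requires counter columns to live in a table of their own, separate from regular columns. A sketch, with the storage paths folded into a map:

```
CREATE TABLE videos (
    video_id         uuid PRIMARY KEY,
    uploader_id      uuid,
    title            text,
    description      text,
    status           text,
    duration_seconds int,
    created_at       timestamp,
    tags             list<text>,
    category         text,
    storage_paths    map<text, text>  -- e.g. '720p' -> 's3://processed/...'
);

-- Counters cannot mix with regular columns, so counts get their own table
CREATE TABLE video_counters (
    video_id   uuid PRIMARY KEY,
    view_count counter,
    like_count counter
);
```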

Watch History and Resume Position — Cassandra

watch_history:
  user_id           UUID   (partition key)
  video_id          UUID
  watched_at        timestamp
  watch_duration    int
  last_position     int    (seconds — for resume)
  completed         boolean

Why Cassandra: billions of watch events per day, time series append-only writes, access pattern is always “give me history for user X.”
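
In CQL, that access pattern drives the primary key: partition by user, cluster by time descending, so "recent history for user X" is one sequential read within a single partition. A sketch:

```
CREATE TABLE watch_history (
    user_id        uuid,
    video_id       uuid,
    watched_at     timestamp,
    watch_duration int,
    last_position  int,      -- seconds, for resume
    completed      boolean,
    PRIMARY KEY ((user_id), watched_at, video_id)
) WITH CLUSTERING ORDER BY (watched_at DESC, video_id ASC);
```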

Search Index — Elasticsearch

Video search is a separate problem entirely. Cassandra cannot do full-text search or relevance ranking. Elasticsearch handles:

→ Full-text search across titles, descriptions, and tags
→ Relevance ranking of results
→ Filters such as category, duration, and upload date
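
A sketch of the query side using the official Python client (elasticsearch-py 8.x assumed); the index name and field boosts are illustrative:

```python
from elasticsearch import Elasticsearch  # assumes the elasticsearch-py package

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="videos",  # illustrative index name
    query={
        "multi_match": {
            "query": "system design interview",
            "fields": ["title^3", "tags^2", "description"],  # boost title and tags
        }
    },
)
# Elasticsearch returns ranked IDs; full details come from Redis/Cassandra
video_ids = [hit["_id"] for hit in resp["hits"]["hits"]]
```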


Recommendations

Collaborative filtering:
"Users who watched video A also watched video B"
→ Recommend B to anyone who just watched A

Content-based filtering:
Tags, category, uploader
→ Recommend similar content

Implementation:

Watch events → Kafka → ML pipeline processes patterns
→ Precomputed recommendations stored in Redis
   Key: "next_videos:{videoId}" → list of video IDs
   TTL: 1 hour → refreshed regularly

When user finishes video:
→ Fetch recommendations from Redis instantly
→ No real-time ML computation in the request path

Pre-computation keeps latency low. You never run ML models live during a user request.
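
A sketch of the Redis side with redis-py; the key naming follows the diagram above, and store_recommendations would be called by the ML pipeline, never in the request path:

```python
import redis  # assumes the redis-py package

r = redis.Redis(decode_responses=True)

def store_recommendations(video_id: str, recs: list[str]) -> None:
    # Pipeline output: replace the precomputed list atomically, then reset the TTL
    key = f"next_videos:{video_id}"
    pipe = r.pipeline()
    pipe.delete(key)
    pipe.rpush(key, *recs)
    pipe.expire(key, 3600)  # one hour, matching the refresh cadence above
    pipe.execute()

def get_recommendations(video_id: str, limit: int = 10) -> list[str]:
    # Request path: a single Redis read, no ML in the hot path
    return r.lrange(f"next_videos:{video_id}", 0, limit - 1)
```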


Resume Feature

User watches 40% of a video, closes app, returns next day:

Every 10 seconds while watching:
→ Client sends heartbeat: { userId, videoId, position: 245 }
→ App server writes to Redis:
   Key: "watch:{userId}:{videoId}"
   Value: 245 (seconds)
   TTL: 90 days

Why not write to the database directly:
→ 10M active users × heartbeat every 10s = 1M writes/second
→ Most positions are overwritten seconds later, so pushing every heartbeat to the database is wasteful
→ Redis absorbs this write rate easily in memory ✅

Async sync to Cassandra:
→ Background job syncs Redis positions to Cassandra every 5 minutes
→ Permanent durable record
→ Redis is the fast layer, Cassandra is the durable layer
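
A sketch of both layers with redis-py; the Cassandra write is left as a placeholder (save_position), since the driver details don't matter here:

```python
import redis  # assumes the redis-py package

r = redis.Redis(decode_responses=True)
NINETY_DAYS = 90 * 24 * 3600

def record_heartbeat(user_id: str, video_id: str, position: int) -> None:
    # Fast layer: one in-memory SET per heartbeat, with a rolling 90-day TTL
    r.set(f"watch:{user_id}:{video_id}", position, ex=NINETY_DAYS)

def sync_to_cassandra(save_position) -> None:
    # Durable layer: background job run every ~5 minutes.
    # save_position(user_id, video_id, seconds) is a hypothetical Cassandra write.
    for key in r.scan_iter(match="watch:*"):
        _, user_id, video_id = key.split(":")
        save_position(user_id, video_id, int(r.get(key)))
```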

Complete Architecture

Upload Path

User selects video

App Server issues pre-signed S3 URL

Client uploads directly to S3 (bypasses app servers)

S3 triggers event → Kafka "video.uploaded"

Video Processing Service:
→ Validates video
→ Splits into 10-second chunks
→ Distributes to Transcoding Workers (100s of them, parallel)
→ Reassembles transcoded segments
→ Generates thumbnails and preview sprites
→ Stores all files to S3
→ Updates status in Cassandra → "available"
→ Notifies uploader via notification system

Stream Path

User clicks play

App Server:
→ Fetch video metadata from Redis cache
→ Cache miss → Cassandra → store in Redis (cache-aside; see the sketch after this walkthrough)
→ Return master.m3u8 URL (CDN URL)

Video player fetches master.m3u8 from CDN

Player measures bandwidth → selects quality

Player fetches 10-second segments from CDN:
→ CDN hit  → served instantly from edge ✅
→ CDN miss → CDN fetches from S3 → caches → serves

Every 10s → player fetches next segment
Every 10s → client sends position heartbeat to Redis

Adaptive quality switching happens transparently
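
The metadata lookup at the top of this path is the classic cache-aside pattern. A sketch with redis-py; the key name, TTL, and fetch_from_cassandra placeholder are illustrative:

```python
import json
import redis  # assumes the redis-py package

r = redis.Redis(decode_responses=True)

def get_video_metadata(video_id: str, fetch_from_cassandra) -> dict:
    # Cache-aside: try Redis, fall back to Cassandra, repopulate the cache
    key = f"video_meta:{video_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    meta = fetch_from_cassandra(video_id)  # hypothetical DB read
    r.set(key, json.dumps(meta), ex=3600)  # illustrative 1-hour TTL
    return meta
```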

Supporting Systems

View counts:
→ Stream start → Kafka "video.viewed"
→ Redis counter incremented
→ Async sync to Cassandra every 60 seconds

Search:
→ Video metadata indexed in Elasticsearch on publish
→ Search queries hit Elasticsearch
→ Returns IDs → details fetched from Redis/Cassandra

Recommendations:
→ Watch events → Kafka → ML pipeline
→ Precomputed results → Redis
→ Served instantly on video end

Connecting Every Lesson

This system uses every concept from the series:

Latency vs Throughput (1): Stream start is latency-sensitive (<2s); transcoding is throughput-sensitive
Scalability (2): Transcoding workers scale horizontally; CDN scales delivery globally
CAP Theorem (3): Streaming → AP (staleness fine); payment → CP (consistency mandatory)
Consistency (4): Watch position → eventual OK; uploader sees their video → read-your-own-writes
Load Balancers (5): Distribute upload and stream requests; balance transcoding worker pool
Caching (6): Redis for metadata, positions, counters, recommendations; CDN at edge
Databases (7): Cassandra for metadata/history; Elasticsearch for search; S3 for files
Message Queues (8): Kafka decouples upload from processing, carries view events, feeds ML
CDN (9): CDN is the streaming architecture — without it this system cannot exist

The Key Insight

The most important principle this system teaches:

The solution is never “find a faster server.” The solution is to never let the bottleneck see the load directly.

Every layer shields the one below it from the full force of the traffic.


What to Practice Next

You’ve covered the theory and design of 14 systems. The next step is practice:

Week 1–2 — Solo practice: Pick any app you use daily. Design it from scratch using the 7-step framework. Time yourself — 45 minutes per system.

Systems to tackle next:

Week 3–4 — Mock interviews: Practice explaining designs out loud. The thinking is right — now build the communication. Record yourself and review.

Go deeper on:


Every system design problem you’ll ever face reduces to the same questions you’ve been answering since Lesson 1:

What are we optimizing for? Where will it break? What’s the right trade-off? What’s the simplest solution that actually works?

You came in knowing the tools. You leave knowing how to think. That’s the difference that matters in interviews and in real engineering.