System Design: URL Shortener


Introduction

A URL shortener is the perfect first system design problem. Simple enough to understand fully, deep enough to touch every concept — scaling, caching, databases, availability, and algorithm design.

This post follows a universal framework applicable to every system design problem.


The Framework

  1. Understand the requirements
  2. Estimate the scale
  3. Design the API
  4. Design the data model
  5. High-level architecture
  6. Deep dive into bottlenecks
  7. Handle failures

Step 1: Requirements

Functional:

Shorten a long URL to a unique slug
Redirect a slug to its original long URL
Support custom slugs and optional expiry

Non-functional:

Low-latency redirects (the lookup sits on every click)
High availability: a failed redirect is a broken link
Scale: ~100M new URLs/day, ~10B redirects/day


Step 2: Scale Estimation

Numbers drive architecture decisions.

Writes (URL creation):

100 million new URLs/day
100,000,000 / 86,400 = ~1,200 writes/second

Reads (redirects):

Assume 100:1 read-to-write ratio
10 billion redirects/day
10,000,000,000 / 86,400 = ~115,000 reads/second

Storage:

~500 bytes per URL record
100M URLs/day × 365 days × 10 years = 365B records
365B × 500 bytes = ~180 TB over 10 years

Summary:

~1,200 writes/second
~115,000 reads/second (heavily read-skewed, 100:1)
~180 TB of URL data over 10 years
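
A minimal Python sketch of the same back-of-envelope arithmetic, so the assumed inputs above are easy to tweak and re-run:

SECONDS_PER_DAY = 86_400

# Assumed inputs from the estimates above
new_urls_per_day = 100_000_000
read_write_ratio = 100
bytes_per_record = 500
retention_years = 10

writes_per_second = new_urls_per_day / SECONDS_PER_DAY           # ~1,157
reads_per_second = writes_per_second * read_write_ratio          # ~115,700
total_records = new_urls_per_day * 365 * retention_years         # 365 billion
storage_bytes = total_records * bytes_per_record                 # ~182.5 TB

print(f"{writes_per_second:,.0f} writes/s, {reads_per_second:,.0f} reads/s")
print(f"~{storage_bytes / 1e12:.1f} TB over {retention_years} years")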


Step 3: API Design

Create short URL:

POST /api/urls
{ longUrl: "...", customSlug: "mylink", expiryDays: 30 }

Response:
{ shortUrl: "https://bit.ly/abc123", slug: "abc123", expiresAt: "..." }

Redirect:

GET /{slug}
→ HTTP 301 or 302 redirect to long URL

301 vs 302 — matters more than it looks:

                               301 (Permanent)          302 (Temporary)
Browser caches redirect        Yes — forever            No
Every click hits your server   No                       Yes
Analytics tracking             Lost after first click   Every click tracked

If you need click analytics → use 302. If you want to minimise server load → use 301.
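
As a quick sketch (Python with Flask, an assumed choice here), a redirect endpoint returning 302 so every click stays observable; resolve_slug is a hypothetical lookup that Step 6 fleshes out:

from flask import Flask, abort, redirect

app = Flask(__name__)

def resolve_slug(slug: str) -> str | None:
    # Hypothetical lookup: Redis first, then Cassandra (see Step 6).
    return None  # placeholder

@app.route("/<slug>")
def follow(slug: str):
    long_url = resolve_slug(slug)
    if long_url is None:
        abort(404)
    # 302: browsers do not cache it, so every click reaches the server
    # and can be counted. Switch to 301 only to shed load.
    return redirect(long_url, code=302)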


Step 4: Data Model

URLs table:

id          → unique identifier (primary key)
slug        → "abc123" (indexed, unique)
long_url    → original URL
created_at  → timestamp
expires_at  → nullable
user_id     → nullable
click_count → integer

Which database?

Apply the access pattern framework:

Cassandra for URL data (massive scale, simple key lookup, AP system)

PostgreSQL for user accounts (relational, ACID for payments/subscriptions)

This is polyglot persistence in practice.


Step 5: Slug Generation

This is the unique algorithmic challenge. Three approaches:

Approach 1: Random String

Generate random 6-7 character base62 string.

Base62 = a-z + A-Z + 0-9 = 62 characters
7 chars = 62^7 = 3.5 trillion combinations ✅

Problem: collisions are possible. Need to check DB on every generation — expensive at scale.

Approach 2: MD5/SHA256 Hash

Hash the long URL, take first 7 characters.

Problem: different URLs can produce same first 7 chars. Same long URL always produces same hash — two users shortening the same URL get the same slug, breaking per-user analytics.
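
A small illustration of that determinism, truncating an MD5 hex digest as one of the hash options mentioned above:

import hashlib

def hash_slug(long_url: str, length: int = 7) -> str:
    return hashlib.md5(long_url.encode()).hexdigest()[:length]

a = hash_slug("https://example.com/some/long/path")
b = hash_slug("https://example.com/some/long/path")
assert a == b  # two users shortening the same URL end up with one shared slug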

Approach 3: Auto-Increment ID + Base62 Encoding ✅

Every new URL gets an auto-incrementing ID. Encode that ID in base62.

URL 1       → ID: 1          → slug: "1"
URL 1000000 → ID: 1000000    → slug: "4c92"
URL 10B     → ID: 10000000000 → slug: "aUKYOA" (6 chars)

Why this wins:

No collisions: every ID is unique by construction
No database check before inserting (unlike random slugs)
Slugs stay short: 7 base62 characters cover ~3.5 trillion IDs
Encoding is trivial to implement and to reverse

One downside: sequential IDs expose how many URLs have been created. Fix: add a random offset or shuffle the base62 alphabet privately.
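
A sketch of both ideas in Python, assuming the 0-9a-zA-Z alphabet implied by the "4c92" example above; the offset value is made up for illustration:

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Private constant so slugs do not reveal how many URLs exist (illustrative value).
OFFSET = 862_443_907

def encode_base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def slug_for_id(url_id: int) -> str:
    return encode_base62(url_id + OFFSET)

assert encode_base62(1_000_000) == "4c92"  # matches the example above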


Step 6: Full Architecture

                       [User]

                  [Load Balancer]
                  (Active-Active)
                  ↓           ↓
           [App Server 1] [App Server 2]

           [Redis Cache]
           (Cache Aside, 24hr TTL)
                  ↓ (cache miss only)
           [Cassandra Cluster]
           (slug → long URL, replicated)

           [PostgreSQL]
           (user accounts, custom domains)

           [Analytics Service]
           ← Kafka topic: "url.clicked"
           (async — never slows the redirect)

Create URL flow:

POST /api/urls
→ Load Balancer → App Server
→ Generate auto-increment ID
→ Encode to Base62 slug
→ Save to Cassandra
→ Return short URL

Redirect flow:

GET /abc123
→ Load Balancer → App Server
→ Check Redis cache for "abc123"
→ Cache hit → redirect instantly ✅
→ Cache miss → lookup Cassandra
             → store in Redis (24hr TTL)
             → redirect user

Why the Cache Layer Is Critical Here

At 115,000 reads/second, popular slugs get hit millions of times. Without a cache, Cassandra absorbs all of that traffic unnecessarily.

With Redis (Cache Aside):

Hot slugs are served straight from memory
Cassandra only sees cache misses
The 24hr TTL bounds memory use and staleness
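
A minimal cache-aside sketch of that lookup using redis-py; fetch_from_cassandra is a hypothetical stand-in for the database read:

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24hr TTL from the diagram

def fetch_from_cassandra(slug: str) -> str | None:
    # Hypothetical: SELECT long_url FROM urls WHERE slug = ?
    return None  # placeholder

def resolve_slug(slug: str) -> str | None:
    key = f"url:{slug}"
    cached = cache.get(key)
    if cached is not None:                  # cache hit: redirect instantly
        return cached
    long_url = fetch_from_cassandra(slug)   # cache miss: fall through to Cassandra
    if long_url is not None:
        cache.set(key, long_url, ex=CACHE_TTL_SECONDS)
    return long_url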


Step 7: Failure Handling

Redis goes down:

All reads fall through to Cassandra
System stays alive — degraded performance only
Redis restarts → cache warms up gradually

App server goes down:

Load Balancer health checks detect failure
Stops routing to failed server
Auto-scaling spins up replacement

Cassandra node goes down:

Data replicated across nodes (replication factor 3)
Other nodes serve the data — no downtime
No data loss

URL expiry:

Check expires_at on every read
If expired → return 404
Background job runs daily → deletes expired records
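
A small sketch of the read-side expiry check, assuming expires_at is stored as a UTC timestamp; the daily delete job is only summarised in the comment:

from datetime import datetime, timezone

def is_expired(expires_at: datetime | None) -> bool:
    # A NULL expires_at means the link never expires.
    return expires_at is not None and expires_at <= datetime.now(timezone.utc)

# On a redirect: if is_expired(record.expires_at), return 404 instead of redirecting.
# The daily background job then deletes rows whose expires_at has passed.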

Extension: Adding Analytics Dashboard

The ask: show URL creators click counts, countries, devices, over time.

Architecture:

User clicks short URL
→ Redirect happens instantly (critical path, unchanged)
→ Simultaneously publish to Kafka topic "url.clicked"
  { slug, userId, country, device, timestamp }

Analytics pipeline:
Kafka → Stream Processor (Flink/Spark Streaming)
      → Aggregates clicks per slug/country/device/hour
      → Stores in ClickHouse (columnar, fast aggregations)

Dashboard API:
→ Check Redis: "analytics:user456:last30days" (15 min TTL)
→ Cache hit → serve immediately
→ Cache miss → query ClickHouse → cache → serve

Key principle: analytics is completely off the critical path. The redirect never waits for analytics processing.
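
A hedged sketch of that decoupling with kafka-python: the redirect returns first, and the click event is published fire-and-forget to the url.clicked topic from the diagram.

import json
from datetime import datetime, timezone
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_click(slug: str, user_id: str | None, country: str, device: str) -> None:
    # send() is asynchronous: the redirect response never waits on Kafka.
    # A failure here loses one analytics event, never a redirect.
    producer.send("url.clicked", {
        "slug": slug,
        "userId": user_id,
        "country": country,
        "device": device,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })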

Why ClickHouse over Cassandra for analytics storage:

ClickHouse is a columnar database optimised for aggregation queries — “sum clicks grouped by country over last 30 days.” Cassandra is optimised for key lookups, not aggregations.


Extension: Custom Domains

The ask: premium users use their own domain — mycompany.com/abc123 instead of bit.ly/abc123.

New components needed:

custom_domains table (PostgreSQL):
  domain        → "mycompany.com"
  user_id       → "u456"
  verified      → true
  ssl_cert_status → "provisioned"

Request flow:

mycompany.com/abc123 arrives
→ DNS points mycompany.com to your servers
→ App server reads Host header → "mycompany.com"
→ Lookup custom_domains table → find user u456
→ Lookup slug "abc123" for user u456
→ Redirect
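
A rough sketch of that Host-header resolution (Flask again; lookup_domain and lookup_slug are hypothetical queries against PostgreSQL and Cassandra respectively):

from flask import Flask, abort, redirect, request

app = Flask(__name__)

DEFAULT_DOMAIN = "bit.ly"

def lookup_domain(domain: str) -> str | None:
    # Hypothetical: SELECT user_id FROM custom_domains WHERE domain = %s AND verified
    return None  # placeholder

def lookup_slug(slug: str, user_id: str | None) -> str | None:
    # Hypothetical: slug -> long_url lookup, scoped to the owning user if needed
    return None  # placeholder

@app.route("/<slug>")
def follow(slug: str):
    host = request.host.split(":")[0]   # e.g. "mycompany.com"
    user_id = None
    if host != DEFAULT_DOMAIN:
        user_id = lookup_domain(host)
        if user_id is None:
            abort(404)                  # unknown or unverified custom domain
    long_url = lookup_slug(slug, user_id)
    if long_url is None:
        abort(404)
    return redirect(long_url, code=302)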

Why PostgreSQL for custom_domains, not Cassandra:

Domain records are low-volume, relational (tied to user accounts and billing), and need uniqueness constraints plus transactional updates to fields like verified and ssl_cert_status.

Rule: use Cassandra when scale is massive and structure is simple. Use PostgreSQL when data is relational, transactional, and volume is manageable.

The real engineering challenge: SSL certificate provisioning per custom domain. Every mycompany.com needs its own HTTPS cert. Services like Let’s Encrypt automate this, but it’s the hardest part of the feature.


What Makes This a Strong Interview Answer

Most answers stop at: “Store URLs in a database, look up on click, redirect.”

A complete answer adds:

Concrete scale numbers that justify each component
A deliberate slug generation algorithm with its trade-offs
A cache layer with a named strategy (cache aside, 24hr TTL)
The right database for each data type (polyglot persistence)
A failure plan for every component
Extensions (analytics, custom domains) kept off the critical path

Key Takeaways

  1. Scale estimation first — numbers decide architecture.
  2. Auto-increment ID + Base62 encoding is the cleanest slug generation approach.
  3. Redis cache is not optional here — 115,000 reads/second needs it.
  4. Different data types → different databases (polyglot persistence).
  5. Always decouple non-critical work (analytics, notifications) via queues.
  6. Every component that can fail needs a failure plan.

Part of the system design series. Next: designing a notification system — fan-out at scale.