Introduction
Before Redis, before databases, before anything - you need to understand what you’re actually optimizing for. This is the foundation of system design: knowing whether you’re solving for speed or volume, and how to respond when your system needs to grow.
Lesson 1: Latency vs Throughput
Start With a Real Feeling
Imagine two restaurants:
- Restaurant A - Your food arrives in 5 minutes. But they can only serve 10 people per hour.
- Restaurant B - Your food arrives in 20 minutes. But they serve 200 people per hour.
Which is faster? → Restaurant A
Which handles more load? → Restaurant B
That’s exactly Latency vs Throughput.
The Definitions
| Term | Simple meaning | Technical meaning |
|---|---|---|
| Latency | How fast is one request? | Time taken for a single request to complete (ms) |
| Throughput | How many requests can you handle? | Number of requests processed per second (RPS) |
The Key Insight - They Are NOT the Same Thing
This is where most beginners go wrong. They think:
“If I make my system faster, it will handle more users too.”
Not always true.
A system can have:
- ✅ Low latency + ❌ Low throughput → Fast but can’t scale
- ❌ High latency + ✅ High throughput → Slow per user but handles millions
- ✅ Low latency + ✅ High throughput → The goal, but expensive and hard
- ❌ High latency + ❌ Low throughput → Broken system, fix immediately
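One way to feel the difference in code: with a single serial worker, throughput is just 1/latency - it's concurrency that decouples the two numbers. A tiny Python sketch (the 10ms sleep is a stand-in for real work):

```python
import time

def handle_request():
    time.sleep(0.01)  # pretend each request takes ~10 ms of work

n = 50
start = time.perf_counter()
for _ in range(n):        # one worker, requests handled strictly in series
    handle_request()
elapsed = time.perf_counter() - start

print(f"latency    ~ {elapsed / n * 1000:.1f} ms per request")
print(f"throughput ~ {n / elapsed:.0f} requests per second")  # ~1/latency
```

Run the same loop across a thread pool and latency stays around 10ms per request while throughput multiplies. That decoupling is exactly what the four combinations above describe.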
A Real System Example
Think about an API you’ve built. Say a /getUserProfile endpoint.
Scenario 1:
Your API responds in 10ms. But if 1000 users hit it at the same time, it slows to 8 seconds.
- Latency is great (10ms) when alone
- Throughput is terrible (can’t handle concurrent load)
- Root cause: Probably a single database connection, no connection pooling, no caching
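As a rough sketch of the usual first fix - a connection pool, here using psycopg2 (the DSN and the users table are placeholders, not details from the scenario):

```python
from psycopg2.pool import ThreadedConnectionPool

# One shared pool per process: connections are opened once and reused,
# instead of one fresh (expensive) connection per request.
pool = ThreadedConnectionPool(2, 20, "dbname=app user=app password=secret")

def get_user_profile(user_id: int):
    conn = pool.getconn()              # borrow a connection
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        pool.putconn(conn)             # hand it back for the next request
```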
Scenario 2:
You add Redis caching. Now 1000 users all get responses in 12ms.
- Latency went up slightly (10ms → 12ms) - because of the cache lookup overhead
- Throughput massively improved - now handles 1000 concurrent users
- You traded a tiny bit of latency for massive throughput gain
This is a real trade-off decision. You made it better for the system even though one number got slightly worse.
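Here's what that caching step might look like as a cache-aside sketch with redis-py (fetch_from_db is a placeholder for your existing DB path, and the 60-second TTL is an illustrative choice):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_from_db(user_id: int):
    ...  # placeholder: the slow database path from Scenario 1

def get_user_profile(user_id: int):
    key = f"user:{user_id}"
    cached = r.get(key)                    # the small lookup that adds ~2ms
    if cached is not None:
        return json.loads(cached)          # hit: the database is never touched
    profile = fetch_from_db(user_id)       # miss: fall through to the DB
    r.setex(key, 60, json.dumps(profile))  # keep it warm for 60 seconds
    return profile
```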
When Do You Think About Each?
You optimize for Latency when:
- User is waiting and staring at a screen (search results, checkout, login)
- Real-time systems - gaming, live chat, stock prices
- The experience feels broken if it’s slow
You optimize for Throughput when:
- Background jobs - sending emails, processing payments, generating reports
- Data pipelines - logs, analytics, events
- The user doesn’t directly feel the wait
The Bottleneck Idea
Here’s how this connects to real system design thinking:
Your system’s throughput is always limited by its slowest part.
Like a highway - you can have 10 lanes, but if they all merge into 1 lane at a bridge, the bridge is your bottleneck. It doesn’t matter how wide the highway is.
In a real system:
User Request → Load Balancer (✅ fast) → App Server (✅ fast) → Database (❌ slow)
It doesn’t matter that everything else is fast. The database is the bottleneck. Your entire system’s throughput is capped by how fast your DB can respond.
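In numbers (invented for illustration), end-to-end throughput is simply the minimum across stages:

```python
# Requests/second each stage can sustain - illustrative numbers only.
stages = {"load_balancer": 50_000, "app_servers": 20_000, "database": 2_000}

bottleneck = min(stages, key=stages.get)
print(bottleneck, stages[bottleneck])  # database 2000 - caps the whole system
```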
Finding the bottleneck is the most important skill in system design.
Exercise 1: Video Processing System
Scenario:
You are building a video processing system. When a user uploads a video, your system needs to
compress it, generate thumbnails, and extract metadata. Users upload about 500 videos per hour. The
processing for each video takes 2 minutes.
Questions:
- Is this a latency-sensitive or throughput-sensitive problem?
- Where is the likely bottleneck?
- What would you do first to fix it?
Analysis:
This is a throughput-sensitive problem because:
- It’s a background process - users don’t sit and wait
- Latency is expected and acceptable
- The CPU-intensive nature means we need to handle volume, not speed
The Bottleneck:
500 videos/hour arriving
↓
Single processor
(compress + thumbnail + metadata)
takes 2 min per video
↓
Can handle ~30 videos/hour
500 coming in. Only 30 going out. That gap is your bottleneck. The queue keeps growing and the system falls behind.
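The back-of-the-envelope math, spelled out:

```python
import math

arrivals_per_hour = 500
minutes_per_video = 2
per_worker_per_hour = 60 / minutes_per_video       # 30 videos/hour per worker

backlog_growth = arrivals_per_hour - per_worker_per_hour
workers_needed = math.ceil(arrivals_per_hour / per_worker_per_hour)

print(backlog_growth)  # 470.0 - the queue grows by this many videos each hour
print(workers_needed)  # 17 workers just to keep pace, with zero headroom
```

In practice you'd provision above 17 workers to absorb spikes and retries.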
The Solution - Worker Queue Architecture:
User Uploads Video
↓
Message Queue
(holds all 500)
↓ ↓ ↓
W1 W2 W3 ← Multiple workers processing in parallel
You can also split thumbnail generation, metadata extraction, and compression into separate workers - this is called task decomposition.
Key Addition: What happens if a worker fails mid-process? The queue saves you - if a worker crashes, the message stays in the queue and another worker picks it up. That’s why we use queues like Kafka or RabbitMQ - not just for speed, but for fault tolerance.
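Here's a minimal worker sketch using pika (RabbitMQ's Python client) that shows the ack-after-work pattern. The queue name and process_video are placeholders; you'd run one copy of this script per worker:

```python
import pika

def process_video(body: bytes) -> None:
    ...  # placeholder: compress + thumbnail + metadata

def on_message(ch, method, properties, body):
    process_video(body)
    # Ack only after the work succeeds. If the worker crashes before this
    # line, RabbitMQ redelivers the message to another worker.
    ch.basic_ack(delivery_tag=method.delivery_tag)

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="videos", durable=True)  # survive broker restarts
channel.basic_qos(prefetch_count=1)                  # one video at a time
channel.basic_consume(queue="videos", on_message_callback=on_message)
channel.start_consuming()
```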
Lesson 2: Scalability
You now know how to identify bottlenecks. Scalability is about what you do when you hit one.
Start With the Feeling
Your app just got featured on Product Hunt. Yesterday you had 100 users. Today you have 50,000. Your server is dying.
What do you do?
Most people’s first instinct is - “Make the server bigger.”
That’s valid. But it’s not always the right answer. And sometimes it’s a trap.
The Two Ways to Scale
Vertical Scaling - “Make it bigger”
Buy a more powerful machine. More CPU, more RAM, more storage.
Before: [Small Server] - 2 CPU / 4GB RAM
After: [BIG Server] - 32 CPU / 128GB RAM
Horizontal Scaling - “Add more of it”
Add more machines. Same size, just more of them.
Before: [Server 1]
After: [Server 1] [Server 2] [Server 3]
The Real Difference - And Why It Matters
| | Vertical | Horizontal |
|---|---|---|
| Cost | Gets disproportionately expensive at the high end | Grows roughly linearly |
| Limit | Hard ceiling - biggest machine has a max | Virtually unlimited |
| Complexity | Simple - nothing changes in your code | Complex - your code must handle it |
| Failure risk | One big machine = one big failure point | One machine dies, others continue |
| Speed to implement | Fast - just upgrade | Slower - needs architecture changes |
The trap with vertical scaling:
Doubling your server's specs rarely doubles your real-world capacity - but it at least doubles your cost.
At some point, no single machine in the world is big enough. That’s the hard ceiling.
The Hidden Requirement for Horizontal Scaling
Here’s what nobody tells you upfront.
When you have 3 servers handling requests, a new problem appears:
User logs in → hits Server 1 → session stored on Server 1
Next request → hits Server 2 → "who are you?" ← Server 2 has no session
Your application needs to be stateless to scale horizontally.
Stateless means - the server does not remember anything about you between requests. All state lives somewhere shared - like a database or Redis.
This is why Redis becomes critical at scale. It’s not just a cache - it’s the shared memory for all your servers.
Server 1 ──→ Redis (shared session store) ←── Server 2
Server 3 ──→ Redis (shared session store) ←── Server 4
Now any server can handle any request. The app servers themselves share nothing with each other - this pattern is often called a shared-nothing architecture, with all shared state pushed out to Redis.
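Concretely, "state lives somewhere shared" can be as simple as this Redis session store sketch (the key scheme and TTL are illustrative):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
SESSION_TTL = 30 * 60  # 30 minutes, refreshed on every save

def save_session(session_id: str, data: dict) -> None:
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")  # any app server can run this
    return json.loads(raw) if raw else None
```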
When Do You Choose Which?
Reach for Vertical first when:
- Early stage - your app is young, traffic is unpredictable
- Stateful systems that are hard to distribute (some legacy databases)
- You need a quick fix right now
- The cost jump is still reasonable
Reach for Horizontal when:
- Traffic is growing consistently and unpredictably
- You need fault tolerance - one server dying shouldn’t kill the app
- You’re building for millions of users
- Your application is or can be made stateless
A Real Decision Scenario
You’re building a food delivery app. You have these two parts:
- Part A - The main API (handles orders, user requests)
- Part B - The database (PostgreSQL, stores all orders)
Traffic spikes every evening 7-9pm. What do you scale and how?
Part A - API servers → Scale horizontally. Stateless. Easy to add more. During off-peak hours, scale back down. This is what cloud auto-scaling does.
Part B - Database → Trickier. You can’t just add 5 PostgreSQL servers like you add API servers - they all need the same data. So here you first go vertical. Then later you introduce techniques like read replicas and sharding.
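For Part A, auto-scaling is conceptually just a feedback loop on load. A toy sketch - the thresholds and bounds are invented, and in practice the cloud provider runs this logic for you as a managed policy:

```python
def desired_servers(current: int, avg_cpu: float) -> int:
    """Toy threshold-based autoscaler for the stateless API tier."""
    if avg_cpu > 0.75:                 # 7-9pm rush: add capacity
        return min(current + 2, 10)    # never exceed 10 servers
    if avg_cpu < 0.25:                 # off-peak: shed idle servers
        return max(current - 1, 2)     # never drop below 2 for redundancy
    return current

print(desired_servers(3, 0.90))  # 5 - scaling into the evening spike
print(desired_servers(5, 0.10))  # 4 - scaling back down after 9pm
```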
Critical insight:
Different parts of your system scale differently. Your job is to identify which part is under pressure and apply the right scaling strategy to that specific part.
The Bigger Mental Model
Scalability is not a one-time decision. It’s a progression:
- Stage 1: One server does everything ← You start here
- Stage 2: Vertical scale - bigger server ← Quick fix
- Stage 3: Separate concerns - DB on its own server
- Stage 4: Horizontal scale - multiple app servers
- Stage 5: Caching layer - reduce DB pressure
- Stage 6: Database scaling - replicas, sharding
Most companies don’t start at Stage 6. They grow into it. Your job in an interview is to show you understand this progression - and can identify which stage a given system is at and what it needs next.
Exercise 2: News Website
Scenario:
You built a news website. 10 million people visit daily, mostly to read articles. Very few people
write articles - maybe 50 editors publishing content. Your database is under massive load and
responses are getting slow.
Questions:
- Is this a read-heavy or write-heavy system?
- Does it make more sense to scale vertically or horizontally here?
- What specific bottleneck are you solving?
Analysis:
This is a read-heavy system - 10 million readers vs 50 writers.
The Problem:
Without cache:
10M users → App Servers → Database (10M queries) ← dies
The Solution:
Horizontal scaling for the API servers is correct - with 10 million daily users, a single server will die. But the real fix is reducing how many requests ever reach the database in the first place.
With cache:
10M users → App Servers → Cache (hit 90% of time) → Database (only 1M queries)
You didn’t make the database faster. You made it less needed.
The Complete Architecture:
10M users
↓
Load Balancer
↓
App Servers (horizontal - 5 to 10 servers)
↓
Redis Cache ← check here first (90% of articles served from here)
↓ (cache miss only)
Read Replicas (3 to 4 DB copies handling read traffic)
↓
Primary DB (only handles the 50 editors writing articles)
Read Replicas - A specific horizontal scaling pattern for databases:
All writes → Primary DB
All reads → Replica 1 / Replica 2 / Replica 3
One database handles all writes. Multiple copies handle all reads. Since virtually all traffic on this news site is reads - 10 million readers against 50 writers - you distribute that load across replicas.
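A hedged sketch of how the application side routes around this split (hostnames are placeholders; real deployments usually lean on a proxy or driver feature rather than hand-rolled routing):

```python
import random
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=news"
REPLICA_DSNS = [
    "host=db-replica-1 dbname=news",
    "host=db-replica-2 dbname=news",
    "host=db-replica-3 dbname=news",
]

def connect_for(query: str):
    """Route writes to the primary, spread reads across the replicas."""
    is_write = query.lstrip().lower().startswith(("insert", "update", "delete"))
    dsn = PRIMARY_DSN if is_write else random.choice(REPLICA_DSNS)
    return psycopg2.connect(dsn)
```

One caveat worth knowing: replicas lag the primary slightly, so a reader may briefly see a stale article - usually a fine trade for a news site.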
The Reusable Rule
Every time you see a read-heavy system in any problem, your brain should immediately think:
Cache first. Read replicas second. Primary DB only for writes.
This pattern appears in news sites, social media feeds, product catalogs, Wikipedia - anywhere reads massively outnumber writes.
Key Takeaways
- Latency (speed of one request) and Throughput (volume of requests) are different problems requiring different solutions
- Always identify the bottleneck - your system is only as fast as its slowest part
- Vertical scaling (bigger machine) is fast but has limits; Horizontal scaling (more machines) is unlimited but requires stateless architecture
- Different parts of your system scale differently - apply the right strategy to each component
- In read-heavy systems: Cache first, Read replicas second, Primary DB only for writes
This is part of a system design fundamentals series. Next up: CAP Theorem and the trade-offs in distributed systems.