Introduction
Before Redis, before databases, before anything - you need to understand what you’re actually optimizing for. This is the foundation of system design: knowing whether you’re solving for speed or volume, and how to respond when your system needs to grow.
Lesson 1: Latency vs Throughput
Start With a Real Feeling
Imagine two restaurants:
- Restaurant A - Your food arrives in 5 minutes. But they can only serve 10 people per hour.
- Restaurant B - Your food arrives in 20 minutes. But they serve 200 people per hour.
Which is faster? → Restaurant A
Which handles more load? → Restaurant B
That’s exactly Latency vs Throughput.
The Definitions
| Term | Simple meaning | Technical meaning |
|---|---|---|
| Latency | How fast is one request? | Time taken for a single request to complete (ms) |
| Throughput | How many requests can you handle? | Number of requests processed per second (RPS) |
The Key Insight - They Are NOT the Same Thing
This is where most beginners go wrong. They think:
“If I make my system faster, it will handle more users too.”
Not always true.
A system can have:
- ✅ Low latency + ❌ Low throughput → Fast but can’t scale
- ❌ High latency + ✅ High throughput → Slow per user but handles millions
- ✅ Low latency + ✅ High throughput → The goal, but expensive and hard
- ❌ High latency + ❌ Low throughput → Broken system, fix immediately
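One way to feel the difference in code: with a single serial worker, throughput is just 1/latency - it's concurrency that decouples the two numbers. A tiny Python sketch (the 10ms sleep is a stand-in for real work):

```python
import time

def handle_request():
    time.sleep(0.01)  # pretend each request takes ~10 ms of work

n = 50
start = time.perf_counter()
for _ in range(n):        # one worker, requests handled strictly in series
    handle_request()
elapsed = time.perf_counter() - start

print(f"latency    ~ {elapsed / n * 1000:.1f} ms per request")
print(f"throughput ~ {n / elapsed:.0f} requests per second")  # ~1/latency
```

Run the same loop across a thread pool and latency stays around 10ms per request while throughput multiplies. That decoupling is exactly what the four combinations above describe.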
A Real System Example
Think about an API you’ve built. Say a /getUserProfile endpoint.
Scenario 1:
Your API responds in 10ms. But if 1000 users hit it at the same time, it slows to 8 seconds.
- Latency is great (10ms) when alone
- Throughput is terrible (can’t handle concurrent load)
- Root cause: Probably a single database connection, no connection pooling, no caching
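As a rough sketch of the usual first fix - a connection pool, here using psycopg2 (the DSN and the users table are placeholders, not details from the scenario):

```python
from psycopg2.pool import ThreadedConnectionPool

# One shared pool per process: connections are opened once and reused,
# instead of one fresh (expensive) connection per request.
pool = ThreadedConnectionPool(2, 20, "dbname=app user=app password=secret")

def get_user_profile(user_id: int):
    conn = pool.getconn()              # borrow a connection
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        pool.putconn(conn)             # hand it back for the next request
```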
Scenario 2:
You add Redis caching. Now 1000 users all get responses in 12ms.
- Latency went up slightly (10ms → 12ms) - because of the cache lookup overhead
- Throughput massively improved - now handles 1000 concurrent users
- You traded a tiny bit of latency for massive throughput gain
This is a real trade-off decision. You made it better for the system even though one number got slightly worse.
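Here's what that caching step might look like as a cache-aside sketch with redis-py (fetch_from_db is a placeholder for your existing DB path, and the 60-second TTL is an illustrative choice):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_from_db(user_id: int):
    ...  # placeholder: the slow database path from Scenario 1

def get_user_profile(user_id: int):
    key = f"user:{user_id}"
    cached = r.get(key)                    # the small lookup that adds ~2ms
    if cached is not None:
        return json.loads(cached)          # hit: the database is never touched
    profile = fetch_from_db(user_id)       # miss: fall through to the DB
    r.setex(key, 60, json.dumps(profile))  # keep it warm for 60 seconds
    return profile
```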
When Do You Think About Each?
You optimize for Latency when:
- User is waiting and staring at a screen (search results, checkout, login)
- Real-time systems - gaming, live chat, stock prices
- The experience feels broken if it’s slow
You optimize for Throughput when:
- Background jobs - sending emails, processing payments, generating reports
- Data pipelines - logs, analytics, events
- The user doesn’t directly feel the wait
The Bottleneck Idea
Here’s how this connects to real system design thinking:
Your system’s throughput is always limited by its slowest part.
Like a highway - you can have 10 lanes, but if they all merge into 1 lane at a bridge, the bridge is your bottleneck. It doesn’t matter how wide the highway is.
In a real system:
User Request → Load Balancer (✅ fast) → App Server (✅ fast) → Database (❌ slow)
It doesn’t matter that everything else is fast. The database is the bottleneck. Your entire system’s throughput is capped by how fast your DB can respond.
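In numbers (invented for illustration), end-to-end throughput is simply the minimum across stages:

```python
# Requests/second each stage can sustain - illustrative numbers only.
stages = {"load_balancer": 50_000, "app_servers": 20_000, "database": 2_000}

bottleneck = min(stages, key=stages.get)
print(bottleneck, stages[bottleneck])  # database 2000 - caps the whole system
```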
Finding the bottleneck is the most important skill in system design.
Exercise 1: Video Processing System
Scenario:
You are building a video processing system. When a user uploads a video, your system needs to
compress it, generate thumbnails, and extract metadata. Users upload about 500 videos per hour. The
processing for each video takes 2 minutes.
Questions:
- Is this a latency-sensitive or throughput-sensitive problem?
- Where is the likely bottleneck?
- What would you do first to fix it?
Analysis:
This is a throughput-sensitive problem because:
- It’s a background process - users don’t sit and wait
- Latency is expected and acceptable
- The CPU-intensive nature means we need to handle volume, not speed
The Bottleneck:
500 videos/hour arriving
↓
Single processor
(compress + thumbnail + metadata)
takes 2 min per video
↓
Can handle ~30 videos/hour
500 coming in. Only 30 going out. That gap is your bottleneck. The queue keeps growing and the system falls behind.
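The back-of-the-envelope math, spelled out:

```python
import math

arrivals_per_hour = 500
minutes_per_video = 2
per_worker_per_hour = 60 / minutes_per_video       # 30 videos/hour per worker

backlog_growth = arrivals_per_hour - per_worker_per_hour
workers_needed = math.ceil(arrivals_per_hour / per_worker_per_hour)

print(backlog_growth)  # 470.0 - the queue grows by this many videos each hour
print(workers_needed)  # 17 workers just to keep pace, with zero headroom
```

In practice you'd provision above 17 workers to absorb spikes and retries.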
The Solution - Worker Queue Architecture:
User Uploads Video
↓
Message Queue
(holds all 500)
↓ ↓ ↓
W1 W2 W3 ← Multiple workers processing in parallel
You can also split thumbnail generation, metadata extraction, and compression into separate workers - this is called task decomposition.
Key Addition: What happens if a worker fails mid-process? The queue saves you - if a worker crashes, the message stays in the queue and another worker picks it up. That’s why we use queues like Kafka or RabbitMQ - not just for speed, but for fault tolerance.
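Here's a minimal worker sketch using pika (RabbitMQ's Python client) that shows the ack-after-work pattern. The queue name and process_video are placeholders; you'd run one copy of this script per worker:

```python
import pika

def process_video(body: bytes) -> None:
    ...  # placeholder: compress + thumbnail + metadata

def on_message(ch, method, properties, body):
    process_video(body)
    # Ack only after the work succeeds. If the worker crashes before this
    # line, RabbitMQ redelivers the message to another worker.
    ch.basic_ack(delivery_tag=method.delivery_tag)

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="videos", durable=True)  # survive broker restarts
channel.basic_qos(prefetch_count=1)                  # one video at a time
channel.basic_consume(queue="videos", on_message_callback=on_message)
channel.start_consuming()
```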
Lesson 2: Scalability
You now know how to identify bottlenecks. Scalability is about what you do when you hit one.
Start With the Feeling
Your app just got featured on Product Hunt. Yesterday you had 100 users. Today you have 50,000. Your server is dying.
What do you do?
Most people’s first instinct is - “Make the server bigger.”
That’s valid. But it’s not always the right answer. And sometimes it’s a trap.
The Two Ways to Scale
Vertical Scaling - “Make it bigger”
Buy a more powerful machine. More CPU, more RAM, more storage.
Before: [Small Server] - 2 CPU / 4GB RAM
After: [BIG Server] - 32 CPU / 128GB RAM
Horizontal Scaling - “Add more of it”
Add more machines. Same size, just more of them.
Before: [Server 1]
After: [Server 1] [Server 2] [Server 3]
The Real Difference - And Why It Matters
| | Vertical | Horizontal |
|---|---|---|
| Cost | Gets disproportionately expensive at the high end | Grows roughly linearly |
| Limit | Hard ceiling - biggest machine has a max | Virtually unlimited |
| Complexity | Simple - nothing changes in your code | Complex - your code must handle it |
| Failure risk | One big machine = one big failure point | One machine dies, others continue |
| Speed to implement | Fast - just upgrade | Slower - needs architecture changes |
The trap with vertical scaling:
Doubling your server's specs rarely doubles your real-world capacity - but it at least doubles your cost.
At some point, no single machine in the world is big enough. That’s the hard ceiling.
The Hidden Requirement for Horizontal Scaling
Here’s what nobody tells you upfront.
When you have 3 servers handling requests, a new problem appears:
User logs in → hits Server 1 → session stored on Server 1
Next request → hits Server 2 → "who are you?" ← Server 2 has no session
Your application needs to be stateless to scale horizontally.
Stateless means - the server does not remember anything about you between requests. All state lives somewhere shared - like a database or Redis.
This is why Redis becomes critical at scale. It’s not just a cache - it’s the shared memory for all your servers.
Server 1 ──→ Redis (shared session store) ←── Server 2
Server 3 ──→ Redis (shared session store) ←── Server 4
Now any server can handle any request. The app servers themselves share nothing with each other - this pattern is often called a shared-nothing architecture, with all shared state pushed out to Redis.
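Concretely, "state lives somewhere shared" can be as simple as this Redis session store sketch (the key scheme and TTL are illustrative):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
SESSION_TTL = 30 * 60  # 30 minutes, refreshed on every save

def save_session(session_id: str, data: dict) -> None:
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")  # any app server can run this
    return json.loads(raw) if raw else None
```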
When Do You Choose Which?
Reach for Vertical first when:
- Early stage - your app is young, traffic is unpredictable
- Stateful systems that are hard to distribute (some legacy databases)
- You need a quick fix right now
- The cost jump is still reasonable
Reach for Horizontal when:
- Traffic is growing consistently and unpredictably
- You need fault tolerance - one server dying shouldn’t kill the app
- You’re building for millions of users
- Your application is or can be made stateless
A Real Decision Scenario
You’re building a food delivery app. You have these two parts:
- Part A - The main API (handles orders, user requests)
- Part B - The database (PostgreSQL, stores all orders)
Traffic spikes every evening 7-9pm. What do you scale and how?
Part A - API servers → Scale horizontally. Stateless. Easy to add more. During off-peak hours, scale back down. This is what cloud auto-scaling does.
Part B - Database → Trickier. You can’t just add 5 PostgreSQL servers like you add API servers - they all need the same data. So here you first go vertical. Then later you introduce techniques like read replicas and sharding.
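For Part A, auto-scaling is conceptually just a feedback loop on load. A toy sketch - the thresholds and bounds are invented, and in practice the cloud provider runs this logic for you as a managed policy:

```python
def desired_servers(current: int, avg_cpu: float) -> int:
    """Toy threshold-based autoscaler for the stateless API tier."""
    if avg_cpu > 0.75:                 # 7-9pm rush: add capacity
        return min(current + 2, 10)    # never exceed 10 servers
    if avg_cpu < 0.25:                 # off-peak: shed idle servers
        return max(current - 1, 2)     # never drop below 2 for redundancy
    return current

print(desired_servers(3, 0.90))  # 5 - scaling into the evening spike
print(desired_servers(5, 0.10))  # 4 - scaling back down after 9pm
```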
Critical insight:
Different parts of your system scale differently. Your job is to identify which part is under pressure and apply the right scaling strategy to that specific part.
The Bigger Mental Model
Scalability is not a one-time decision. It’s a progression:
- Stage 1: One server does everything ← You start here
- Stage 2: Vertical scale - bigger server ← Quick fix
- Stage 3: Separate concerns - DB on its own server
- Stage 4: Horizontal scale - multiple app servers
- Stage 5: Caching layer - reduce DB pressure
- Stage 6: Database scaling - replicas, sharding
Most companies don’t start at Stage 6. They grow into it. Your job in an interview is to show you understand this progression - and can identify which stage a given system is at and what it needs next.
Exercise 2: News Website
Scenario:
You built a news website. 10 million people visit daily, mostly to read articles. Very few people
write articles - maybe 50 editors publishing content. Your database is under massive load and
responses are getting slow.
Questions:
- Is this a read-heavy or write-heavy system?
- Does it make more sense to scale vertically or horizontally here?
- What specific bottleneck are you solving?
Analysis:
This is a read-heavy system - 10 million readers vs 50 writers.
The Problem:
Without cache:
10M users → App Servers → Database (10M queries) ← dies
The Solution:
Horizontal scaling for the API servers is correct - with 10 million daily users, a single server will die. But the real fix is reducing how many requests ever reach the database in the first place.
With cache:
10M users → App Servers → Cache (hit 90% of time) → Database (only 1M queries)
You didn’t make the database faster. You made it less needed.
The Complete Architecture:
10M users
↓
Load Balancer
↓
App Servers (horizontal - 5 to 10 servers)
↓
Redis Cache ← check here first (90% of articles served from here)
↓ (cache miss only)
Read Replicas (3 to 4 DB copies handling read traffic)
↓
Primary DB (only handles the 50 editors writing articles)
Read Replicas - A specific horizontal scaling pattern for databases:
All writes → Primary DB
All reads → Replica 1 / Replica 2 / Replica 3
One database handles all writes. Multiple copies handle all reads. Since virtually all traffic on this news site is reads - 10 million readers against 50 writers - you distribute that load across replicas.
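A hedged sketch of how the application side routes around this split (hostnames are placeholders; real deployments usually lean on a proxy or driver feature rather than hand-rolled routing):

```python
import random
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=news"
REPLICA_DSNS = [
    "host=db-replica-1 dbname=news",
    "host=db-replica-2 dbname=news",
    "host=db-replica-3 dbname=news",
]

def connect_for(query: str):
    """Route writes to the primary, spread reads across the replicas."""
    is_write = query.lstrip().lower().startswith(("insert", "update", "delete"))
    dsn = PRIMARY_DSN if is_write else random.choice(REPLICA_DSNS)
    return psycopg2.connect(dsn)
```

One caveat worth knowing: replicas lag the primary slightly, so a reader may briefly see a stale article - usually a fine trade for a news site.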
The Reusable Rule
Every time you see a read-heavy system in any problem, your brain should immediately think:
Cache first. Read replicas second. Primary DB only for writes.
This pattern appears in news sites, social media feeds, product catalogs, Wikipedia - anywhere reads massively outnumber writes.
Key Takeaways
- Latency (speed of one request) and Throughput (volume of requests) are different problems requiring different solutions
- Always identify the bottleneck - your system is only as fast as its slowest part
- Vertical scaling (bigger machine) is fast but has limits; Horizontal scaling (more machines) is unlimited but requires stateless architecture
- Different parts of your system scale differently - apply the right strategy to each component
- In read-heavy systems: Cache first, Read replicas second, Primary DB only for writes
This is part of a system design fundamentals series. Next up: CAP Theorem and the trade-offs in distributed systems.