System Design Fundamentals: Performance & Scale

Published: at 10:30 AM
(9 min read)


Introduction

Before Redis, before databases, before anything - you need to understand what you’re actually optimizing for. This is the foundation of system design: knowing whether you’re solving for speed or volume, and how to respond when your system needs to grow.

Lesson 1: Latency vs Throughput

Start With a Real Feeling

Imagine two restaurants:

Restaurant A: a tiny place with one chef. Your food arrives in 5 minutes, but there are only four tables.
Restaurant B: a huge buffet hall. Your food takes 20 minutes, but it seats 500 people at once.

Which is faster? → Restaurant A
Which handles more load? → Restaurant B

That’s exactly Latency vs Throughput.

The Definitions

| Term | Simple meaning | Technical meaning |
|---|---|---|
| Latency | How fast is one request? | Time taken for a single request to complete (ms) |
| Throughput | How many requests can you handle? | Number of requests processed per second (RPS) |

The Key Insight - They Are NOT the Same Thing

This is where most beginners go wrong. They think:

“If I make my system faster, it will handle more users too.”

Not always true.

A system can have:

Low latency but low throughput - blazing fast for one user, but it collapses under load.
High throughput but higher latency - it handles massive volume, but each individual request takes a bit longer.
A Real System Example

Think about an API you’ve built. Say a /getUserProfile endpoint.

Scenario 1:
Your API responds in 10ms. But if 1000 users hit it at the same time, it slows to 8 seconds.

Scenario 2:
You add Redis caching. Now 1000 users all get responses in 12ms.

This is a real trade-off decision. Latency for a single request went from 10ms to 12ms - slightly worse - but under load, throughput improved enormously: 1000 concurrent users no longer wait 8 seconds. You made it better for the system even though one number got slightly worse.
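A back-of-envelope model makes the trade-off concrete. This is a sketch, not a benchmark: the numbers mirror the scenario above, and `workers` is an illustrative stand-in for how much of the load can actually be served in parallel.

```python
import math

def worst_case_wait(concurrent_requests, latency_s, workers):
    """If requests queue behind `workers` parallel handlers, the last
    request waits for every batch ahead of it."""
    batches = math.ceil(concurrent_requests / workers)
    return batches * latency_s

# Scenario 1: 10ms per request, but effectively serial under load.
print(worst_case_wait(1000, 0.010, workers=1))     # ~10 seconds for the last caller

# Scenario 2: cached path at 12ms, served fully in parallel.
print(worst_case_wait(1000, 0.012, workers=1000))  # ~12ms for everyone
```

The per-request number got worse (10ms → 12ms), but the number users actually experience under load collapsed from seconds to milliseconds.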

When Do You Think About Each?

You optimize for Latency when:

A human is waiting on the response in real time - search results, page loads, payments, anything interactive.

You optimize for Throughput when:

You are processing large volumes of work and nobody is staring at a spinner - batch jobs, analytics pipelines, background processing.

The Bottleneck Idea

Here’s how this connects to real system design thinking:

Your system’s throughput is always limited by its slowest part.

Like a highway - you can have 10 lanes, but if they all merge into 1 lane at a bridge, the bridge is your bottleneck. It doesn’t matter how wide the highway is.

In a real system:

User Request → Load Balancer → App Server → Database
                   ✅ fast          ✅ fast       ❌ slow

It doesn’t matter that everything else is fast. The database is the bottleneck. Your entire system’s throughput is capped by how fast your DB can respond.

Finding the bottleneck is the most important skill in system design.
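The bottleneck rule can be stated in three lines of code. The capacities below are illustrative numbers for the pipeline in the diagram, not measurements:

```python
# Illustrative per-stage capacities, in requests/second.
stages = {
    "load_balancer": 50_000,
    "app_server": 8_000,
    "database": 900,  # the slow part
}

# The whole system's throughput is the minimum across stages.
bottleneck = min(stages, key=stages.get)
print(bottleneck, stages[bottleneck])  # database 900
```

Speeding up the load balancer or app server changes nothing here; only raising the database's 900 moves the system's ceiling.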

Exercise 1: Video Processing System

Scenario:
You are building a video processing system. When a user uploads a video, your system needs to compress it, generate thumbnails, and extract metadata. Users upload about 500 videos per hour. The processing for each video takes 2 minutes.

Questions:

  1. Is this a latency-sensitive or throughput-sensitive problem?
  2. Where is the likely bottleneck?
  3. What would you do first to fix it?

Analysis:

This is a throughput-sensitive problem because:

No user is waiting on the result in real time - a minute or two of processing delay is acceptable.
What matters is whether the system can keep up with 500 videos arriving every hour.
The Bottleneck:

500 videos/hour arriving
        ↓
  Single processor
  (compress + thumbnail + metadata)
  takes 2 min per video
        ↓
  Can handle ~30 videos/hour

500 coming in. Only 30 going out. That gap is your bottleneck. The queue keeps growing and the system falls behind.
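The gap also tells you how many workers close it. A quick calculation from the numbers in the scenario:

```python
import math

arrivals_per_hour = 500   # uploads/hour, from the scenario
minutes_per_video = 2     # processing time per video

videos_per_worker_per_hour = 60 // minutes_per_video  # 30
workers_needed = math.ceil(arrivals_per_hour / videos_per_worker_per_hour)

print(videos_per_worker_per_hour, workers_needed)  # 30 17
```

One worker handles 30 videos/hour, so you need at least 17 workers running in parallel just to keep pace with arrivals - before accounting for spikes or failures.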

The Solution - Worker Queue Architecture:

User Uploads Video
        ↓
   Message Queue
  (holds all 500)
     ↓  ↓  ↓
  W1  W2  W3  ← Multiple workers processing in parallel

You can also split thumbnail generation, metadata extraction, and compression into separate workers - this is called task decomposition.

Key Addition: What happens if a worker fails mid-process? The queue saves you - if a worker crashes, the message stays in the queue and another worker picks it up. That’s why we use queues like Kafka or RabbitMQ - not just for speed, but for fault tolerance.
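The worker-queue pattern can be sketched with Python's standard library alone. This is a minimal in-process illustration - a real system would use a broker like Kafka or RabbitMQ, which is also what provides the re-delivery on worker crash described above:

```python
import queue
import threading

jobs = queue.Queue()
processed = []

def worker(worker_id):
    # Each worker pulls the next video off the shared queue. With a real
    # broker, a crashed worker's message would be re-delivered to another.
    while True:
        try:
            video = jobs.get(timeout=0.1)
        except queue.Empty:
            return  # queue drained, worker exits
        processed.append((worker_id, video))  # compress/thumbnail/metadata here
        jobs.task_done()

for i in range(6):  # six uploads land in the queue
    jobs.put(f"video-{i}")

threads = [threading.Thread(target=worker, args=(w,)) for w in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(processed))  # 6 - three workers drained the queue in parallel
```

The uploads and the processing are decoupled: the queue absorbs bursts, and adding workers raises throughput without touching the upload path.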

Lesson 2: Scalability

You now know how to identify bottlenecks. Scalability is about what you do when you hit one.

Start With the Feeling

Your app just got featured on Product Hunt. Yesterday you had 100 users. Today you have 50,000. Your server is dying.

What do you do?

Most people’s first instinct is - “Make the server bigger.”

That’s valid. But it’s not always the right answer. And sometimes it’s a trap.

The Two Ways to Scale

Vertical Scaling - “Make it bigger”

Buy a more powerful machine. More CPU, more RAM, more storage.

Before:          After:
[Small Server]   [BIG Server]
2 CPU / 4GB RAM  32 CPU / 128GB RAM

Horizontal Scaling - “Add more of it”

Add more machines. Same size, just more of them.

Before:          After:
[Server 1]       [Server 1]
                 [Server 2]
                 [Server 3]

The Real Difference - And Why It Matters

| | Vertical | Horizontal |
|---|---|---|
| Cost | Gets exponentially expensive | Grows linearly |
| Limit | Hard ceiling - biggest machine has a max | Virtually unlimited |
| Complexity | Simple - nothing changes in your code | Complex - your code must handle it |
| Failure risk | One big machine = one big failure point | One machine dies, others continue |
| Speed to implement | Fast - just upgrade | Slower - needs architecture changes |

The trap with vertical scaling:

Doubling your server size does not double your capacity. But it does at least double your cost - high-end hardware gets disproportionately expensive.

At some point, no single machine in the world is big enough. That’s the hard ceiling.

The Hidden Requirement for Horizontal Scaling

Here’s what nobody tells you upfront.

When you have 3 servers handling requests, a new problem appears:

User logs in → hits Server 1 → session stored on Server 1
Next request  → hits Server 2 → "who are you?" ← Server 2 has no session

Your application needs to be stateless to scale horizontally.

Stateless means - the server does not remember anything about you between requests. All state lives somewhere shared - like a database or Redis.

This is why Redis becomes critical at scale. It’s not just a cache - it’s the shared memory for all your servers.

Server 1  ──→  Redis (shared session store)  ←──  Server 2
Server 3  ──→                                ←──  Server 4

Now any server can handle any request. Because the app servers keep no local state and share nothing with each other - all shared state lives in external stores like Redis - this is often described as a shared-nothing architecture.
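The session problem and its fix can be shown in a few lines. Here a plain dict stands in for Redis, and the server and user names are illustrative:

```python
store = {}  # shared session store (Redis in production)

class AppServer:
    def __init__(self, name):
        self.name = name  # note: no per-user state kept on the server itself

    def login(self, user):
        # Session is written to the shared store, not server memory.
        store[f"session:{user}"] = {"user": user}

    def handle(self, user):
        session = store.get(f"session:{user}")
        if session is None:
            return f"{self.name}: who are you?"
        return f"{self.name}: hello {user}"

s1, s2 = AppServer("server-1"), AppServer("server-2")
s1.login("alice")          # login request happens to hit server 1
print(s2.handle("alice"))  # server-2: hello alice
```

Had the session lived in `s1`'s memory instead of `store`, server 2 would have answered "who are you?" - exactly the failure in the diagram above.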

When Do You Choose Which?

Reach for Vertical first when:

You are early stage, traffic is modest, and you want the simplest fix - or the component is inherently stateful, like a database.

Reach for Horizontal when:

Traffic has outgrown what any single machine can handle, you need fault tolerance, and your application is (or can be made) stateless.

A Real Decision Scenario

You’re building a food delivery app. You have these two parts:

Part A: Stateless API servers that take orders and serve the app.
Part B: A PostgreSQL database that stores the orders, menus, and users.

Traffic spikes every evening 7-9pm. What do you scale and how?

Part A - API servers → Scale horizontally. Stateless. Easy to add more. During off-peak hours, scale back down. This is what cloud auto-scaling does.

Part B - Database → Trickier. You can’t just add 5 PostgreSQL servers like you add API servers - they all need the same data. So here you first go vertical. Then later you introduce techniques like read replicas and sharding.
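The auto-scaling behavior for Part A can be sketched as a simple rule. The per-server capacity (500 RPS) and the floor/ceiling limits here are invented for illustration:

```python
import math

def desired_servers(current_rps, rps_per_server=500, floor=2, ceiling=20):
    """Scale the stateless API tier to match load, within limits."""
    needed = math.ceil(current_rps / rps_per_server)
    return max(floor, min(needed, ceiling))

print(desired_servers(800))   # off-peak: 2
print(desired_servers(6000))  # dinner rush: 12
```

This only works because the API tier is stateless - any of the 2 or 12 servers can take any request, so machines can appear and disappear freely.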

Critical insight:

Different parts of your system scale differently. Your job is to identify which part is under pressure and apply the right scaling strategy to that specific part.

The Bigger Mental Model

Scalability is not a one-time decision. It’s a progression:

  1. Stage 1: One server does everything ← You start here
  2. Stage 2: Vertical scale - bigger server ← Quick fix
  3. Stage 3: Separate concerns - DB on its own server
  4. Stage 4: Horizontal scale - multiple app servers
  5. Stage 5: Caching layer - reduce DB pressure
  6. Stage 6: Database scaling - replicas, sharding

Most companies don’t start at Stage 6. They grow into it. Your job in an interview is to show you understand this progression - and can identify which stage a given system is at and what it needs next.

Exercise 2: News Website

Scenario:
You built a news website. 10 million people visit daily, mostly to read articles. Very few people write articles - maybe 50 editors publishing content. Your database is under massive load and responses are getting slow.

Questions:

  1. Is this a read-heavy or write-heavy system?
  2. Does it make more sense to scale vertically or horizontally here?
  3. What specific bottleneck are you solving?

Analysis:

This is a read-heavy system - 10 million readers vs 50 writers.

The Problem:

Without cache:
10M users → App Servers → Database (10M queries) ← dies

The Solution:

Horizontal scaling for API servers is correct - 10 million users, single server will die. But the real fix is reducing how many requests ever reach the database in the first place.

With cache:
10M users → App Servers → Cache (hit 90% of time) → Database (only 1M queries)

You didn’t make the database faster. You made it less needed.
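The arithmetic behind "less needed" is worth doing once, using the 90% hit rate from the diagram:

```python
daily_reads = 10_000_000
cache_hit_rate = 0.90  # the 90% figure from the text

db_queries = round(daily_reads * (1 - cache_hit_rate))
print(db_queries)  # 1000000 - the database sees a tenth of the traffic
```

Every point of hit rate you gain removes 100,000 daily queries from the database, which is why cache hit rate is one of the first metrics to watch in a read-heavy system.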

The Complete Architecture:

10M users
    ↓
Load Balancer
    ↓
App Servers (horizontal - 5 to 10 servers)
    ↓
Redis Cache ← check here first (90% of articles served from here)
    ↓ (cache miss only)
Read Replicas (3 to 4 DB copies handling read traffic)
    ↑ (replication)
Primary DB (only handles the 50 editors writing articles)

Read Replicas - A specific horizontal scaling pattern for databases:

All writes → Primary DB
All reads  → Replica 1
            Replica 2
            Replica 3

One database handles all writes. Multiple copies handle all reads. Since 99% of this news site is reads - you distribute that load across replicas.
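A toy router shows the split. The replica names and the SQL-prefix check are deliberate simplifications - real drivers and proxies do this more carefully:

```python
import itertools

# Round-robin over the read replicas; writes always go to the primary.
replicas = itertools.cycle(["replica-1", "replica-2", "replica-3"])

def route(query):
    if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
        return "primary"
    return next(replicas)

print(route("SELECT * FROM articles"))             # replica-1
print(route("INSERT INTO articles VALUES (...)"))  # primary
print(route("SELECT title FROM articles"))         # replica-2
```

Reads spread evenly across replicas while the primary stays quiet - which is exactly the shape you want when 99% of traffic is reads.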

The Reusable Rule

Every time you see a read-heavy system in any problem, your brain should immediately think:

Cache first. Read replicas second. Primary DB only for writes.

This pattern appears in news sites, social media feeds, product catalogs, Wikipedia - anywhere reads massively outnumber writes.

Key Takeaways

  1. Latency (speed of one request) and Throughput (volume of requests) are different problems requiring different solutions
  2. Always identify the bottleneck - your system is only as fast as its slowest part
  3. Vertical scaling (bigger machine) is fast but has limits; Horizontal scaling (more machines) is unlimited but requires stateless architecture
  4. Different parts of your system scale differently - apply the right strategy to each component
  5. In read-heavy systems: Cache first, Read replicas second, Primary DB only for writes

This is part of a system design fundamentals series. Next up: CAP Theorem and the trade-offs in distributed systems.