System Design Message Queues: Decoupling Services at Scale

Table of contents

Introduction
The Problem Queues Solve
What a Message Queue Actually Is
Why Decoupling Matters
Two Messaging Models
- Point to Point
- Pub/Sub (Publish Subscribe)
Kafka vs RabbitMQ
Key Concepts
Connections to Previous Lessons
Exercise: Instagram Notifications
Key Takeaways

Introduction

Without queues, services depend on each other to be alive at the same time. With queues, they only need to share a channel. This single shift explains how large systems stay resilient under load.

The Problem Queues Solve

A user clicks “Place Order.” The system needs to confirm the order, send an email, send an SMS, notify the warehouse, update inventory, calculate loyalty points, and log analytics.

Without a queue — synchronous:

Save to DB (50ms) → Send email (200ms) → Send SMS (150ms)
→ Notify warehouse (100ms) → Update inventory (80ms)
→ Calculate loyalty (120ms) → Log analytics (90ms)
→ Return "Order Confirmed"

Total: ~790ms — user waits for all of this

If the SMS service is down → the entire order fails.

With a queue — asynchronous:

Save to DB (50ms) → drop message in queue (5ms)
→ Return "Order Confirmed" ✅

Total: ~55ms

Meanwhile in background:
Queue → Email Service
Queue → SMS Service
Queue → Warehouse Service
Queue → Inventory Service
Queue → Loyalty Service
Queue → Analytics Service

The user gets instant confirmation, and everything else runs in the background. If the SMS service is down, its message stays in the queue and is delivered once the service recovers.
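The latency difference can be checked with a few lines of arithmetic. This is only a sketch using the step timings from the example above; the function names are made up for illustration, and real latencies vary from request to request.

```python
# Step latencies in milliseconds, taken from the order example above.
SYNC_STEPS_MS = {
    "save_to_db": 50,
    "send_email": 200,
    "send_sms": 150,
    "notify_warehouse": 100,
    "update_inventory": 80,
    "calculate_loyalty": 120,
    "log_analytics": 90,
}
ENQUEUE_MS = 5  # cost of dropping one message in the queue, per the example


def sync_latency_ms(steps):
    """User-facing latency when every step runs inline, one after another."""
    return sum(steps.values())


def async_latency_ms(steps, enqueue_ms):
    """User-facing latency when only the DB write and the enqueue run inline."""
    return steps["save_to_db"] + enqueue_ms


print(sync_latency_ms(SYNC_STEPS_MS))               # 790
print(async_latency_ms(SYNC_STEPS_MS, ENQUEUE_MS))  # 55
```

The background work still takes the same total time. The queue only moves it off the path the user is waiting on.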

What a Message Queue Actually Is

Producer → puts message in queue → Consumer picks it up

Producer: generates work
Queue: holds work until someone is ready
Consumer: processes the work

The producer and consumer never talk directly. They only know about the queue. This is decoupling.
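A minimal in-process sketch of the three roles, using Python's standard `queue.Queue` as a stand-in for a real broker. The names `order_queue`, `producer`, and `consumer` are illustrative assumptions, and a production queue would live in a separate process or cluster.

```python
import queue
import threading

order_queue = queue.Queue()  # the only thing both sides know about
processed = []               # records what the consumer handled


def producer(order_ids):
    """Generates work: puts one message per order on the queue."""
    for order_id in order_ids:
        order_queue.put({"order_id": order_id})


def consumer():
    """Processes work until it sees the None sentinel."""
    while True:
        message = order_queue.get()
        if message is None:
            break
        processed.append(message["order_id"])


worker = threading.Thread(target=consumer)
worker.start()
producer([101, 102, 103])
order_queue.put(None)  # sentinel: tell the consumer to stop
worker.join()
print(processed)  # [101, 102, 103]
```

Neither function calls the other. Both only touch `order_queue`, which is the decoupling the section describes.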

Why Decoupling Matters

Without a queue — tightly coupled:

Order Service → directly calls Email Service
             → directly calls SMS Service
             → directly calls Warehouse Service

If Email Service is down → Order Service breaks
If Warehouse is slow → Order Service slows
Adding a new service → must modify Order Service

With a queue — decoupled:

Order Service → Queue ← Email Service
                      ← SMS Service
                      ← Warehouse Service
                      ← Any new service added tomorrow

Email Service down → queue holds messages, retries later
Warehouse slow → queue absorbs the backlog
Adding new service → just subscribe to queue, nothing else changes

Each service lives and dies independently.

A queue converts a hard dependency into a soft dependency. Services no longer need each other alive at the same time.
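The soft dependency can be sketched with a consumer that is down while messages arrive. This is a toy model, assuming a hypothetical `EmailService` class and a `drain` helper; a real broker does the buffering and redelivery for you.

```python
from collections import deque


class EmailService:
    """Hypothetical consumer that can be switched between down and up."""

    def __init__(self):
        self.up = False
        self.sent = []

    def handle(self, message):
        if not self.up:
            raise ConnectionError("email service is down")
        self.sent.append(message)


def drain(backlog, service):
    """Deliver queued messages in order; stop at the first failure and keep the rest."""
    while backlog:
        try:
            service.handle(backlog[0])
        except ConnectionError:
            return False
        backlog.popleft()  # remove only after the message was handled
    return True


backlog = deque()
email = EmailService()

# The producer keeps accepting orders while the email service is down.
for order_id in (1, 2, 3):
    backlog.append(f"order-{order_id} confirmed")
    drain(backlog, email)

backlog_while_down = len(backlog)  # nothing was lost, nothing was sent yet

email.up = True  # the service recovers and works through the backlog
drain(backlog, email)
print(backlog_while_down, email.sent)
```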

Two Messaging Models

Point to Point

One producer. Each message is handed to exactly one consumer, so the same work is not done twice (barring redelivery after a failure).

Order placed → Queue → Worker 1 processes it
                     → Worker 2 processes next
                     → Worker 3 processes next

Use when: Exactly one thing should happen per event — payment processing, order fulfillment, video compression.
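A sketch of the point-to-point model with three competing workers sharing one queue. It uses the standard `queue.Queue` in a single process and ignores failures, so it only illustrates the "one consumer per message" behaviour; `run_workers` is a hypothetical helper, not a broker API.

```python
import queue
import threading


def run_workers(jobs, worker_count=3):
    """Let several workers compete for jobs; return which worker(s) handled each job."""
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    handled = {}  # job -> list of worker names that processed it
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = job_queue.get_nowait()  # each get removes the job for everyone else
            except queue.Empty:
                return
            with lock:
                handled.setdefault(job, []).append(name)

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(1, worker_count + 1)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


result = run_workers([f"order-{n}" for n in range(1, 10)])
print(all(len(workers) == 1 for workers in result.values()))  # True
```

Which worker gets which order is not deterministic. What holds is that every order is handled by exactly one of them.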

Pub/Sub (Publish Subscribe)

One producer. Multiple independent consumers each receive every message.

Order placed → Topic → Email Service receives it
                     → SMS Service receives it
                     → Analytics Service receives it
                     → Warehouse Service receives it

Use when: Multiple services need to react to the same event independently.
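The fan-out can be sketched with a tiny in-memory topic where every subscriber gets its own inbox. The `Topic` class and its methods are assumptions made for this example and do not mirror any particular broker's client API.

```python
class Topic:
    """Toy pub/sub topic: every subscriber receives its own copy of each event."""

    def __init__(self):
        self.subscribers = {}  # subscriber name -> inbox (list of events)

    def subscribe(self, name):
        self.subscribers[name] = []
        return self.subscribers[name]

    def publish(self, event):
        for inbox in self.subscribers.values():
            inbox.append(event)


orders = Topic()
email_inbox = orders.subscribe("email")
sms_inbox = orders.subscribe("sms")
analytics_inbox = orders.subscribe("analytics")

orders.publish({"type": "order_placed", "order_id": 42})

print(email_inbox == sms_inbox == analytics_inbox)  # True
```

Adding a fourth service is one more `subscribe` call. The publisher's code does not change.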

Kafka vs RabbitMQ

RabbitMQ — Smart Broker

The broker handles routing, retries, and delivery guarantees. Messages are deleted once a consumer acknowledges them.

Push-based — broker pushes to consumers
Messages deleted after acknowledgement
Complex routing rules built in
Throughput: up to hundreds of thousands of messages/sec, depending on cluster size and configuration

Best for: Task queues, job processing, background jobs, scheduled work.

Kafka — Distributed Log

Kafka is not really a queue — it’s a distributed log. Messages are written to a log and stay there. Consumers read from wherever they left off.

Producer → Kafka Log → Consumer A reads from position 100
                     → Consumer B reads from position 847
                     → Consumer C replays from position 1

Messages are NOT deleted after consumption. They stay for a configured retention period.

Pull-based — consumers pull at their own pace
Messages retained after consumption — replay is possible
Ordering guaranteed within a partition
Throughput: millions of messages/sec on a suitably sized cluster

Best for: Event streaming, analytics pipelines, audit logs, anything needing replay.
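The log model can be sketched as an append-only list plus one stored position per consumer. This is a simplified single-partition illustration. The `EventLog` class and its `poll` and `seek` methods are hypothetical names for this example, not the Kafka client API, and retention is left out.

```python
class EventLog:
    """Toy append-only log with an independent read position per consumer."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next position to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, consumer, max_records=10):
        """Return the next batch for this consumer; records stay in the log."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer's position, for example back to 0 to replay history."""
        self._offsets[consumer] = offset


log = EventLog()
for n in range(5):
    log.append(f"event-{n}")

first_a = log.poll("A", max_records=3)  # A reads the first three events
all_b = log.poll("B")                   # B reads at its own pace, unaffected by A
rest_a = log.poll("A")                  # A continues where it left off
log.seek("A", 0)
replay_a = log.poll("A")                # A replays everything from the start
print(first_a, rest_a, len(replay_a))
```

Reading never removes anything, which is why a new consumer added later can still process old events.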

The Decision Rule

"One task, done once, by one worker"
→ RabbitMQ

"Many systems react to the same event, replay might be needed"
→ Kafka

Key Concepts

Acknowledgement

The consumer tells the queue “I processed this successfully.” If no acknowledgement arrives, the queue redelivers the message.

Consumer picks up message
→ Processes it
→ Sends ACK ✅ → queue deletes message
→ Crashes before ACK ❌ → queue retries with another consumer
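The flow above can be sketched as a queue that keeps delivered messages "in flight" until they are acknowledged. The `AckQueue` class and its method names are assumptions for illustration. Real brokers detect a lost consumer through a closed connection or a timeout rather than an explicit call.

```python
from collections import deque


class AckQueue:
    """Toy queue that deletes a message only after the consumer acknowledges it."""

    def __init__(self):
        self._ready = deque()
        self._in_flight = {}  # delivery tag -> message awaiting ACK
        self._next_tag = 0

    def publish(self, message):
        self._ready.append(message)

    def deliver(self):
        """Hand out the next message; it stays in flight until acked."""
        if not self._ready:
            return None
        self._next_tag += 1
        message = self._ready.popleft()
        self._in_flight[self._next_tag] = message
        return self._next_tag, message

    def ack(self, tag):
        """Consumer finished successfully: the message is gone for good."""
        del self._in_flight[tag]

    def consumer_lost(self, tag):
        """Consumer died before ACK: put the message back for redelivery."""
        self._ready.appendleft(self._in_flight.pop(tag))

    def pending(self):
        return len(self._ready) + len(self._in_flight)


q = AckQueue()
q.publish("charge order-1")

tag, message = q.deliver()
q.consumer_lost(tag)                   # first consumer crashes before ACK
retry_tag, retry_message = q.deliver() # another consumer receives the same message
pending_before_ack = q.pending()
q.ack(retry_tag)                       # processed successfully this time
print(retry_message, pending_before_ack, q.pending())
```

Because a message can be delivered again after a crash, consumers are usually written so that processing the same message twice is harmless.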

Dead Letter Queue (DLQ)

A message that fails repeatedly is moved to the DLQ for investigation instead of blocking the queue or disappearing silently.

Message fails 3 times → moves to DLQ
                      → engineers investigate
                      → fix bug → replay from DLQ
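A sketch of the retry-then-DLQ policy, assuming a limit of three attempts as in the diagram. `process_with_dlq` and `flaky_handler` are hypothetical names. In practice the attempt limit and the dead-letter target are configured on the broker or in the consumer framework.

```python
from collections import deque

MAX_ATTEMPTS = 3  # matches "fails 3 times" in the diagram above


def process_with_dlq(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Run handler over messages; park messages that keep failing in a DLQ."""
    main_queue = deque((message, 0) for message in messages)
    done, dead_letters = [], []
    while main_queue:
        message, attempts = main_queue.popleft()
        try:
            handler(message)
            done.append(message)
        except Exception as error:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the failure reason so engineers can investigate later.
                dead_letters.append((message, str(error)))
            else:
                main_queue.append((message, attempts))  # retry behind the others
    return done, dead_letters


def flaky_handler(message):
    """Stand-in for a consumer with a bug that one message always triggers."""
    if message == "order-2":
        raise ValueError("malformed payload")


done, dead = process_with_dlq(["order-1", "order-2", "order-3"], flaky_handler)
print(done)  # ['order-1', 'order-3']
print(dead)  # [('order-2', 'malformed payload')]
```

The bad message does not block `order-3`, and it is not dropped either. Once the bug is fixed, the entries in `dead` can be fed back through the same function.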

Consumer Groups (Kafka)

Multiple consumers split partitions between them for parallel processing.

Kafka topic: 6 partitions
Consumer group: 3 consumers
→ Consumer 1 handles partitions 1-2
→ Consumer 2 handles partitions 3-4
→ Consumer 3 handles partitions 5-6

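Add more consumers to a group → more throughput, but only up to the number of partitions. With 6 partitions, a 7th consumer in the same group sits idle.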
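The split can be sketched as a function that hands out contiguous partition ranges. This is a simplified illustration and not Kafka's actual assignor code; `assign_partitions` is a made-up helper, and partitions are numbered from 0 here, as Kafka does.

```python
def assign_partitions(partition_count, consumers):
    """Spread partitions over consumers in contiguous ranges, as evenly as possible."""
    assignment = {}
    base, extra = divmod(partition_count, len(consumers))
    next_partition = 0
    for index, consumer in enumerate(consumers):
        # The first `extra` consumers take one additional partition each.
        size = base + (1 if index < extra else 0)
        assignment[consumer] = list(range(next_partition, next_partition + size))
        next_partition += size
    return assignment


three = assign_partitions(6, ["consumer-1", "consumer-2", "consumer-3"])
print(three)
# {'consumer-1': [0, 1], 'consumer-2': [2, 3], 'consumer-3': [4, 5]}

eight = assign_partitions(6, [f"consumer-{n}" for n in range(1, 9)])
idle = [name for name, partitions in eight.items() if not partitions]
print(idle)  # ['consumer-7', 'consumer-8']
```

Each partition has exactly one owner within the group, so consumers beyond the partition count get nothing to read. Scaling further means adding partitions, not just consumers.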

Connections to Previous Lessons

Flash sale from Lesson 6:

Price changes → Kafka topic
10M users connected via WebSocket ← subscribed to topic
Price update published once → all users notified
Zero polling. Zero DB flood.

Video processing from Lesson 1:

500 videos/hour → each upload drops message in RabbitMQ
→ Worker pool processes in parallel
→ Worker crashes → message retried automatically

Order placement with polyglot persistence:

Order saved to PostgreSQL (ACID guaranteed)
→ Event published to Kafka
→ Email, SMS, Inventory, Analytics all react independently

Exercise: Instagram Notifications

Events: photo likes, post comments, new follows, story mentions, weekly activity summary.

For each event: RabbitMQ or Kafka? Point-to-point or Pub/Sub? What happens when the notification service is down?

Reference answer:

| Event | Tool | Model | Reasoning |
| --- | --- | --- | --- |
| Likes | Kafka | Pub/Sub | Notifications + analytics + feed algorithm all react; extreme throughput |
| Comments | Kafka | Pub/Sub | Same: multiple consumers, high volume |
| Follows | Kafka | Pub/Sub | Notification + recommendations + feed all react |
| Story mentions | Kafka | Pub/Sub | Notifications + moderation + analytics |
| Weekly summary | RabbitMQ | Point-to-Point | One job, one email, scheduled, predictable volume |

If the notification service is down: messages stay in the queue until it recovers. The service restarts and processes the backlog. Messages that fail repeatedly go to the Dead Letter Queue for investigation.

Key Takeaways

Queues decouple services — producer and consumer don’t need to be alive simultaneously.
Point-to-point (RabbitMQ) for task processing; Pub/Sub (Kafka) for event-driven reactions.
Kafka retains messages — consumers can replay history. RabbitMQ deletes after ACK.
Acknowledgement + Dead Letter Queue = nothing is silently lost.
Social/event-driven systems are almost always Kafka. Background job systems are often RabbitMQ.

Part of the system design series. Next: CDN — how global apps serve content fast everywhere.