Payment Systems — 10k TPS, Database Crashes, and Duplicate Requests

system-design payments saga redis interview durability

These are three of the most important questions in payment system design. They separate candidates who understand distributed systems from those who only know the patterns.


Question 1 — Can the System Handle 10,000 Transactions/Second?

Each transaction involves:

→ Validate payment details
→ Check account balance
→ Debit sender account
→ Credit receiver account
→ Record transaction log
→ Send confirmation

All of these steps must run with ACID guarantees.
Cannot lose a single transaction.
Cannot double charge.
Cannot credit without debiting.

Where It Breaks

Single PostgreSQL node: ~5,000–10,000 simple queries/second (a rough order of magnitude; real numbers depend on hardware, indexes, and query mix)
Each transaction = 4–6 queries
Effective capacity: ~1,000–2,000 TPS

10,000 TPS overwhelms a single node ❌
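
The estimate above is plain division. A minimal sketch of the arithmetic, using the rough figures quoted here (they are order-of-magnitude assumptions, not benchmarks); the exact bounds depend on which end of each range you pick:

```python
# Back-of-envelope capacity estimate using the rough figures above.
def effective_tps(queries_per_second: int, queries_per_transaction: int) -> float:
    """Transactions per second a node sustains if every query costs the same."""
    return queries_per_second / queries_per_transaction

# Pessimistic case: slower node, 6 queries per transaction.
low = effective_tps(5_000, 6)
# Optimistic case: faster node, 4 queries per transaction.
high = effective_tps(10_000, 4)

print(f"~{low:.0f} to {high:.0f} TPS per node")
```

Even the optimistic case is well short of 10,000 TPS on one node.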

The Fix — Tiered Architecture

Layer 1 — Redis validation (before DB is touched)

→ Account exists? (Redis cache)
→ Card valid? (Redis cache)
→ Daily limit exceeded? (Redis counter)
→ Rate limit? (Redis)

~80% of validation in Redis → sub-millisecond
Invalid requests filtered before reaching DB ✅
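
As a rough illustration of this filtering layer, the sketch below uses plain Python sets and dicts as stand-ins for the Redis cache and counters. The account IDs, the limit value, and the function name `prevalidate` are all hypothetical.

```python
# In-memory stand-ins for Redis structures (names and limits are assumptions).
known_accounts = {"acc-1", "acc-2"}      # would be a Redis cache of valid accounts
daily_spend = {"acc-1": 90_000}          # would be a Redis counter per account
DAILY_LIMIT = 100_000

def prevalidate(account_id: str, amount: int) -> tuple[bool, str]:
    """Cheap checks that run before any database query is issued."""
    if account_id not in known_accounts:
        return False, "unknown account"
    if daily_spend.get(account_id, 0) + amount > DAILY_LIMIT:
        return False, "daily limit exceeded"
    return True, "ok"

print(prevalidate("acc-1", 20_000))
print(prevalidate("acc-2", 20_000))
```

Only requests that return `(True, "ok")` would go on to the database.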

Layer 2 — Horizontal sharding

Shard accounts by account_id:
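Shard 1 → accounts 1–25,000,000
Shard 2 → accounts 25,000,001–50,000,000
Shard 3 → accounts 50,000,001–75,000,000
Shard 4 → accounts 75,000,001–100,000,000

10,000 TPS / 4 shards = 2,500 TPS per shard ✅ (assuming load spreads evenly across shards)
That is still above the ~1,000–2,000 TPS single-node estimate, so in practice add shards for headroom and rely on Layer 1 to filter invalid requests first.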
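
A minimal sketch of range-based routing for the four-shard layout, assuming integer account IDs from 1 to 100,000,000. The function names are illustrative only.

```python
# Range-based shard routing for a four-shard layout.
# Assumption: account IDs are integers from 1 to 100,000,000.
ACCOUNTS_PER_SHARD = 25_000_000
SHARD_COUNT = 4

def shard_for(account_id: int) -> int:
    """Return the 1-based shard number that owns this account."""
    if not 1 <= account_id <= ACCOUNTS_PER_SHARD * SHARD_COUNT:
        raise ValueError(f"account_id out of range: {account_id}")
    # Subtract 1 so that account 25,000,000 stays on shard 1.
    return (account_id - 1) // ACCOUNTS_PER_SHARD + 1

def is_cross_shard(sender_id: int, receiver_id: int) -> bool:
    """True when the debit and the credit land on different shards."""
    return shard_for(sender_id) != shard_for(receiver_id)
```

A payment where `is_cross_shard` is true cannot be a single local transaction, which is what the Saga discussion under Question 2 addresses.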

Layer 3 — Connection pooling

PgBouncer sits between app servers and DB
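10,000 transactions per second
don't need 10,000 DB connections, because each transaction holds a connection only briefly
If a transaction holds a connection for about 50 ms, a pool of 500 connections handles 10,000 TPS ✅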
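
One way to sanity-check the pool size is Little's law: connections in use = arrival rate × time each transaction holds a connection. The 50 ms hold time below is an assumed figure for illustration, not a measurement.

```python
import math

def connections_needed(tps: int, avg_transaction_ms: int) -> int:
    """Little's law: in-flight transactions = arrival rate * time in system."""
    return math.ceil(tps * avg_transaction_ms / 1000)

# Assumption: each transaction holds its connection for about 50 ms.
print(connections_needed(10_000, 50))
```

If the average hold time doubles to 100 ms, the same formula calls for twice the pool, so the pool size should be revisited whenever latency changes.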

Layer 4 — Async non-critical work via Kafka

Critical path (synchronous):
→ Debit sender ✅
→ Credit receiver ✅
→ Return confirmation ✅  (<100ms)

Async via Kafka:
→ Email receipt
→ Fraud scoring
→ Analytics
→ Loyalty points
→ Push notification
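
The split between the synchronous critical path and async follow-up work can be sketched as below. A `queue.Queue` stands in for a Kafka topic, and the account names and event shape are hypothetical.

```python
import queue

# Stand-in for a Kafka topic; a real system would use a Kafka producer here.
event_queue = queue.Queue()

balances = {"alice": 500, "bob": 100}  # hypothetical accounts

ASYNC_EVENTS = ("email_receipt", "fraud_scoring", "analytics",
                "loyalty_points", "push_notification")

def pay(sender: str, receiver: str, amount: int) -> dict:
    """Run the critical path synchronously, then enqueue follow-up work."""
    if balances[sender] < amount:
        return {"status": "failed", "reason": "insufficient funds"}
    # Critical path: must finish before the user gets a response.
    balances[sender] -= amount
    balances[receiver] += amount
    # Non-critical work: published as events and handled by consumers later.
    for kind in ASYNC_EVENTS:
        event_queue.put({"type": kind, "sender": sender,
                         "receiver": receiver, "amount": amount})
    return {"status": "success"}

result = pay("alice", "bob", 200)
```

The user's response depends only on the debit and credit; a slow email service delays the receipt, not the payment.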

Question 2 — Database Crashes Mid-Transaction

The Scary Scenarios

Scenario A:
→ Debit sender ✅
→ DB crashes
→ Credit receiver ❌ never happens
→ Money vanished ❌

Scenario B:
→ Debit sender ✅
→ Credit receiver ✅
→ DB crashes before COMMIT
→ Both rolled back — user thinks payment went through ❌

Scenario C:
→ Transaction committed
→ DB crashes before writing to disk
→ Was it committed or not? Nobody knows ❌

Solution 1 — Write-Ahead Log (WAL)

PostgreSQL’s core protection:

Before any data changes:
→ Write intended change to WAL (sequential disk write, fast)
→ Only then apply change to actual data

DB crashes mid-transaction:
→ Restart → read WAL → finds no commit record for the transaction
→ Its changes are discarded; the database returns to the last consistent state ✅

DB crashes after commit:
→ WAL shows transaction was committed
→ PostgreSQL replays WAL on restart
→ Transaction restored ✅

WAL means the database always knows exactly what happened and can recover to a consistent state.
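
The recovery rule can be shown with a toy log. This is a heavily simplified model (a Python list instead of an on-disk log, whole-value records, and a record layout invented for illustration), not PostgreSQL's actual WAL format:

```python
# Toy write-ahead log: a list of records stands in for the on-disk log.
def recover(wal: list[dict]) -> dict[str, int]:
    """Rebuild account balances by replaying only committed transactions."""
    committed = {rec["tx"] for rec in wal if rec["op"] == "commit"}
    balances: dict[str, int] = {}
    for rec in wal:
        if rec["op"] == "set" and rec["tx"] in committed:
            balances[rec["account"]] = rec["value"]
    return balances

wal = [
    {"tx": 1, "op": "set", "account": "alice", "value": 500},
    {"tx": 1, "op": "set", "account": "bob", "value": 100},
    {"tx": 1, "op": "commit"},
    # Transaction 2 debited alice, then the database crashed before commit.
    {"tx": 2, "op": "set", "account": "alice", "value": 300},
]

print(recover(wal))  # {'alice': 500, 'bob': 100}
```

Transaction 2 has no commit record, so its debit never becomes visible after recovery.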

Solution 2 — Saga Pattern

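For cross-shard payments, a local ACID transaction isn’t enough: a single PostgreSQL transaction can’t span two databases. Two-phase commit can coordinate them, but it holds locks while waiting on every participant, so a Saga is a common alternative.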

A Saga breaks one big transaction into smaller steps, each with a compensating action that undoes it if something fails later.

Step 1: Debit sender (Shard 1)
  → Fail → stop, show error

Step 2: Credit receiver (Shard 2)
  → Fail → COMPENSATE Step 1: refund sender ✅

Step 3: Record transaction log
  → Fail → COMPENSATE Steps 1 & 2 → both accounts restored ✅

Every step has a compensating action:

Action           | Compensation
Debit sender     | Refund sender
Credit receiver  | Debit receiver back
Send receipt     | Send correction email

If DB crashes mid-saga:

System restarts
→ Reads saga state from persistent log
→ Knows exactly which step failed
→ Executes compensating actions for completed steps
→ Everything rolled back cleanly ✅
→ No money lost. No inconsistent state.
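
A minimal sketch of a saga orchestrator follows. It keeps its log in memory for brevity; the recovery described above only works if that log is written to durable storage before each step runs. The step names and the simulated shard failure are illustrative.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            for done_name, _, compensate in reversed(done):
                compensate()
                log.append(f"{done_name}: compensated")
            return False, log
        done.append(step)
        log.append(f"{name}: ok")
    return True, log

balances = {"sender": 500, "receiver": 100}

def debit() -> None: balances["sender"] -= 200
def refund() -> None: balances["sender"] += 200
def credit() -> None: raise RuntimeError("shard 2 unavailable")  # simulated failure
def uncredit() -> None: balances["receiver"] -= 200

ok, log = run_saga([("debit sender", debit, refund),
                    ("credit receiver", credit, uncredit)])
```

Here the credit fails, so the orchestrator refunds the sender and both balances end where they started.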

Solution 3 — Synchronous Replication + Automatic Failover

Primary DB handles all writes
Replica — synchronous replication
→ Transaction only confirmed when BOTH primary AND replica have written it

Primary crashes:
→ Replica promoted automatically
→ Failover: 30–60 seconds
→ Zero data loss for confirmed transactions ✅
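
The confirmation rule can be illustrated with a toy model in which two in-memory objects stand in for the primary and the replica. The `Node` class and `commit` function are hypothetical; real replication is configured in the database, not written in application code.

```python
class Node:
    """Toy database node that stores records in a list."""
    def __init__(self) -> None:
        self.log: list[str] = []
        self.up = True

    def write(self, record: str) -> bool:
        if not self.up:
            return False
        self.log.append(record)
        return True

def commit(primary: Node, replica: Node, record: str) -> bool:
    """Confirm to the client only if both nodes have stored the record."""
    return primary.write(record) and replica.write(record)

primary, replica = Node(), Node()
confirmed = commit(primary, replica, "tx1")

# Primary crashes; every confirmed record is already on the replica.
primary.up = False
```

Because a transaction is confirmed only after the replica has it, promoting the replica cannot lose a confirmed payment.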

Question 3 — Same Request Hits Server Twice (Network Retry)

The Scenario

User clicks Pay ₹10,000
Request 1 → processes → ₹10,000 debited ✅ → response sent

Network drops before response reaches user
User sees spinner → clicks Pay again (or app auto-retries)

Request 2 → looks like a new transaction → ₹10,000 debited again ❌
User charged twice.

This happens constantly in production — network timeouts, mobile drops, load balancer retries.

Solution — Idempotency Keys

Industry standard. Used by Stripe, Razorpay, PayPal, and most major payment processors.

Client generates a unique key before sending:

idempotencyKey = UUID()  →  "a3f9b2c1-4d5e-6f7a-8b9c"

POST /api/payment
{
  "amount": 10000,
  "to": "receiver123",
  "idempotencyKey": "a3f9b2c1-4d5e-6f7a-8b9c"
}

Server logic:

Request arrives with idempotencyKey

Check Redis: "idempotency:a3f9b2c1-4d5e-6f7a-8b9c" exists?

→ No (first time):
   Process transaction
   Store result in Redis:
     Key: "idempotency:a3f9b2c1..."
     Value: { status: "success", transactionId: "t789" }
     TTL: 24 hours
   Return result ✅

→ Yes (duplicate):
   Don't process again
   Return SAME stored result
   No double charge ✅
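
The server logic above can be sketched as follows, with an in-memory class standing in for Redis keys that carry a TTL. `IdempotencyStore`, `handle_payment`, and the `charges` list are hypothetical names used only for this illustration.

```python
import time
from typing import Optional

class IdempotencyStore:
    """In-memory stand-in for Redis keys with a TTL."""
    def __init__(self) -> None:
        self._data: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._data.get(key)
        if entry is None or entry[0] <= time.monotonic():
            return None  # missing or expired
        return entry[1]

    def set(self, key: str, value: dict, ttl_seconds: float) -> None:
        self._data[key] = (time.monotonic() + ttl_seconds, value)

store = IdempotencyStore()
charges: list[int] = []  # records every real debit, for illustration

def handle_payment(idempotency_key: str, amount: int) -> dict:
    redis_key = f"idempotency:{idempotency_key}"
    cached = store.get(redis_key)
    if cached is not None:
        return cached  # duplicate: return the stored result, no second charge
    charges.append(amount)  # first time: process the transaction
    result = {"status": "success", "transactionId": f"t{len(charges)}"}
    store.set(redis_key, result, ttl_seconds=24 * 60 * 60)
    return result

first = handle_payment("a3f9b2c1", 10000)
retry = handle_payment("a3f9b2c1", 10000)
```

The retry gets the same response as the first call, and only one charge is recorded. This check-then-store version is not safe when two copies of the request run at the same moment.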

Why the client must generate the key (not the server):

If server generates it → client must receive it first
Network drop before client receives key
→ Client retries without key → duplicate possible ❌

Client generates before sending:
→ Key exists regardless of network
→ Same key used for all retries
→ Server deduplicates correctly ✅
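
On the client side, the important detail is that the key is created once, outside the retry loop. A minimal sketch, where `send` is a hypothetical transport function that raises `TimeoutError` when the response is lost:

```python
import uuid

def pay_with_retries(send, amount: int, to: str, max_attempts: int = 3) -> dict:
    """Generate the key once, then reuse it on every retry."""
    idempotency_key = str(uuid.uuid4())  # created before the first attempt
    body = {"amount": amount, "to": to, "idempotencyKey": idempotency_key}
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(body)
        except TimeoutError as exc:
            last_error = exc  # response lost; retry with the same body and key
    raise last_error

# Simulated transport: the first two responses are lost, the third arrives.
seen_keys: list[str] = []

def flaky_send(body: dict) -> dict:
    seen_keys.append(body["idempotencyKey"])
    if len(seen_keys) < 3:
        raise TimeoutError("response lost")
    return {"status": "success"}

result = pay_with_retries(flaky_send, 10000, "receiver123")
```

All three attempts carry the same key, so the server can recognise the second and third as duplicates.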

Edge Case — Two Identical Requests Arrive Simultaneously

Request 1 arrives → starts processing → not yet done
Request 2 arrives → same key → Redis key not stored yet → thinks new request → double charge ❌

Fix — SET NX as a lock:

Request arrives:
SET "idempotency:key123" "processing" NX EX 30
→ NX: only set if key doesn't exist
→ EX 30: expires in 30 seconds

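SET succeeded → this request owns processing → proceed → overwrite "processing" with the final result (TTL: 24 hours)
SET failed    → another request is processing → wait → return stored result

Only one request processes at a time, provided processing finishes before the 30-second lock expires ✅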
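
The locking behaviour can be demonstrated without a Redis server by emulating the atomic check-and-set. `FakeRedis` below is a stand-in written for this sketch; with the redis-py client, the equivalent call would be `set(key, "processing", nx=True, ex=30)`. Expiry is omitted here for brevity.

```python
import threading

class FakeRedis:
    """In-memory stand-in that mimics the atomicity of Redis SET ... NX."""
    def __init__(self) -> None:
        self._data: dict[str, str] = {}
        self._mutex = threading.Lock()

    def set_nx(self, key: str, value: str) -> bool:
        # Atomic check-and-set, like SET key value NX (expiry omitted).
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

fake_redis = FakeRedis()
charges: list[str] = []   # one entry per real charge
outcomes: list[str] = []  # what each request ended up doing

def handle(key: str) -> None:
    if fake_redis.set_nx(f"idempotency:{key}", "processing"):
        charges.append(key)           # only the lock owner charges
        outcomes.append("processed")
    else:
        # A real server would wait here and return the stored result.
        outcomes.append("duplicate")

# Eight copies of the same request arrive at once.
threads = [threading.Thread(target=handle, args=("key123",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one thread wins the lock and charges; the other seven are treated as duplicates.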

Complete Payment Architecture

[Client]
Generate idempotencyKey

[Rate Limiter]
Max 5 attempts/user/minute

[App Server]
Check idempotency key in Redis
→ Duplicate → return cached result immediately
→ New → SET NX lock → proceed

[Redis Validation]
Account exists? Daily limit? Card valid?
Failures rejected here — DB never touched

[Saga Orchestrator]
Step 1: Debit sender (Shard 1)  → WAL first → record step
Step 2: Credit receiver (Shard 2) → WAL first → record step
Step 3: Record transaction log → store idempotency result in Redis

[Primary DB + Synchronous Replica]
Written to both before confirming
Primary crashes → replica promotes → zero data loss

[Kafka — async]
Email, push notification, analytics, fraud analysis, loyalty points

[Response to user: <200ms]

What Happens in Each Failure Scenario

Failure                        | What saves it
DB crashes mid-transaction     | WAL rollback + Saga compensating actions
Network retry sends duplicate  | Idempotency key → returns cached result
Primary DB goes down           | Replica promotes; in-flight tx roll back; client retries with same key → processed once on new primary
Saga step fails midway         | Compensating actions restore all accounts

The Three Non-Negotiable Principles

Everything in payment systems reduces to these three:

Atomicity — all steps complete or none do → ACID + Saga pattern

Idempotency — same request processed exactly once, regardless of retries → Client-generated idempotency keys + Redis SET NX

Durability — committed transactions survive any failure → WAL + synchronous replication

Everything else — performance, scale, features — is secondary to getting these three right.