Payment Systems — 10k TPS, Database Crashes, and Duplicate Requests

Three of the most important questions in payment system design. These separate candidates who understand distributed systems from those who just know patterns.

Question 1 — Can the System Handle 10,000 Transactions/Second?

Each transaction involves:

→ Validate payment details
→ Check account balance
→ Debit sender account
→ Credit receiver account
→ Record transaction log
→ Send confirmation

The money-moving steps (debit, credit, transaction log) must be ACID-guaranteed.
Cannot lose a single transaction.
Cannot double charge.
Cannot credit without debiting.

Where It Breaks

Single PostgreSQL node: ~5,000–10,000 simple queries/second
Each transaction = 4–6 queries
Effective capacity: roughly 1,000–2,000 TPS (query budget / queries per transaction)

10,000 TPS overwhelms a single node ❌
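
The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch that uses the ballpark figures quoted in this section and assumes every query costs the same; real throughput depends on hardware, indexes, and lock contention.

```python
# Back-of-envelope capacity estimate using the figures quoted above.
QUERIES_PER_SEC_LOW, QUERIES_PER_SEC_HIGH = 5_000, 10_000   # single-node query budget
QUERIES_PER_TX_LOW, QUERIES_PER_TX_HIGH = 4, 6              # queries per payment

def effective_tps(queries_per_sec: int, queries_per_tx: int) -> float:
    """Transactions/second a node sustains if every query costs the same."""
    return queries_per_sec / queries_per_tx

worst_case = effective_tps(QUERIES_PER_SEC_LOW, QUERIES_PER_TX_HIGH)   # slow node, heavy tx
best_case = effective_tps(QUERIES_PER_SEC_HIGH, QUERIES_PER_TX_LOW)    # fast node, light tx

print(f"worst case: {worst_case:.0f} TPS")
print(f"best case:  {best_case:.0f} TPS")
print(f"a 10,000 TPS target is {10_000 / best_case:.0f}x the best case")
```

The range works out to roughly 830–2,500 TPS, which brackets the ~1,000–2,000 figure and leaves the 10,000 TPS target several times out of reach for one node.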

The Fix — Tiered Architecture

Layer 1 — Redis validation (before DB is touched)

→ Account exists? (Redis cache)
→ Card valid? (Redis cache)
→ Daily limit exceeded? (Redis counter)
→ Rate limit? (Redis)

~80% of validation in Redis → sub-millisecond
Invalid requests filtered before reaching DB ✅
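
A minimal sketch of the pre-validation layer. `FakeCache`, `DAILY_LIMIT`, and `prevalidate` are hypothetical names introduced for illustration, and the cache is an in-memory stand-in so the example runs without a Redis server. In a real deployment the daily-limit check would use an atomic Redis counter rather than a read followed by a compare, and the database stays the authority on the actual balance.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FakeCache:
    """In-memory stand-in for Redis, just enough for this sketch (hypothetical)."""
    accounts: set = field(default_factory=set)
    valid_cards: set = field(default_factory=set)
    spent_today: dict = field(default_factory=dict)   # account_id -> amount spent today

DAILY_LIMIT = 100_000  # assumed limit, in the smallest currency unit

def prevalidate(cache: FakeCache, account_id: str, card_id: str, amount: int) -> Optional[str]:
    """Return a rejection reason, or None if the request may proceed to the database."""
    if account_id not in cache.accounts:
        return "unknown account"
    if card_id not in cache.valid_cards:
        return "invalid card"
    if cache.spent_today.get(account_id, 0) + amount > DAILY_LIMIT:
        return "daily limit exceeded"
    return None
```

Any request that gets a rejection reason here is answered without opening a database connection.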

Layer 2 — Horizontal sharding

Shard accounts by account_id (upper bounds inclusive, so ranges don't overlap):
Shard 1 → accounts 1–25M
Shard 2 → accounts above 25M, up to 50M
Shard 3 → accounts above 50M, up to 75M
Shard 4 → accounts above 75M, up to 100M

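10,000 TPS / 4 shards = 2,500 TPS per shard (assuming even load)
That is still above the ~1,000–2,000 TPS single-node estimate, so 4 shards alone are not enough.
8 shards → 1,250 TPS per shard ✅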
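
Routing a request to its shard can be sketched as a lookup over the range boundaries. The function names are illustrative, and the sketch assumes each range's upper bound is inclusive (account 25,000,000 lives on Shard 1).

```python
from bisect import bisect_left

# Inclusive upper bound of each shard's account_id range (from the layout above).
SHARD_UPPER_BOUNDS = [25_000_000, 50_000_000, 75_000_000, 100_000_000]

def shard_for(account_id: int) -> int:
    """Return the 1-based shard number that owns account_id."""
    if not 1 <= account_id <= SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"account_id {account_id} is outside the sharded range")
    return bisect_left(SHARD_UPPER_BOUNDS, account_id) + 1

def is_cross_shard(sender_id: int, receiver_id: int) -> bool:
    """True when debit and credit land on different databases."""
    return shard_for(sender_id) != shard_for(receiver_id)
```

`is_cross_shard` matters later: a payment whose sender and receiver sit on different shards cannot be handled by one local database transaction.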

Layer 3 — Connection pooling

PgBouncer sits between app servers and DB
10,000 TPS doesn't need 10,000 DB connections:
each transaction holds a connection for only a few milliseconds
A pool of ~500 connections can serve 10,000 TPS ✅
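
Why a few hundred connections can be enough follows from Little's law: connections in use = arrival rate × time each transaction holds a connection. The sketch below assumes a hypothetical average of 20 ms per transaction and a 2x safety margin; both numbers are illustrative, not measurements.

```python
import math

def connections_needed(tps: int, avg_tx_ms: int, headroom: float = 2.0) -> int:
    """Little's law: in-flight transactions = arrival rate x time holding a connection."""
    in_flight = tps * avg_tx_ms / 1000     # transactions running at any instant
    return math.ceil(in_flight * headroom)

print(connections_needed(10_000, 20))   # assumed 20 ms per transaction, 2x headroom
print(connections_needed(10_000, 50))   # slower transactions need a bigger pool
```

At 20 ms per transaction only about 200 transactions are in flight at once, so a 500-connection pool has room to spare; if transactions slow down to 50 ms, the same pool has no margin left.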

Layer 4 — Async non-critical work via Kafka

Critical path (synchronous):
→ Debit sender ✅
→ Credit receiver ✅
→ Return confirmation ✅  (<100ms)

Async via Kafka:
→ Email receipt
→ Fraud scoring
→ Analytics
→ Loyalty points
→ Push notification
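
The split between the two paths might look like the sketch below. A `queue.Queue` stands in for a Kafka topic and a plain dict stands in for the ledger; both are assumptions made so the example runs on its own. In a real system the debit and credit would run inside a database transaction.

```python
import queue

events = queue.Queue()   # stand-in for a Kafka topic (assumption)

ASYNC_TASKS = ("email_receipt", "fraud_scoring", "analytics",
               "loyalty_points", "push_notification")

def pay(ledger: dict, sender: str, receiver: str, amount: int) -> dict:
    """Critical path only: move the money, enqueue everything else, return."""
    if ledger[sender] < amount:
        return {"status": "declined", "reason": "insufficient funds"}
    ledger[sender] -= amount      # debit sender
    ledger[receiver] += amount    # credit receiver
    for task in ASYNC_TASKS:      # consumers pick these up later, off the request path
        events.put({"task": task, "sender": sender,
                    "receiver": receiver, "amount": amount})
    return {"status": "success"}
```

The caller gets its confirmation as soon as the money has moved; a slow email provider or analytics pipeline cannot delay it.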

Question 2 — Database Crashes Mid-Transaction

The Scary Scenarios

Scenario A:
→ Debit sender ✅
→ DB crashes
→ Credit receiver ❌ never happens
→ Money vanished ❌

Scenario B:
→ Debit sender ✅
→ Credit receiver ✅
→ DB crashes before COMMIT
→ Both rolled back; the client never got a confirmation and can't tell whether the payment went through ❌

Scenario C:
→ Transaction committed
→ DB crashes before writing to disk
→ Was it committed or not? Nobody knows ❌

Solution 1 — Write-Ahead Log (WAL)

PostgreSQL’s core protection:

Before any data changes:
→ Write intended change to WAL (sequential disk write, fast)
→ Only then apply change to actual data

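DB crashes mid-transaction:
→ Restart → read WAL → no COMMIT record for that transaction
→ Its changes are discarded; DB returns to the last consistent state ✅
(Scenario A is covered as long as debit and credit run in the same transaction.)

DB crashes after commit:
→ With default settings, COMMIT is acknowledged only after its WAL record is flushed to disk
→ PostgreSQL replays WAL on restart
→ Transaction restored ✅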

WAL means the database always knows exactly what happened and can recover to a consistent state.
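
The recovery rule can be illustrated with a toy log. This is a simplified model for intuition only, not PostgreSQL's actual WAL format: each record is a tuple, and recovery applies a change only if its transaction also has a COMMIT record.

```python
def recover(wal: list) -> dict:
    """Rebuild balances from the log, applying only committed transactions."""
    committed = {rec[1] for rec in wal if rec[0] == "COMMIT"}
    balances = {}
    for rec in wal:
        if rec[0] == "SET" and rec[1] in committed:
            _, _tx_id, account, new_balance = rec
            balances[account] = new_balance
    return balances

wal = [
    ("SET", 1, "alice", 100), ("SET", 1, "bob", 0), ("COMMIT", 1),
    ("SET", 2, "alice", 70),   # debit was logged...
    # ...then the crash: no credit and no COMMIT for transaction 2
]

balances_after_crash = recover(wal)
print(balances_after_crash)
```

Transaction 2 never committed, so its debit is ignored on recovery and alice keeps her 100. Had the log also contained the credit and `("COMMIT", 2)`, replay would restore both changes.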

Solution 2 — Saga Pattern

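For cross-shard payments, ACID transactions alone aren’t enough: an ordinary ACID transaction is local to one database and can’t span two shards. (Distributed two-phase commit exists, but it is slower and can block if the coordinator fails.)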

A Saga breaks one big transaction into smaller steps, each with a compensating action that undoes it if something fails later.

Step 1: Debit sender (Shard 1)
  → Fail → stop, show error

Step 2: Credit receiver (Shard 2)
  → Fail → COMPENSATE Step 1: refund sender ✅

Step 3: Record transaction log
  → Fail → COMPENSATE Steps 1 & 2 → both accounts restored ✅

Every step has a compensating action:

Action	Compensation
Debit sender	Refund sender
Credit receiver	Debit receiver back
Send receipt	Send correction email

If DB crashes mid-saga:

System restarts
→ Reads saga state from persistent log
→ Knows exactly which step failed
→ Executes compensating actions for completed steps
→ Everything rolled back cleanly ✅
→ No money lost. No inconsistent state.
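
A minimal orchestrator for the steps above could look like this. It is a sketch under simplifying assumptions: the saga log is an in-memory list (a real one must be persisted before each step so it survives a crash), and the step functions are hypothetical placeholders for calls to each shard.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"compensated {prev_name}")
            return False
        done.append(step)
        log.append(f"{name} done")
    return True

balances = {"sender": 500, "receiver": 0}

def debit() -> None:
    balances["sender"] -= 100

def refund() -> None:
    balances["sender"] += 100

def credit_fails() -> None:
    raise RuntimeError("shard 2 unavailable")   # simulated failure

saga_log: List[str] = []
ok = run_saga([("debit sender", debit, refund),
               ("credit receiver", credit_fails, lambda: None)], saga_log)
```

After this run `ok` is False, the sender's balance is back at 500, and the log records the debit, the failed credit, and the refund in that order. Compensations run in reverse so the most recent change is undone first.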

Solution 3 — Synchronous Replication + Automatic Failover

Primary DB handles all writes
Replica — synchronous replication
→ Transaction only confirmed when BOTH primary AND replica have written it

Primary crashes:
→ Replica promoted automatically
→ Failover: 30–60 seconds
→ Zero loss of confirmed transactions ✅

Question 3 — Same Request Hits Server Twice (Network Retry)

The Scenario

User clicks Pay ₹10,000
Request 1 → processes → ₹10,000 debited ✅ → response sent

Network drops before response reaches user
User sees spinner → clicks Pay again (or app auto-retries)

Request 2 → looks like a new transaction → ₹10,000 debited again ❌
User charged twice.

This happens constantly in production — network timeouts, mobile drops, load balancer retries.

Solution — Idempotency Keys

Industry standard. Used by Stripe, Razorpay, PayPal, and most major payment processors.

Client generates a unique key before sending:

idempotencyKey = UUID()  →  "a3f9b2c1-4d5e-6f7a-8b9c"

POST /api/payment
{
  "amount": 10000,
  "to": "receiver123",
  "idempotencyKey": "a3f9b2c1-4d5e-6f7a-8b9c"
}

Server logic:

Request arrives with idempotencyKey

Check Redis: "idempotency:a3f9b2c1-4d5e-6f7a-8b9c" exists?

→ No (first time):
   Process transaction
   Store result in Redis:
     Key: "idempotency:a3f9b2c1..."
     Value: { status: "success", transactionId: "t789" }
     TTL: 24 hours
   Return result ✅

→ Yes (duplicate):
   Don't process again
   Return SAME stored result
   No double charge ✅
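
The server logic above, sketched with a dict standing in for Redis. `handle_payment` and the `charge` callback are hypothetical names, and the 24-hour TTL is omitted because a plain dict has no expiry. This version is only safe when requests with the same key arrive one after another, not at the same time.

```python
results = {}   # stand-in for Redis; a real store would also apply the 24-hour TTL

def handle_payment(idempotency_key: str, amount: int, charge) -> dict:
    """Process a payment once per key; replay the stored response for duplicates."""
    key = f"idempotency:{idempotency_key}"
    if key in results:                 # duplicate: return the stored result
        return results[key]
    transaction_id = charge(amount)    # first time: actually move the money
    results[key] = {"status": "success", "transactionId": transaction_id}
    return results[key]
```

Calling `handle_payment` twice with the same key invokes `charge` once and returns the identical response both times.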

Why the client must generate the key (not the server):

If server generates it → client must receive it first
Network drop before client receives key
→ Client retries without key → duplicate possible ❌

Client generates before sending:
→ Key exists regardless of network
→ Same key used for all retries
→ Server deduplicates correctly ✅
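
On the client side, the important detail is that the key is created once and reused for every attempt. A minimal sketch, where `send` is a hypothetical callable that performs the HTTP request and raises `ConnectionError` on a network failure:

```python
import uuid

def pay_with_retries(send, amount: int, to: str, max_attempts: int = 3) -> dict:
    """Send a payment, retrying on connection errors with the SAME idempotency key."""
    payload = {
        "amount": amount,
        "to": to,
        "idempotencyKey": str(uuid.uuid4()),   # generated once, before the first attempt
    }
    last_error = None
    for _ in range(max_attempts):              # assumes max_attempts >= 1
        try:
            return send(payload)
        except ConnectionError as exc:         # response lost or request never arrived
            last_error = exc
    raise last_error
```

Generating the key inside the retry loop would defeat the purpose: each attempt would look like a new payment to the server.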

Edge Case — Two Identical Requests Arrive Simultaneously

Request 1 arrives → starts processing → not yet done
Request 2 arrives → same key → Redis key not stored yet → thinks new request → double charge ❌

Fix — SET NX as a lock:

Request arrives:
SET "idempotency:key123" "processing" NX EX 30
→ NX: only set if key doesn't exist
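→ EX 30: expires in 30 seconds, so a crashed request can't block retries forever (the expiry must exceed the worst-case processing time)

SET succeeded → this request owns processing → proceed, then overwrite the key with the final result (TTL: 24 hours)
SET failed    → another request is already processing → wait → return stored result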

Only one request ever processes ✅
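
The claim-then-process flow can be sketched with a small in-memory store whose `set_nx` method mimics the atomic behaviour of `SET key value NX`. `InMemoryStore` and `handle_once` are hypothetical names, and key expiry is left out to keep the example short, so a failed `charge` would leave the key stuck at "processing" here.

```python
import threading
import time

class InMemoryStore:
    """Tiny stand-in for Redis (hypothetical); set_nx mimics SET key value NX."""
    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def set_nx(self, key, value) -> bool:
        with self._mutex:              # check-and-set happens atomically
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def set(self, key, value) -> None:
        with self._mutex:
            self._data[key] = value

    def get(self, key):
        with self._mutex:
            return self._data.get(key)

def handle_once(store, key, charge, wait_seconds=5.0):
    """Run charge() at most once per key; duplicates wait for the stored result."""
    if store.set_nx(key, "processing"):
        result = {"status": "success", "transactionId": charge()}
        store.set(key, result)         # replace the lock marker with the final result
        return result
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        stored = store.get(key)
        if stored != "processing":
            return stored
        time.sleep(0.001)
    raise TimeoutError(f"{key} is still processing")
```

With the redis-py client the same claim is a single call, `set(key, "processing", nx=True, ex=30)`, which returns a truthy value only for the request that wins.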

Complete Payment Architecture

[Client]
Generate idempotencyKey
      ↓
[Rate Limiter]
Max 5 attempts/user/minute
      ↓
[App Server]
Check idempotency key in Redis
→ Duplicate → return cached result immediately
→ New → SET NX lock → proceed
      ↓
[Redis Validation]
Account exists? Daily limit? Card valid? Fraud score?
Failures rejected here — DB never touched
      ↓
[Saga Orchestrator]
Step 1: Debit sender (Shard 1)  → WAL first → record step
Step 2: Credit receiver (Shard 2) → WAL first → record step
Step 3: Commit → store idempotency result in Redis
      ↓
[Primary DB + Synchronous Replica]
Written to both before confirming
Primary crashes → replica promotes → zero data loss
      ↓
[Kafka — async]
Email, push notification, analytics, fraud analysis, loyalty points
      ↓
[Response to user: <200ms]

What Happens in Each Failure Scenario

Failure	What saves it
DB crashes mid-transaction	WAL rollback + Saga compensating actions
Network retry sends duplicate	Idempotency key → returns cached result
Primary DB goes down	Replica promotes; in-flight tx roll back; client retries with same key → processed once on new primary
Saga step fails midway	Compensating actions restore all accounts

The Three Non-Negotiable Principles

Everything in payment systems reduces to these three:

Atomicity — all steps complete or none do → ACID + Saga pattern

Idempotency — same request processed exactly once, regardless of retries → Client-generated idempotency keys + Redis SET NX

Durability — committed transactions survive any failure → WAL + synchronous replication

Everything else — performance, scale, features — is secondary to getting these three right.