Case Study

Payment System

Distributed payment backbone with event-driven orchestration, real-time fraud scoring, and multi-layer financial analytics across the full transaction lifecycle.

Kafka Spark Airflow dbt BigQuery PostgreSQL Redis GCP XGBoost Docker

A payments pipeline,
built for failure.

2025 FINTECH · DATA ENGINEERING KAFKA · SPARK · XGBOOST · GCP

Event-driven payment pipeline — idempotent ingestion, real-time fraud scoring, double-entry ledger

A production-grade payment backend built on Kafka, Spark Structured Streaming, and XGBoost — designed around Stripe's retry semantics, exactly-once ledger writes, and crash-safe consumer recovery.

~9ms
p99 XGBoost fraud scoring — 10× under SLA
7
Kafka topics across pipeline
10
dbt models (staging → marts)
3
Cloud Run services on GCP

Stripe retries every failed webhook (up to 3 days in live mode). Naive handlers double-charge.

Payment backends face three hard constraints simultaneously. First: Stripe's retry semantics mean a webhook can arrive multiple times — a naive handler processes each one, double-charging customers and corrupting the ledger. Second: fraud scoring must happen before settlement, but adding ML inference to the hot path kills Stripe's 30-second webhook timeout. Third: the ledger must be exactly-once even when Kafka consumers crash mid-batch.

Building a system that handles Stripe's retry semantics, single-digit-millisecond fraud scoring, and crash-safe double-entry bookkeeping — simultaneously, without any of them compromising the others — is the engineering problem.

Decouple, gate at the edge, score asynchronously, enforce at commit.

  • The webhook receiver does three things synchronously: verify the HMAC signature, check Redis for the event ID via atomic SET NX with a 24h TTL matching Stripe's retry window, and publish to Kafka. If the event ID already exists in Redis, return 200 immediately — no reprocessing. Hot path returns before any downstream component sees the event.
  • Spark Structured Streaming computes 8 ML features asynchronously (tx_velocity_1m, amount_zscore, merchant_risk_score, device_switch_flag, etc.) and caches them in Redis. The scoring consumer reads those features and calls the XGBoost model at p99 ≈ 9 ms — 10× under the 100 ms SLA, and completely off the webhook acknowledgment path.
  • The ledger consumer writes a DEBIT + CREDIT pair to PostgreSQL in one transaction. A DEFERRABLE INITIALLY DEFERRED trigger enforces SUM(amount_cents) = 0 per transaction_id at commit time — both rows can land before the balance constraint fires.
  • Airflow reconciles the ledger nightly against the Stripe Settlements API, paginates through all events, and exports discrepancies to BigQuery. dbt builds 10 models on top — staging → dimensions → facts → marts for reconciliation, revenue, and chargeback reporting.

Why these tradeoffs — each came from a concrete failure mode, not preference.

  • Manual Kafka offset commit (enable.auto.commit=False). Auto-commit acknowledges messages before processing completes. A consumer crash after auto-commit but before the ledger write silently drops the event. Manual commit only after successful DB write makes the consumer crash-safe — Kafka replays from the last committed offset on restart.
  • Redis SET NX for idempotency at the edge, not in the DB. Checking for duplicates at the DB layer requires a round-trip on every request. Redis atomic SET NX is sub-millisecond and happens before Kafka publish — duplicates never enter the pipeline at all.
  • DEFERRABLE INITIALLY DEFERRED trigger for ledger balance. A non-deferred constraint fires after each INSERT — always fails because only one of DEBIT or CREDIT has landed. Deferring to commit time allows both rows in a single transaction before the balance check fires.
  • XGBoost binary (38KB) in version control. Eliminates runtime training dependencies. The model is regenerable from training scripts if retraining is needed, but serving doesn't require a training environment.
  • ML fallback is manual_review=True, not a default verdict. When Spark features aren't yet cached — they arrive ~10s post-event — the scoring consumer can't wait. Falling back to manual review is the correct conservative failure, not an error condition.

GCP deployment · Cloud Run + Cloud SQL + Cloud Memorystore · real Stripe test events.

~9ms
p99 XGBoost scoring — 10× under 100ms SLA
7
Kafka topics with DLQ coverage
10
dbt models across 4 layers
3
Cloud Run services in prod

What actually broke — and what I learned from fixing it.

  • Spark features arrive ~10s after the Kafka event. An early version of the scoring consumer queried Redis immediately and got a cache miss — then fell back to scoring with all-zero features, producing low risk scores for every transaction. The fix: fall back to manual_review=True on cache miss, not a default score. A false-clean verdict in a payments context is categorically worse than a false escalation.
  • The DEFERRABLE trigger only works when both rows are in the same transaction. Early versions opened one transaction per ledger row. The balance constraint fired after the first INSERT and always failed. The fix required wrapping both INSERTs in an explicit transaction with DEFERRABLE INITIALLY DEFERRED — the constraint evaluates only once, at commit.
  • A Kafka offset-commit bug shipped in three consumers at once. store_offsets() silently fails with KafkaError _INVALID_ARG unless enable.auto.offset.store is set to False — the commit errored on every single poll. Caught it when the logs filled with commit errors, then fixed it across the validation, scoring, and ledger consumers.

How transactions
actually flow.

payment-system · v1.0 hover any node →
webhook publish score ledger SET NX features nightly reconcile
Stripe Events
webhook · HMAC signed
Redis · idempotency
SET NX · 24h TTL
Airflow DAG
nightly reconciliation
webhook-service
FastAPI · Cloud Run
Kafka · 7 topics
enable.auto.commit=False
Spark Feature Engine
8 features → Redis
XGBoost Scoring
p99 ≈ 9ms
PostgreSQL · ledger
append-only · DEFERRABLE
BigQuery + dbt
10 models · 4 layers
ingest
store
serve
— hot path animated · cold path dashed

Send a payment.
Watch the pipeline.

payment-system / webhook → ledger

Each payment fires a Stripe-style webhook event. The pipeline verifies HMAC, checks Redis for duplicates, scores for fraud, and writes a balanced double-entry ledger row.

01 KAFKA · webhook.received
02 LEDGER · postgres
transactions processed0
fraud flagged0
duplicates blocked0
total volume$0.00
How it works: Stripe webhook → HMAC verify → Redis SET NX (idempotency) → Kafka publish. XGBoost scores the transaction. Ledger consumer writes DEBIT + CREDIT in one transaction. Duplicate event_ids are blocked at the Redis gate before entering Kafka.
CONNECT