Payment System

2025 FINTECH · DATA ENGINEERING KAFKA · SPARK · XGBOOST · GCP

Event-driven payment pipeline — idempotent ingestion, real-time fraud scoring, double-entry ledger

A production-grade payment backend built on Kafka, Spark Structured Streaming, and XGBoost — designed around Stripe's retry semantics, exactly-once ledger writes, and crash-safe consumer recovery.

~9ms

p99 XGBoost fraud scoring — 10× under SLA

Kafka topics across pipeline

dbt models (staging → marts)

Cloud Run services on GCP

Stripe retries every failed webhook (up to 3 days in live mode). Naive handlers double-charge.

Payment backends face three hard constraints simultaneously. First: Stripe's retry semantics mean a webhook can arrive multiple times — a naive handler processes each one, double-charging customers and corrupting the ledger. Second: fraud scoring must happen before settlement, but adding ML inference to the hot path kills Stripe's 30-second webhook timeout. Third: the ledger must be exactly-once even when Kafka consumers crash mid-batch.

Building a system that handles Stripe's retry semantics, single-digit-millisecond fraud scoring, and crash-safe double-entry bookkeeping — simultaneously, without any of them compromising the others — is the engineering problem.

Decouple, gate at the edge, score asynchronously, enforce at commit.

The webhook receiver does three things synchronously: verify the HMAC signature, check Redis for the event ID via atomic SET NX with a 24h TTL matching Stripe's retry window, and publish to Kafka. If the event ID already exists in Redis, return 200 immediately — no reprocessing. Hot path returns before any downstream component sees the event.
Spark Structured Streaming computes 8 ML features asynchronously (tx_velocity_1m, amount_zscore, merchant_risk_score, device_switch_flag, etc.) and caches them in Redis. The scoring consumer reads those features and calls the XGBoost model at p99 ≈ 9 ms — 10× under the 100 ms SLA, and completely off the webhook acknowledgment path.
The ledger consumer writes a DEBIT + CREDIT pair to PostgreSQL in one transaction. A DEFERRABLE INITIALLY DEFERRED trigger enforces SUM(amount_cents) = 0 per transaction_id at commit time — both rows can land before the balance constraint fires.
Airflow reconciles the ledger nightly against the Stripe Settlements API, paginates through all events, and exports discrepancies to BigQuery. dbt builds 10 models on top — staging → dimensions → facts → marts for reconciliation, revenue, and chargeback reporting.

Why these tradeoffs — each came from a concrete failure mode, not preference.

Manual Kafka offset commit (enable.auto.commit=False). Auto-commit acknowledges messages before processing completes. A consumer crash after auto-commit but before the ledger write silently drops the event. Manual commit only after successful DB write makes the consumer crash-safe — Kafka replays from the last committed offset on restart.
Redis SET NX for idempotency at the edge, not in the DB. Checking for duplicates at the DB layer requires a round-trip on every request. Redis atomic SET NX is sub-millisecond and happens before Kafka publish — duplicates never enter the pipeline at all.
DEFERRABLE INITIALLY DEFERRED trigger for ledger balance. A non-deferred constraint fires after each INSERT — always fails because only one of DEBIT or CREDIT has landed. Deferring to commit time allows both rows in a single transaction before the balance check fires.
XGBoost binary (38KB) in version control. Eliminates runtime training dependencies. The model is regenerable from training scripts if retraining is needed, but serving doesn't require a training environment.
ML fallback is manual_review=True, not a default verdict. When Spark features aren't yet cached — they arrive ~10s post-event — the scoring consumer can't wait. Falling back to manual review is the correct conservative failure, not an error condition.

GCP deployment · Cloud Run + Cloud SQL + Cloud Memorystore · real Stripe test events.

~9ms

p99 XGBoost scoring — 10× under 100ms SLA

Kafka topics with DLQ coverage

dbt models across 4 layers

Cloud Run services in prod

What actually broke — and what I learned from fixing it.

Spark features arrive ~10s after the Kafka event. An early version of the scoring consumer queried Redis immediately and got a cache miss — then fell back to scoring with all-zero features, producing low risk scores for every transaction. The fix: fall back to manual_review=True on cache miss, not a default score. A false-clean verdict in a payments context is categorically worse than a false escalation.
The DEFERRABLE trigger only works when both rows are in the same transaction. Early versions opened one transaction per ledger row. The balance constraint fired after the first INSERT and always failed. The fix required wrapping both INSERTs in an explicit transaction with DEFERRABLE INITIALLY DEFERRED — the constraint evaluates only once, at commit.
A Kafka offset-commit bug shipped in three consumers at once. store_offsets() silently fails with KafkaError _INVALID_ARG unless enable.auto.offset.store is set to False — the commit errored on every single poll. Caught it when the logs filled with commit errors, then fixed it across the validation, scoring, and ledger consumers.

A payments pipeline,
built for failure.

Event-driven payment pipeline — idempotent ingestion, real-time fraud scoring, double-entry ledger

Stripe retries every failed webhook (up to 3 days in live mode). Naive handlers double-charge.

Decouple, gate at the edge, score asynchronously, enforce at commit.

Why these tradeoffs — each came from a concrete failure mode, not preference.

GCP deployment · Cloud Run + Cloud SQL + Cloud Memorystore · real Stripe test events.

What actually broke — and what I learned from fixing it.

How transactions
actually flow.

Send a payment.
Watch the pipeline.

payment-system / webhook → ledger

Payment System

A payments pipeline,built for failure.

Event-driven payment pipeline — idempotent ingestion, real-time fraud scoring, double-entry ledger

Stripe retries every failed webhook (up to 3 days in live mode). Naive handlers double-charge.

Decouple, gate at the edge, score asynchronously, enforce at commit.

Why these tradeoffs — each came from a concrete failure mode, not preference.

GCP deployment · Cloud Run + Cloud SQL + Cloud Memorystore · real Stripe test events.

What actually broke — and what I learned from fixing it.

How transactionsactually flow.

Send a payment.Watch the pipeline.

payment-system / webhook → ledger

A payments pipeline,
built for failure.

How transactions
actually flow.

Send a payment.
Watch the pipeline.