Real-time streaming pipeline — Kafka → Spark → Cassandra with checkpoint-based recovery and live Grafana observability
A streaming backend built on Kafka, Spark Structured Streaming, and Cassandra — designed around burst ingestion, checkpoint-based micro-batch recovery, and full JMX → Prometheus → Grafana observability.
Batch pipelines are useless when you need to see a spike the moment it happens.
Event-driven systems generate data faster than batch pipelines can surface it: a dashboard that refreshes every 5 minutes cannot show a processing spike as it occurs. Ingestion speed is only part of the challenge. The topology also has to absorb variable-rate bursts without backpressure failures, resume from checkpointed offsets after a restart without losing or double-counting events, and stay observable end to end so failures are visible immediately.
The core engineering problem is orchestrating Kafka, Spark, and Cassandra into a coherent real-time pipeline with minimal operational overhead, with an Airflow DAG driving ingestion and a JMX → Prometheus → Grafana stack making the internals observable.
A single ingestion DAG, checkpointed Spark streaming, and Cassandra tuned for time-series — instrumented end to end.
- A single Airflow DAG fetches events from an external REST API and publishes them to a Kafka topic on a schedule — Airflow chosen over cron for retry, backfill, and per-run visibility. Retention is configured to allow full replay on consumer failure.
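- Spark Structured Streaming consumes from Kafka in micro-batches and writes to Cassandra via foreachBatch, a common pattern for a side-effecting sink. Offsets and state are checkpointed to persistent storage for every batch, so a restart resumes from the last committed Kafka offset instead of skipping events or re-reading the whole topic. foreachBatch on its own gives at-least-once delivery: a batch interrupted mid-write is re-run, which stays safe as long as the Cassandra write is an idempotent upsert on the primary key. Cassandra uses time-based partition keys tuned for the time-series read pattern.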
- Every layer exposes metrics: Kafka and Cassandra via JMX exporters, Spark via its metrics sink. Prometheus scrapes them and Grafana renders live panels — msgs/sec, bytes in/out, under-replicated partitions, Cassandra write latency and pending compactions. A Streamlit app surfaces the pipeline-level view: total events, average and peak events/sec, and replication health.
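The Spark leg of that flow can be sketched roughly as follows. This is a minimal PySpark outline under stated assumptions, not the project's actual job: the broker address, topic, keyspace, table, and event schema are placeholders, and it presumes the Kafka source and Spark Cassandra Connector packages are on the classpath. PySpark imports are deferred into the function so the module loads without a Spark installation.

```python
# All names below (broker, topic, keyspace, table, schema) are assumptions
# made for illustration; only the checkpoint root comes from the write-up.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "kafka:9092",
    "subscribe": "events",
    # Only consulted on the very first start, before a checkpoint exists.
    "startingOffsets": "earliest",
}
CASSANDRA_TARGET = {"keyspace": "pipeline", "table": "events_by_hour"}
# Must live on a persistent volume, otherwise offsets are lost on restart.
CHECKPOINT_DIR = "/opt/spark/checkpoints/events"


def write_batch(batch_df, batch_id):
    """Write one micro-batch to Cassandra.

    Spark may invoke this again with the same batch_id after a failure
    (at-least-once). Cassandra inserts are upserts on the primary key, so
    a replayed batch overwrites the same rows instead of duplicating them.
    """
    (
        batch_df.write.format("org.apache.spark.sql.cassandra")
        .options(**CASSANDRA_TARGET)
        .mode("append")
        .save()
    )


def start_query(spark):
    """Wire Kafka -> JSON parse -> foreachBatch sink and start the stream."""
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (
        StringType, StructField, StructType, TimestampType,
    )

    # Hypothetical event schema; the real payload is not described.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
    events = raw.select(
        from_json(col("value").cast("string"), schema).alias("e")
    ).select("e.*")

    return (
        events.writeStream.foreachBatch(write_batch)
        .option("checkpointLocation", CHECKPOINT_DIR)
        .start()
    )
```

Calling `start_query(spark).awaitTermination()` with an existing `SparkSession` would run the stream. The design point is that the checkpoint decides where a restart resumes, while the upsert semantics of the sink decide whether a replayed batch is harmless.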
Why these choices — each driven by a concrete failure mode or scale constraint.
- Cassandra for event storage. Write-heavy time-series workload with no complex joins — Cassandra's partition-key model gives linear write scale without the overhead of a relational engine. Time-based partition keys are tuned specifically for the time-series read pattern, keeping read latency flat as the dataset grows.
- Spark Structured Streaming over Kafka Streams. Kafka Streams is JVM-only and handles stateful aggregations over short windows cleanly, but Spark's Python API unlocks the broader ML and data engineering ecosystem. Spark also handles larger stateful windows more naturally and integrates directly with the existing Airflow orchestration layer for checkpoint management.
- Airflow for ingestion orchestration over custom cron. The ingestion job needs retry-on-failure, backfill after an outage, and per-run visibility — the external API returning an error shouldn't silently drop a window of events. Airflow gives that out of the box; cron gives a black box. It's one DAG today, but the model scales to multi-stage dependencies without a rewrite.
- Time-based Cassandra partition keys. A naive key design on a write-heavy time-series workload creates hotspots: every write keeps landing on the same partition, which also grows without bound. Time-bucketed partition keys (hourly or daily depending on throughput) roll writes over to a new partition each bucket, so successive buckets hash to different nodes and partition size stays capped without manual resharding.
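To make the time-bucketing idea concrete, here is a small sketch of how a bucketed partition key could be derived from an event timestamp. The function names and the composite `(source, bucket)` shape are illustrative assumptions; the write-up only states that buckets are hourly or daily.

```python
from datetime import datetime, timezone
from typing import Tuple


def time_bucket(ts: datetime, granularity: str = "hour") -> str:
    """Truncate a timezone-aware timestamp to its UTC hour or day bucket."""
    if ts.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them instead of guessing.
        raise ValueError("timestamp must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    if granularity == "hour":
        return ts.strftime("%Y-%m-%dT%H")
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    raise ValueError(f"unsupported granularity: {granularity!r}")


def partition_key(source: str, ts: datetime,
                  granularity: str = "hour") -> Tuple[str, str]:
    """Build a composite partition key.

    Pairing the bucket with a second component (here a hypothetical
    `source` id) is an assumption: it keeps all writes within one bucket
    from landing on a single partition.
    """
    return (source, time_bucket(ts, granularity))
```

For example, `partition_key("sensor-a", datetime(2024, 3, 5, 14, 37, tzinfo=timezone.utc))` yields `("sensor-a", "2024-03-05T14")`. In a table definition this would correspond to a partition key of `(source, bucket)` with the event timestamp as a clustering column, so reads for a time range touch a known, bounded set of partitions.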
Full Docker Compose stack · real Kafka topics · Grafana panels updating in near real time.
What broke — and what I fixed after a hard look at the failure modes.
- The Spark checkpoint was writing to /tmp, one restart away from correctness loss. An OOM kill or container restart wipes /tmp, and without a checkpoint Spark falls back to startingOffsets: latest drops in-flight events, earliest re-reads and duplicates everything. Either way, correctness is gone. The fix: point checkpointLocation at a persistent volume (/opt/spark/checkpoints) so offset state survives a restart. That durable checkpoint is the precondition for the exactly-once-on-restart claim; the claim still relies on the Cassandra write being idempotent, so that a re-run batch overwrites rows instead of duplicating them.
- Prometheus counters incremented before the Cassandra write was confirmed — so the dashboard lied. A partial or failed write still bumped the "processed" counter, making the observability layer report healthy throughput while data was silently dropping. The fix: increment metrics only after the write commits, so the panels reflect what actually landed in Cassandra.
- The producer set linger_ms for batching, then blocked on future.get() per message — negating it. Waiting on each send's future serialised the whole producer, so the buffer never filled and throughput stayed low. The fix (in the stress producer): large batches, low linger, and a thread pool of senders — which is how the pipeline reached ~2.8k msg/sec.
- The JMX exporter lumped COUNTER and GAUGE metrics under one pattern, and a Kafka query used a _total suffix that didn't match the exported name — so Grafana panels read empty. Nothing errored; the panels just silently showed nothing until the metric names lined up. The fix: separate the JMX patterns by metric type and align the PromQL to the actual exported metric names.
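The counter-ordering fix above comes down to a small pattern: attempt the write first, count only on success, and let failures propagate. The sketch below illustrates it with a stand-in `Counter` class and a caller-supplied `write` function, both hypothetical; a real setup would more likely use a Prometheus client counter and the actual Cassandra write.

```python
from typing import Callable, Iterable, List


class Counter:
    """Minimal stand-in for a metrics counter (illustrative only)."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount


events_written = Counter()   # rows confirmed by the sink
write_failures = Counter()   # batches whose write raised


def write_and_count(rows: Iterable[dict],
                    write: Callable[[List[dict]], None]) -> int:
    """Write a batch, then update metrics based on the outcome.

    `write` is assumed to raise if the sink rejects or times out. The
    exception is re-raised after counting the failure, so the caller
    (for example a foreachBatch handler) fails the batch and it is
    retried instead of being silently marked as done.
    """
    batch = list(rows)
    try:
        write(batch)
    except Exception:
        write_failures.inc()
        raise
    # Reached only after the write returned without error.
    events_written.inc(len(batch))
    return len(batch)
```

With this ordering, a failed write leaves `events_written` untouched and bumps `write_failures`, so a throughput panel and an error panel disagree visibly instead of both looking healthy. One caveat: if a batch is replayed after a crash that happened between the write and the increment, the counter can still drift slightly from the table contents.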