Case Study

Real-Time Data Streaming

End-to-end streaming pipeline ingesting events through Kafka, processing via Spark Structured Streaming, persisting to Cassandra, and surfacing live metrics through Prometheus + Grafana.

Apache Kafka Apache Spark Cassandra Airflow PostgreSQL Grafana Docker

Streaming pipeline, instrumented end to end.

2025 DATA ENGINEERING · STREAMING KAFKA · SPARK · CASSANDRA · AIRFLOW

Real-time streaming pipeline — Kafka → Spark → Cassandra with checkpoint-based recovery and live Grafana observability

A streaming backend built on Kafka, Spark Structured Streaming, and Cassandra — designed around burst ingestion, checkpoint-based micro-batch recovery, and full JMX → Prometheus → Grafana observability.

500K messages stress-tested through Kafka
2.8k msg/sec peak throughput (measured)
1 Airflow DAG: API → Kafka ingestion
0 data loss on Spark restart (persistent checkpoint)

Batch pipelines are useless when you need to see a spike the moment it happens.

Event-driven systems generate data faster than batch pipelines can surface it: a dashboard that refreshes every 5 minutes cannot show a processing spike as it occurs. The challenge isn't just ingestion speed. It's building a topology that absorbs variable-rate bursts without backpressure failures, recovers with exactly-once results across Spark micro-batches on restart, and stays observable end to end so a failure is visible as soon as it occurs.

The core engineering problem is orchestrating Kafka, Spark, and Cassandra into a coherent real-time pipeline with minimal operational overhead: an Airflow DAG drives ingestion, and a JMX → Prometheus → Grafana stack makes the internals observable.

A single ingestion DAG, checkpointed Spark streaming, and Cassandra tuned for time-series — instrumented end to end.

  • A single Airflow DAG fetches events from an external REST API and publishes them to a Kafka topic on a schedule — Airflow chosen over cron for retry, backfill, and per-run visibility. Retention is configured to allow full replay on consumer failure.
  • Spark Structured Streaming consumes from Kafka in micro-batches and writes to Cassandra via foreachBatch, the right pattern for a side-effecting sink. Offsets and state are checkpointed to persistent storage for each batch, so a restart resumes from the last committed Kafka offset rather than falling back to latest (dropping events) or earliest (replaying everything). Cassandra uses time-based partition keys tuned for the time-series read pattern.
  • Every layer exposes metrics: Kafka and Cassandra via JMX exporters, Spark via its metrics sink. Prometheus scrapes them and Grafana renders live panels — msgs/sec, bytes in/out, under-replicated partitions, Cassandra write latency and pending compactions. A Streamlit app surfaces the pipeline-level view: total events, average and peak events/sec, and replication health.
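The Kafka → Spark → Cassandra leg described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual job: the keyspace and table names, the selected columns, and the helper names (kafka_source_options, write_batch, start_stream) are assumptions, and it presumes the Kafka source and the Spark Cassandra Connector are on the Spark classpath. The SparkSession is passed in, so the module itself needs no imports beyond the standard library.

```python
from typing import Any, Dict

# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Options for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only consulted when the query starts with no checkpoint;
        # on later runs the checkpointed offsets take precedence.
        "startingOffsets": "earliest",
    }


def write_batch(batch_df: Any, batch_id: int) -> None:
    """foreachBatch handler: append one micro-batch to Cassandra."""
    (
        batch_df.write.format(CASSANDRA_FORMAT)
        # Hypothetical keyspace/table names, for illustration only.
        .options(keyspace="streaming", table="events")
        .mode("append")
        .save()
    )


def start_stream(spark: Any, bootstrap_servers: str, topic: str,
                 checkpoint_dir: str) -> Any:
    """Wire Kafka -> foreachBatch -> Cassandra with a persistent checkpoint."""
    events = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        # Kafka delivers key/value as bytes; cast them to strings.
        .selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS payload",
            "timestamp",
        )
    )
    return (
        events.writeStream.foreachBatch(write_batch)
        # Must point at storage that survives a container restart.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

A call might look like start_stream(spark, "kafka:9092", "user_created", "/opt/spark/checkpoints/user_created"), where the broker address and the per-topic subdirectory are placeholders. Because a Cassandra insert overwrites any existing row with the same primary key, re-running a micro-batch after a restart rewrites the same rows rather than adding new ones, provided the primary key is derived from the event itself.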

Why these choices — each driven by a concrete failure mode or scale constraint.

  • Cassandra for event storage. Write-heavy time-series workload with no complex joins — Cassandra's partition-key model gives linear write scale without the overhead of a relational engine. Time-based partition keys are tuned specifically for the time-series read pattern, keeping read latency flat as the dataset grows.
  • Spark Structured Streaming over Kafka Streams. Kafka Streams is JVM-only and handles stateful aggregations over short windows cleanly, but Spark's Python API unlocks the broader ML and data engineering ecosystem. Spark also handles larger stateful windows more naturally and integrates directly with the existing Airflow orchestration layer for checkpoint management.
  • Airflow for ingestion orchestration over custom cron. The ingestion job needs retry-on-failure, backfill after an outage, and per-run visibility — the external API returning an error shouldn't silently drop a window of events. Airflow gives that out of the box; cron gives a black box. It's one DAG today, but the model scales to multi-stage dependencies without a rewrite.
  • Time-based Cassandra partition keys. A naive key design on a write-heavy time-series workload creates hotspots — all writes land on the same partition until the TTL rotates. Time-bucketed partition keys (hourly or daily depending on throughput) distribute writes across the cluster without manual resharding.
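To make the time-bucketed key concrete, here is a small sketch of how an hourly or daily bucket label could be derived from an event timestamp. The function name time_bucket, the label formats, and the table layout in the comment are assumptions for illustration, not the project's actual schema.

```python
from datetime import datetime, timezone

# Hypothetical table this key would feed (names are illustrative only):
#   PRIMARY KEY ((bucket), event_time, event_id)

_FORMATS = {"hour": "%Y-%m-%dT%H", "day": "%Y-%m-%d"}


def time_bucket(event_time: datetime, granularity: str = "hour") -> str:
    """Return the UTC bucket label to use as the partition key."""
    if event_time.tzinfo is None:
        # Naive timestamps would silently be read as local time.
        raise ValueError("event_time must be timezone-aware")
    if granularity not in _FORMATS:
        raise ValueError(f"unsupported granularity: {granularity!r}")
    # Normalise to UTC so every producer agrees on bucket boundaries.
    return event_time.astimezone(timezone.utc).strftime(_FORMATS[granularity])
```

Switching between hourly and daily buckets is then a single argument, which fits the "hourly or daily depending on throughput" choice described above. A read for a time range only has to touch the buckets that overlap it.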

Full Docker Compose stack · real Kafka topics · Grafana panels updating in near real time.

What broke — and what I fixed after a hard look at the failure modes.

  • The Spark checkpoint was writing to /tmp — one restart away from correctness loss. An OOM kill or container restart wipes /tmp, so Spark would re-consume from latest (dropping in-flight events) or earliest (duplicating everything) — either way, correctness is gone. The fix: point checkpointLocation at a persistent volume (/opt/spark/checkpoints) so offset state survives a restart. That's what makes the exactly-once-on-restart claim actually true.
  • Prometheus counters incremented before the Cassandra write was confirmed — so the dashboard lied. A partial or failed write still bumped the "processed" counter, making the observability layer report healthy throughput while data was silently dropping. The fix: increment metrics only after the write commits, so the panels reflect what actually landed in Cassandra.
  • The producer set linger_ms for batching, then blocked on future.get() per message — negating it. Waiting on each send's future serialised the whole producer, so the buffer never filled and throughput stayed low. The fix (in the stress producer): large batches, low linger, and a thread pool of senders — which is how the pipeline reached ~2.8k msg/sec.
  • The JMX exporter lumped COUNTER and GAUGE metrics under one pattern, and a Kafka query used a _total suffix that didn't match the exported name — so Grafana panels read empty. Nothing errored; the panels just silently showed nothing until the metric names lined up. The fix: separate the JMX patterns by metric type and align the PromQL to the actual exported metric names.
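The producer issue can be illustrated with a sketch that queues every message and waits only once at the end, so batching has a chance to take effect. Note that this uses delivery callbacks rather than the thread pool of senders mentioned in the fix above. It is a simpler variant of the same idea, not the project's code. The helper name send_all and the stats dictionary are assumptions; the producer is expected to behave like kafka-python's KafkaProducer, whose send() returns a future that supports add_callback and add_errback.

```python
import json
from typing import Any, Dict, Iterable


def send_all(producer: Any, topic: str, events: Iterable[dict]) -> Dict[str, int]:
    """Queue every event without blocking per message, then flush once."""
    stats = {"acked": 0, "failed": 0}

    def on_ack(_metadata: Any) -> None:
        stats["acked"] += 1

    def on_error(_exc: BaseException) -> None:
        stats["failed"] += 1

    for event in events:
        future = producer.send(topic, value=json.dumps(event).encode("utf-8"))
        # Record the outcome asynchronously instead of calling future.get(),
        # which would force one network round trip per message.
        future.add_callback(on_ack)
        future.add_errback(on_error)

    # Single blocking point: wait for everything queued so far.
    producer.flush()
    return stats
```

With kafka-python this could be driven by something like KafkaProducer(bootstrap_servers="localhost:9092", linger_ms=5, batch_size=64 * 1024), where the address and tuning values are placeholders rather than the settings used in the stress test. Counting acknowledgements in the callback also fits the earlier fix of only reporting what was actually confirmed.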

How events actually flow.

[Architecture diagram: real-time-data-streaming · v1.0]

  • External API: REST event source
  • Airflow · 1 DAG: API → Kafka ingestion
  • Checkpoint Store: persistent · restart-safe
  • Kafka Topic: user_created · partitioned
  • Spark Streaming: 2.8k msg/sec · foreachBatch
  • Prometheus: JMX + Spark scrape
  • Cassandra · time-series: time-partitioned writes
  • Grafana dashboards: real-time observability
  • Streamlit dashboard: pipeline-level metrics

Edge labels: events consume stream Grafana ingest DAG checkpoint metrics query
Stages: ingest, store, serve

Produce events. Watch the pipeline.

real-time-data-streaming / kafka → cassandra

Events flow from the producer into Kafka, Spark picks them up in micro-batches, and processed records land in Cassandra. Spike load to see backpressure build and recover.

[Interactive demo: panel 01 KAFKA · topic partition and panel 02 CASSANDRA · time-series write, with live counters for events produced, Spark processed, consumer lag, and Cassandra writes]
How it works: the Airflow DAG fetches events and publishes them to a Kafka partition. Spark consumes each micro-batch, applies window aggregations, checkpoints state, and writes to Cassandra with a time-based partition key. Grafana reads from Cassandra in near real time.