System Design: Real-Time Data Pipeline

Design a real-time data pipeline that ingests 1M events/second, processes them, and serves analytics, with architecture diagram and trade-off discussion.

2026-06-05
#system-design #interview #kafka #spark #real-time

The question

Design a real-time data pipeline for an e-commerce platform that processes 1 million order events per second, enriches them with user data, and serves analytics dashboards with under 5 seconds of latency.


Clarifying questions to ask first

Before drawing anything, ask:

  • What's the read pattern? Dashboards (aggregated) or individual event lookup?
  • How long do we need to retain raw events? 7 days? 90 days? Forever?
  • Is exactly-once processing required? (Affects tech choices significantly)
  • What's the acceptable latency? Under 1s, under 5s, or near-real-time?
  • Any existing tech stack constraints? AWS, GCP, on-prem?

High-level architecture

Order Service
     │
     ▼
 Kafka (ingestion)          ← 1M events/sec, 12+ partitions
     │
     ├──▶ Flink (real-time) ──▶ Redis (live dashboards, under 1s)
     │         │
     │         └──▶ Postgres (enriched events, recent window)
     │
     └──▶ S3 / GCS (raw events, retention)
               │
               ▼
          Spark (batch)   ──▶ ClickHouse / BigQuery (historical analytics)

Component decisions

Ingestion: Kafka

  • Partitions: 1M/sec ÷ ~100k/sec/partition = 10 partitions minimum; round up to 12 for a small margin. Use 24 for headroom.
  • Replication factor: 3 in production
  • Retention: 7 days (raw events). Long enough for replay if downstream fails.
  • Producers: acks=all and enable.idempotence=true, so producer retries don't write duplicates to the log. End to end this is still at-least-once, with deduplication downstream.
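
A minimal sketch of the partition arithmetic, assuming the ~100k events/sec per-partition figure is a rough planning number rather than a measured limit. The function names are hypothetical. Strictly, 1M ÷ 100k comes out to 10, so 12 already carries a small margin and 24 roughly halves the per-partition load again.

```python
import math


def min_partitions(events_per_sec: int, per_partition_events_per_sec: int) -> int:
    """Smallest partition count that can absorb the target rate."""
    return math.ceil(events_per_sec / per_partition_events_per_sec)


def per_partition_load(events_per_sec: int, partitions: int) -> float:
    """Events/sec each partition carries if keys are spread evenly."""
    return events_per_sec / partitions


print(min_partitions(1_000_000, 100_000))          # 10
print(round(per_partition_load(1_000_000, 12)))    # 83333
print(round(per_partition_load(1_000_000, 24)))    # 41667
```

In an interview, saying the per-partition assumption out loud matters more than the exact number: real throughput depends on message size, batching, and broker hardware.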

Real-time processing: Flink

Why Flink over Spark Streaming?

  • True streaming (not micro-batch), so lower latency
  • Better exactly-once semantics with stateful operators
  • Native Kafka integration with checkpointing

Flink job:

Kafka source
  → Parse & validate
  → Enrich with user data (async I/O to Redis user cache)
  → Compute 1-min rolling window aggregations
  → Sink to Redis (live counts) + Postgres (event store)
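
To make the windowing step concrete, here is a plain-Python sketch of the aggregation logic, not Flink code. It assumes 1-minute tumbling (non-overlapping) windows keyed on event time; "rolling" in the outline could also mean sliding windows, which would assign each event to several windows. The field names (`event_time_ms`, `amount_cents`) are hypothetical.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute windows, as in the job outline


def window_start(event_time_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Start of the tumbling window that contains this event time."""
    return event_time_ms - (event_time_ms % size_ms)


def aggregate(events):
    """Return {window_start_ms: {"orders": n, "gmv_cents": total}}."""
    out = defaultdict(lambda: {"orders": 0, "gmv_cents": 0})
    for e in events:
        w = out[window_start(e["event_time_ms"])]
        w["orders"] += 1
        w["gmv_cents"] += e["amount_cents"]
    return dict(out)


events = [
    {"order_id": "o1", "event_time_ms": 5_000, "amount_cents": 1_000},
    {"order_id": "o2", "event_time_ms": 59_999, "amount_cents": 2_500},
    {"order_id": "o3", "event_time_ms": 60_000, "amount_cents": 400},
]
result = aggregate(events)
print(result[0])       # {'orders': 2, 'gmv_cents': 3500}
print(result[60_000])  # {'orders': 1, 'gmv_cents': 400}
```

In the real job this state lives in Flink's keyed state and is emitted when the watermark passes the window end; the bucketing rule is the same.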

Serving layer: Redis + ClickHouse

| Layer | Tech | Use case |
|---|---|---|
| Live (under 1s) | Redis sorted sets | Current order counts, GMV |
| Recent (5s-5min) | Postgres | Per-user order history |
| Historical | ClickHouse | Dashboard queries, trends |

ClickHouse is the right call for analytics (not Postgres) because:

  • Columnar storage → 10-100x faster for aggregations
  • Handles billions of rows without breaking a sweat
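  • ReplacingMergeTree deduplicates rows that share a sorting key during background merges, which absorbs late arrivals and replays (deduplication is eventual, so exact queries may need FINAL or an aggregation)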

Handling the hard parts

Late-arriving events

Events can arrive up to 2 minutes late (network delays, mobile clients).

Solution: watermarks in Flink, allowing events up to 2 minutes late:

WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(2))

Events that arrive after the watermark has passed their window go to a side output (sideOutputLateData on the windowed stream) for manual review.
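
The watermark rule is easier to explain with a toy model. The sketch below is plain Python, not Flink, and simplifies in two ways that are assumptions on my part: lateness is judged per event rather than per window, and the `late` list stands in for a side output. The `WatermarkTracker` class is hypothetical.

```python
from dataclasses import dataclass, field

MAX_OUT_OF_ORDERNESS_MS = 2 * 60 * 1000  # 2 minutes, matching the text


@dataclass
class WatermarkTracker:
    max_delay_ms: int = MAX_OUT_OF_ORDERNESS_MS
    max_event_time_ms: int = 0
    on_time: list = field(default_factory=list)
    late: list = field(default_factory=list)  # stand-in for a side output

    @property
    def watermark_ms(self) -> int:
        # The watermark trails the highest event time seen by the allowed delay.
        return self.max_event_time_ms - self.max_delay_ms

    def process(self, order_id: str, event_time_ms: int) -> None:
        if event_time_ms < self.watermark_ms:
            self.late.append(order_id)
        else:
            self.on_time.append(order_id)
            self.max_event_time_ms = max(self.max_event_time_ms, event_time_ms)


tracker = WatermarkTracker()
for oid, ts in [("o1", 600_000), ("o2", 500_000), ("o3", 470_000)]:
    tracker.process(oid, ts)

print(tracker.watermark_ms)  # 480000
print(tracker.on_time)       # ['o1', 'o2']
print(tracker.late)          # ['o3']
```

Here `o2` is 100 seconds out of order and still accepted, while `o3` is 130 seconds behind the newest event and gets diverted. The trade-off to mention: a larger bound catches more stragglers but delays every window result by the same amount.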

Data skew

Some users place 100x more orders than average. If partitioning by user_id, one partition gets hot.

Solution: partition by order_id (uniform) for throughput. Join with user data using async enrichment, not co-partitioning.
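
A quick simulation shows how much the key choice matters. This is a sketch under stated assumptions: the workload (one heavy user with 10,000 orders, 1,000 ordinary users with 10 each) is made up, and SHA-256 is used as a deterministic stand-in for Kafka's default partitioner, which hashes keys differently.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 12


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash; Kafka's default partitioner uses a different function.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def load_by_partition(keys) -> Counter:
    return Counter(partition_for(k) for k in keys)


# Hypothetical workload: (user_id, order_id) pairs.
orders = [("whale", f"order-w-{i}") for i in range(10_000)]
orders += [(f"user-{u}", f"order-{u}-{i}") for u in range(1_000) for i in range(10)]

by_user = load_by_partition(user_id for user_id, _ in orders)
by_order = load_by_partition(order_id for _, order_id in orders)

avg = len(orders) / NUM_PARTITIONS
print("max/avg load keyed by user_id: ", round(max(by_user.values()) / avg, 2))
print("max/avg load keyed by order_id:", round(max(by_order.values()) / avg, 2))
```

Keyed by user_id, the heavy user's partition carries at least half of all traffic in this workload; keyed by order_id, the busiest partition should sit close to the average. The cost is that per-user ordering is no longer guaranteed by the partition.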

Exactly-once

For financial data (orders), use:

  1. Idempotent Kafka producer
  2. Flink exactly-once with checkpointing to S3
  3. Idempotent sink (upsert by order_id)

The trade-off: exactly-once adds ~20-30% latency overhead. For non-financial events, at-least-once + idempotent sinks is usually fine.
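
The idempotent sink is the piece interviewers tend to probe, so here is a minimal sketch of why an upsert keyed by order_id makes replays harmless. The `IdempotentOrderSink` class is a hypothetical in-memory stand-in for a table with a primary key on order_id, not a real client.

```python
class IdempotentOrderSink:
    """In-memory stand-in for a table with a primary key on order_id."""

    def __init__(self):
        self.rows = {}

    def upsert(self, event: dict) -> None:
        # Writing the same order_id again overwrites the row instead of
        # adding a second one, so replays after a restart are harmless.
        self.rows[event["order_id"]] = event

    def total_gmv_cents(self) -> int:
        return sum(r["amount_cents"] for r in self.rows.values())


sink = IdempotentOrderSink()
batch = [
    {"order_id": "o1", "amount_cents": 1_000},
    {"order_id": "o2", "amount_cents": 2_500},
]
for e in batch:
    sink.upsert(e)

# Simulate a crash after the write but before the checkpoint completes:
# the job restarts from the last checkpoint and replays the same batch.
for e in batch:
    sink.upsert(e)

print(len(sink.rows))          # 2
print(sink.total_gmv_cents())  # 3500
```

Note that this only works for writes that overwrite state. A sink that increments a counter per event would double count on replay, which is where Flink's checkpoint-aligned commits come in.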


Estimating capacity

1M events/sec × 500 bytes/event = 500 MB/sec throughput

Kafka storage:
  500 MB/sec × 86400 sec/day × 7 days = 302 TB
  With 3x replication = ~900 TB

Flink:
  12 partitions → 12 parallel tasks
  Each task processes ~83k events/sec, manageable with 4 cores/2GB
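
The same estimate as a short script, useful for re-running with different inputs when the interviewer changes the numbers. It uses decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and treats 500 bytes as the average event size, which is the estimate's own assumption.

```python
EVENTS_PER_SEC = 1_000_000
BYTES_PER_EVENT = 500      # assumed average payload size
RETENTION_DAYS = 7
REPLICATION_FACTOR = 3
SECONDS_PER_DAY = 86_400

MB, TB = 10**6, 10**12

throughput_bytes = EVENTS_PER_SEC * BYTES_PER_EVENT
raw_bytes = throughput_bytes * SECONDS_PER_DAY * RETENTION_DAYS
replicated_bytes = raw_bytes * REPLICATION_FACTOR

print(throughput_bytes / MB)   # 500.0 MB/sec
print(raw_bytes / TB)          # 302.4 TB
print(replicated_bytes / TB)   # 907.2 TB
```

Compression is left out on purpose; stating that it would shrink these figures, without guessing a ratio, is enough for the whiteboard.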

Trade-offs to discuss

| Decision | Alternative | Trade-off |
|---|---|---|
| Flink for real-time | Spark Structured Streaming | Flink lower latency, Spark easier if team knows it |
| ClickHouse for analytics | BigQuery | ClickHouse cheaper self-hosted, BigQuery zero-ops |
| Redis for live layer | Materialize | Redis simpler, Materialize has SQL semantics |
| Kafka as backbone | Kinesis | Kafka more control, Kinesis zero-ops on AWS |


What a strong answer includes

  • Clarifying questions before drawing anything
  • Numbers: partition count, storage estimates, latency budget
  • Trade-off discussion: not just "I'd use Kafka" but "why Kafka over X"
  • Failure scenarios: what happens if Flink crashes mid-job?
  • Monitoring: consumer lag, Flink checkpoint duration, Redis memory
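
For the monitoring point, consumer lag can be tied back to the latency budget with simple arithmetic. This is a rough sketch with hypothetical function names and made-up offsets; it assumes lag is measured as log end offset minus committed offset, and that total backlog divided by the ingest rate approximates how far behind the consumer is.

```python
EVENTS_PER_SEC = 1_000_000
LATENCY_BUDGET_SEC = 5


def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus last committed offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}


def breaches_budget(lag_by_partition: dict,
                    events_per_sec: int = EVENTS_PER_SEC,
                    budget_sec: float = LATENCY_BUDGET_SEC) -> bool:
    # Backlog / ingest rate approximates seconds of delay behind the log head.
    return sum(lag_by_partition.values()) / events_per_sec > budget_sec


end = {0: 10_000_000, 1: 10_000_000}
committed = {0: 9_900_000, 1: 6_000_000}
lag = consumer_lag(end, committed)

print(lag)                   # {0: 100000, 1: 4000000}
print(breaches_budget(lag))  # False (about 4.1s of backlog against a 5s budget)
```

The per-partition view is worth keeping alongside the total: partition 1 holds nearly all of the backlog here, which points at a hot or stalled task rather than a cluster-wide slowdown.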