withsoon

The big picture

Your code → SparkContext → DAG Scheduler → Task Scheduler → Executors

Spark is a distributed compute engine. You write transformations (map, filter, join), Spark figures out how to run them across a cluster.

Driver vs Executors

| Component | What it does | |---|---| | Driver | Runs your main() code, builds the execution plan, coordinates everything | | Executor | Worker process on a node — runs tasks, caches data | | Cluster Manager | Allocates resources (YARN, Kubernetes, Standalone) |

The Driver is a bottleneck for:

collect() — pulls all data to Driver
Broadcast joins — Driver sends the broadcast table
Accumulator updates

RDDs, DataFrames, Datasets

RDD — raw distributed collection, no schema, full control, verbose
DataFrame — distributed table with schema, Catalyst optimizer applies, 10-100x faster than RDD for most tasks
Dataset — typed DataFrame, JVM only (not Python)

Always use DataFrames in Python (PySpark). RDDs bypass Catalyst.

How a job runs: DAG → Stages → Tasks

df.filter(col("age") > 25).groupBy("city").count().show()

Spark builds a DAG (Directed Acyclic Graph) of transformations
DAG Scheduler splits the DAG into stages — boundaries at wide transformations (shuffles)
Each stage runs as parallel tasks, one per partition

Stage 1: filter (narrow — no shuffle)
  → Stage 2: groupBy + count (wide — shuffle needed)
    → show

Shuffles — the expensive part

A shuffle happens when data needs to move between executors (groupBy, join, distinct, repartition).

What happens during a shuffle:

Mapper tasks write shuffle data to local disk (map output files)
Reducer tasks fetch the data over the network
Data is sorted and merged

Shuffle is slow because: disk I/O + network transfer + sort.

How to minimize shuffles

# Bad: two separate shuffles
df.groupBy("country").count()
df.groupBy("country").sum("revenue")

# Good: one shuffle
df.groupBy("country").agg(count("*"), sum("revenue"))

# Bad: shuffle both sides
large_df.join(medium_df, "id")

# Good: broadcast the small side (no shuffle)
from pyspark.sql.functions import broadcast
large_df.join(broadcast(small_df), "id")

Rule of thumb: if one side of a join fits in memory (~100MB), broadcast it.

Partitions

The number of partitions controls parallelism.

# See partition count
df.rdd.getNumPartitions()

# Repartition (triggers shuffle, even distribution)
df.repartition(200)

# Coalesce (no shuffle, reduces partitions only)
df.coalesce(10)

Default shuffle partitions: spark.sql.shuffle.partitions = 200

For small datasets this is way too many. For large datasets it may be too few.

Rule of thumb: target 100–500MB of data per partition.

Common performance problems

| Symptom | Cause | Fix | |---|---|---| | One task takes 10x longer | Data skew — one partition has most data | Salting, AQE | | OOM on Driver | collect() on large data | Use write() instead | | OOM on Executor | Partition too large | Repartition / increase executor memory | | Slow joins | Missing broadcast | Add broadcast() hint | | Thousands of small files | Too many output partitions | Coalesce before write |

Adaptive Query Execution (AQE)

Spark 3.0+ dynamically adjusts the plan at runtime:

Merges small shuffle partitions automatically
Converts joins to broadcast joins if one side is small
Handles skew by splitting large partitions

Enable it: spark.sql.adaptive.enabled = true (default in Spark 3.2+)

Key configs to know

spark.executor.memory=4g
spark.executor.cores=4
spark.sql.shuffle.partitions=200       # tune for your data size
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer

Spark Architecture — How it Actually Works