The big picture
Your code → SparkContext → DAG Scheduler → Task Scheduler → Executors
Spark is a distributed compute engine. You write transformations (map, filter, join), and Spark works out how to run them across a cluster.
Driver vs Executors
| Component | What it does |
|---|---|
| Driver | Runs your main() code, builds the execution plan, coordinates everything |
| Executor | Worker process on a node: runs tasks, caches data |
| Cluster Manager | Allocates resources (YARN, Kubernetes, Standalone) |
The Driver is a bottleneck for:
- collect(): pulls all data to the Driver
- Broadcast joins: the Driver sends the broadcast table
- Accumulator updates
RDDs, DataFrames, Datasets
- RDD: raw distributed collection, no schema, full control, verbose
- DataFrame: distributed table with a schema; the Catalyst optimizer applies, so it is usually much faster than equivalent RDD code (the size of the gap depends on the workload)
- Dataset: typed DataFrame, JVM only (not Python)
Prefer DataFrames in Python (PySpark). RDD code bypasses Catalyst, and its Python functions run row by row in Python worker processes.
How a job runs: DAG → Stages → Tasks
from pyspark.sql.functions import col
df.filter(col("age") > 25).groupBy("city").count().show()
- Spark builds a DAG (Directed Acyclic Graph) of transformations
- DAG Scheduler splits the DAG into stages — boundaries at wide transformations (shuffles)
- Each stage runs as parallel tasks, one per partition
Stage 1: filter (narrow — no shuffle)
→ Stage 2: groupBy + count (wide — shuffle needed)
→ show
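The stage split above can be modelled in a few lines of plain Python. This is only a sketch of the rule "cut the chain before each wide transformation"; it is not Spark's DAG Scheduler, and the names `WIDE` and `split_into_stages` are made up for illustration.

```python
# Plain-Python model of stage splitting (an illustration, not Spark's scheduler).
# Assumption: the set of wide operations is the one this document lists as shuffles.
WIDE = {"groupBy", "join", "distinct", "repartition"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting before each wide op."""
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])  # a shuffle boundary starts a new stage
        stages[-1].append(op)
    return stages

stages = split_into_stages(["filter", "groupBy", "count"])
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {' + '.join(stage)}")
```

Narrow operations stay in the stage they started in, so a long chain of filters and selects still runs as one stage.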
Shuffles — the expensive part
A shuffle happens when data needs to move between executors (groupBy, join, distinct, repartition).
What happens during a shuffle:
- Mapper tasks write shuffle data to local disk (map output files)
- Reducer tasks fetch the data over the network
- Data is sorted and merged
Shuffle is slow because: disk I/O + network transfer + sort.
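To make the map side and reduce side concrete, here is a plain-Python model of a shuffle for a count-by-key. It is a sketch under simplifying assumptions: lists stand in for map output files, a function call stands in for the network fetch, and `zlib.crc32` stands in for Spark's own partitioning hash. All function names are hypothetical.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_reducers):
    # Deterministic hash so the example is reproducible; Spark uses its own hash.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_side(partition, num_reducers):
    """One mapper: bucket its records by target reducer (stands in for map output files)."""
    buckets = defaultdict(list)
    for key, value in partition:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return buckets

def reduce_side(reducer_id, all_map_outputs):
    """One reducer: fetch its bucket from every mapper, then sort and merge by key."""
    fetched = [rec for out in all_map_outputs for rec in out.get(reducer_id, [])]
    counts = {}
    for key, value in sorted(fetched):
        counts[key] = counts.get(key, 0) + value
    return counts

input_partitions = [
    [("paris", 1), ("oslo", 1)],
    [("paris", 1), ("lima", 1), ("oslo", 1)],
]
num_reducers = 2
map_outputs = [map_side(p, num_reducers) for p in input_partitions]
result = {}
for reducer_id in range(num_reducers):
    result.update(reduce_side(reducer_id, map_outputs))
print(result)
```

Every record is written once on the map side and read once on the reduce side, which is where the disk and network cost comes from.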
How to minimize shuffles
from pyspark.sql import functions as F

# Bad: two separate shuffles
df.groupBy("country").count()
df.groupBy("country").sum("revenue")
# Good: one shuffle computes both aggregates
df.groupBy("country").agg(F.count("*"), F.sum("revenue"))
from pyspark.sql.functions import broadcast

# Bad: shuffle both sides
large_df.join(small_df, "id")
# Good: broadcast the small side (large_df is not shuffled)
large_df.join(broadcast(small_df), "id")
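Rule of thumb: if one side of a join fits comfortably in memory on the Driver and on every executor (~100MB), broadcast it. Spark already does this on its own for tables it estimates to be smaller than spark.sql.autoBroadcastJoinThreshold (10MB by default).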
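Why broadcasting removes the shuffle is easier to see in a plain-Python model of a broadcast hash join. This is a sketch, not Spark's implementation: the small side becomes an in-memory lookup table that every partition of the large side probes locally. `broadcast_hash_join` and the sample rows are hypothetical.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each large-side partition against an in-memory copy of the small side."""
    # Build the lookup once; in Spark this copy would be shipped to every executor.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined = []
    for partition in large_partitions:
        # Each partition is joined where it already lives: no rows of the large side move.
        joined.append([(k, v, s) for k, v in partition for s in lookup.get(k, [])])
    return joined

large = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]
small = [(1, "x"), (3, "y")]
print(broadcast_hash_join(large, small))
```

The cost moves from shuffling the large side to holding one copy of the small side per executor, which is why the small side has to fit in memory.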
Partitions
The number of partitions controls parallelism.
# See partition count
df.rdd.getNumPartitions()
# Repartition (triggers shuffle, even distribution); returns a new DataFrame
df = df.repartition(200)
# Coalesce (no shuffle, reduces partitions only); returns a new DataFrame
df = df.coalesce(10)
Default shuffle partitions: spark.sql.shuffle.partitions = 200
For small datasets this is way too many. For large datasets it may be too few.
Rule of thumb: target 100–500MB of data per partition.
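The rule of thumb turns into simple arithmetic: divide the data size by the target partition size and round up. The helper below is a sketch; `suggest_partitions` is a hypothetical name, and the 200MB default is just one value picked from the 100–500MB range above.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def suggest_partitions(total_bytes, target_bytes=200 * MB):
    """Return a partition count that keeps each partition at or below target_bytes."""
    return max(1, math.ceil(total_bytes / target_bytes))

print(suggest_partitions(50 * GB))  # a 50GB shuffle
print(suggest_partitions(10 * MB))  # a tiny dataset needs far fewer than 200
```

The result is a starting point for spark.sql.shuffle.partitions or repartition(), to be adjusted after looking at actual task sizes.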
Common performance problems
| Symptom | Cause | Fix |
|---|---|---|
| One task takes 10x longer | Data skew: one partition has most of the data | Salting, AQE |
| OOM on Driver | collect() on large data | Use df.write instead |
| OOM on Executor | Partition too large | Repartition / increase executor memory |
| Slow joins | Missing broadcast | Add broadcast() hint |
| Thousands of small files | Too many output partitions | Coalesce before write |
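Salting is the least obvious fix in the table, so here is a plain-Python model of it for a count-by-key. The idea: attach a random salt to each key so a hot key is split across several groups, aggregate per (key, salt), then drop the salt and aggregate again. This is a sketch with hypothetical names and made-up data; in Spark the two steps would be two groupBy aggregations.

```python
import random
from collections import Counter

def salted_count(keys, num_salts, seed=0):
    """Count keys in two steps so that no single group holds all rows of a hot key."""
    rng = random.Random(seed)  # seeded only to make the example reproducible
    # Step 1: partial counts per (key, salt); the hot key is spread over num_salts groups.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Step 2: strip the salt and add up the partial counts.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final

keys = ["US"] * 1000 + ["NO"] * 10  # "US" is the skewed key
partial, final = salted_count(keys, num_salts=8)
print("largest group without salting:", max(Counter(keys).values()))
print("largest group with salting:   ", max(partial.values()))
print("final counts:", dict(final))
```

The final counts are unchanged; only the size of the largest intermediate group shrinks, at the price of a second aggregation.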
Adaptive Query Execution (AQE)
Spark 3.0+ dynamically adjusts the plan at runtime:
- Merges small shuffle partitions automatically
- Converts joins to broadcast joins if one side is small
- Handles skew by splitting large partitions
Enable it: spark.sql.adaptive.enabled = true (default in Spark 3.2+)
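The first bullet, merging small shuffle partitions, can be pictured as a greedy pass over the partition sizes measured after the shuffle. The function below is a simplified model, not Spark's actual algorithm; `coalesce_partitions` is a hypothetical name, and the 64 used as the target is an assumption that mirrors the default of spark.sql.adaptive.advisoryPartitionSizeInBytes (64MB).

```python
def coalesce_partitions(sizes, target):
    """Greedily merge adjacent partitions while the merged size stays within target.

    Returns a list of groups, each a list of original partition indices.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target:
            groups.append(current)  # close the group before it would exceed the target
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

sizes_mb = [10, 20, 5, 120, 30, 30]  # made-up post-shuffle partition sizes in MB
print(coalesce_partitions(sizes_mb, target=64))
```

The oversized partition (120MB) stays on its own here; splitting it is the job of the separate skew handling in the third bullet.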
Key configs to know
spark.executor.memory=4g
spark.executor.cores=4
# tune spark.sql.shuffle.partitions for your data size
spark.sql.shuffle.partitions=200
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
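One common way to apply these settings is as `--conf key=value` pairs on the spark-submit command line. The helper below is a small sketch that builds those arguments from a dict; `to_spark_submit_args` and `app.py` are hypothetical names, and the values are the ones listed above.

```python
def to_spark_submit_args(conf):
    """Turn a config dict into a flat list of spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

conf = {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

# "app.py" is a placeholder for your own application script.
command = ["spark-submit", *to_spark_submit_args(conf), "app.py"]
print(" ".join(command))
```

Keeping the settings in one dict makes it easy to vary a single value, such as spark.sql.shuffle.partitions, between runs.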