Notes

  1. Spark overview: driver (hosts the SparkSession, plans and schedules work) + executor(s) (parallel worker processes that run tasks)
  2. Shuffle - exchanging partition data between executors across the cluster (network and disk cost can be very high)
  3. DataFrame - use explain() to see the query plan, from logical lineage down to physical execution (sketch below)
  4. DataFrame (untyped - generic Row) vs. Dataset (typed - JVM objects); a DataFrame is just Dataset[Row] (sketch below)
  5. Partitioning - by column values (hash or range) or nondeterministically (e.g. round-robin)
  6. Scala column syntax: "column" / col("column") / $"column" / 'column / expr("column") - key point: columns ARE expressions (sketch below)
  7. repartition() - always a full shuffle; coalesce() can reduce the partition count without one (sketch below)
  8. JOIN - shuffle join vs. broadcast join (broadcast the small side to avoid shuffling the big one; sketch below)
  9. Managed tables - Spark owns both metadata and data, stored under spark.sql.warehouse.dir (default /user/hive/warehouse); dropping the table deletes the data (sketch below)
  10. Worker node - hosts one or more executors
  11. Best practice for the number of partitions - a multiple of the total number of executor cores, so no core sits idle within a wave of tasks (sketch below)
  12. Task - a unit of computation applied to a single partition of data
  13. Action -> job -> multiple stages (split at shuffle boundaries, i.e. physical repartitioning of data) -> multiple tasks
  14. Speculation (spark.speculation) - relaunches suspiciously slow tasks on other nodes to work around stragglers (node issues, latency issues, etc.) (sketch below)
  15. Driver OOM - potential causes: collect() on large datasets (not enough driver memory), converting data between languages (Python <-> JVM), and dirty data where NULL is encoded inconsistently ("", "null", "empty", "n.a.", ...) (sketch below)
  16. Kryo serializer - more efficient than Java serialization; lets you register classes to shrink the serialized output (sketch below)
  17. Serialization - matters for RDDs and Datasets; in UDFs, close over only the fields you need, never whole objects (sketch below)
  18. Garbage collector (GC) - GC logs are useful for diagnosing executor memory pressure (sketch below)
  19. Structured streaming - transformations plus a single action form a stream (sketch below):
      - restricted set of allowed sources/inputs and sinks
      - schema inference for file sources must be enabled explicitly in configuration
      - checkpointing enables easy recovery from failure and continued processing
  20. Optimizing Spark with Databricks Academy
  21. Spark Cost Optimizer: https://www.youtube.com/watch?v=WSIN6f-wHcQ
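
Sketches

Minimal Scala sketches for the notes above. Assumptions throughout (not from the original notes): Spark 3.x in local mode, with hypothetical names and paths.

Note 3 - explain() on a small DataFrame; later sketches reuse this spark session, df, and the implicits import.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("notes").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1L, "a"), (2L, "b"), (3L, "a")).toDF("id", "label")

val agg = df.filter($"id" > 1L).groupBy($"label").count()
agg.explain()      // physical plan only
agg.explain(true)  // parsed, analyzed, optimized, and physical plans
```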
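
Note 4 - the same data as an untyped DataFrame vs. a typed Dataset; the case class is an assumption.

```scala
case class Record(id: Long, label: String)

val ds = df.as[Record]                        // typed: Dataset[Record]
ds.filter(_.label == "a").map(_.id).show()    // field names/types checked at compile time
df.filter($"label" === "a").show()            // untyped: column names resolved only at runtime
```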
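
Note 6 - the column syntaxes side by side; all five resolve to the same Column, which is an unevaluated expression.

```scala
import org.apache.spark.sql.functions.{col, expr}

df.select(
  df("label"),     // via the DataFrame itself
  col("label"),    // functions.col
  $"label",        // string interpolator (needs spark.implicits._)
  'label,          // Symbol syntax (deprecated in recent Scala/Spark versions)
  expr("label")    // parsed from a SQL string; expr("id + 1") also works
).show()
```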
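
Note 7 - repartition() vs. coalesce(); the partition counts are arbitrary.

```scala
val byLabel = df.repartition(8, $"label")   // full shuffle, hash-partitioned on label
val fewer   = df.coalesce(1)                // merges partitions without a shuffle
println(byLabel.rdd.getNumPartitions)       // 8
```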
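
Note 8 - a broadcast join hint; the dimension table is hypothetical.

```scala
import org.apache.spark.sql.functions.broadcast

val dims = Seq(("a", "alpha"), ("b", "beta")).toDF("label", "name")

// The small table is shipped whole to every executor, so df itself is not shuffled.
df.join(broadcast(dims), "label").explain()   // plan should show BroadcastHashJoin
```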
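
Note 9 - a managed table; the table name is hypothetical. Spark stores its files under spark.sql.warehouse.dir, and DROP TABLE removes them too.

```scala
df.write.mode("overwrite").saveAsTable("notes_demo")
spark.sql("DESCRIBE EXTENDED notes_demo").show(truncate = false)   // look for Type = MANAGED and the Location row
```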
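
Note 11 - sizing partitions as a multiple of the available cores; the factor 3 is an assumption, not a rule from the notes.

```scala
// defaultParallelism is roughly the total number of executor cores.
val cores = spark.sparkContext.defaultParallelism
val tuned = df.repartition(cores * 3)   // every core gets work in each task wave
```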
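
Note 14 - speculation settings; the thresholds are illustrative. In a real job these must be set before the first session is created, since getOrCreate() reuses an existing one.

```scala
import org.apache.spark.sql.SparkSession

val speculative = SparkSession.builder()
  .appName("speculation-demo")
  .config("spark.speculation", "true")             // relaunch suspiciously slow tasks
  .config("spark.speculation.multiplier", "1.5")   // "slow" = 1.5x the median task duration
  .config("spark.speculation.quantile", "0.75")    // only after 75% of tasks have finished
  .getOrCreate()
```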
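
Note 15 - bounded alternatives to collect(); the output path is hypothetical.

```scala
val preview = df.take(20)                           // bounded: at most 20 rows reach the driver
df.write.mode("overwrite").parquet("/tmp/notes")    // distributed: executors write directly
// df.collect()                                     // unbounded: can OOM the driver on big data
```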
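
Note 16 - enabling Kryo and registering a class; Event is hypothetical.

```scala
import org.apache.spark.SparkConf

case class Event(id: Long, payload: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array[Class[_]](classOf[Event]))   // avoids writing full class names into every record
```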
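
Note 17 - closure capture in a UDF; Enricher and its oversized cache are hypothetical.

```scala
import org.apache.spark.sql.functions.udf

class Enricher(val mapping: Map[String, Int], val hugeCache: Array[Byte])
val enricher = new Enricher(Map("a" -> 1, "b" -> 2), new Array[Byte](64 << 20))

// Bad: the closure captures all of `enricher`, shipping hugeCache to every
// task (and failing outright here, since Enricher is not Serializable):
//   val bad = udf((k: String) => enricher.mapping.getOrElse(k, 0))

// Good: copy only the required field into a local val before building the UDF.
val mapping = enricher.mapping
val good = udf((k: String) => mapping.getOrElse(k, 0))
df.select(good($"label")).show()
```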
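
Note 18 - turning on executor GC logging; -verbose:gc is a standard JVM flag, and the rest matches the earlier sketches.

```scala
import org.apache.spark.sql.SparkSession

val withGcLogs = SparkSession.builder()
  .config("spark.executor.extraJavaOptions", "-verbose:gc")   // GC events land in the executor stderr logs
  .getOrCreate()
```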
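
Note 19 - a file-source stream with an explicit schema and a checkpoint; paths are hypothetical. (File-source schema inference requires spark.sql.streaming.schemaInference=true.)

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", DoubleType)))

val stream = spark.readStream.schema(schema).json("/tmp/in")

val query = stream.groupBy($"id").count()
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/chk")   // state stored here lets the query recover and continue after failure
  .format("console")
  .start()
// query.awaitTermination()
```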