Notes

  1. Spark overview: driver (hosts the SparkSession, plans and schedules work) + executor(s) (parallel worker processes that run tasks)
  2. Shuffle - exchanging partition data between executors across the cluster (network and disk cost can be very high)
  3. DataFrame - use explain() to see the query plan, from logical lineage down to physical execution (sketch below)
  4. DataFrame (untyped - generic Row) vs. Dataset (typed - JVM objects); a DataFrame is just Dataset[Row] (sketch below)
  5. Partitioning - by column values (hash or range) or nondeterministically (e.g. round-robin)
  6. Scala column syntax: "column" / col("column") / $"column" / 'column / expr("column") - key point: columns ARE expressions (sketch below)
  7. repartition() - always a full shuffle; coalesce() can reduce the partition count without one (sketch below)
  8. JOIN - shuffle join vs. broadcast join (broadcast the small side to avoid shuffling the big one; sketch below)
  9. Managed tables - Spark owns both metadata and data, stored under spark.sql.warehouse.dir (default /user/hive/warehouse); dropping the table deletes the data (sketch below)
  10. Worker node - hosts one or more executors
  11. Best practice for the number of partitions - a multiple of the total number of executor cores, so no core sits idle within a wave of tasks (sketch below)
  12. Task - a unit of computation applied to a single partition of data
  13. Action -> job -> multiple stages (split at shuffle boundaries, i.e. physical repartitioning of data) -> multiple tasks
  14. Speculation (spark.speculation) - relaunches suspiciously slow tasks on other nodes to work around stragglers (node issues, latency issues, etc.) (sketch below)
  15. Driver OOM - potential causes: collect() on large datasets (not enough driver memory), converting data between languages (Python <-> JVM), and dirty data where NULL is encoded inconsistently ("", "null", "empty", "n.a.", ...) (sketch below)
  16. Kryo serializer - more efficient than Java serialization; lets you register classes to shrink the serialized output (sketch below)
  17. Serialization - matters for RDDs and Datasets; in UDFs, close over only the fields you need, never whole objects (sketch below)
  18. Garbage collector (GC) - GC logs are useful for diagnosing executor memory pressure (sketch below)
  19. Structured streaming - transformations plus a single action form a stream (sketch below):
      - restricted set of allowed sources/inputs and sinks
      - schema inference for file sources must be enabled explicitly in configuration
      - checkpointing enables easy recovery from failure and continued processing
  20. Optimizing Spark with Databricks Academy
  21. Spark Cost Optimizer: https://www.youtube.com/watch?v=WSIN6f-wHcQ
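
Sketches

Minimal Scala sketches for the notes above. Assumptions throughout (not from the original notes): Spark 3.x in local mode, with hypothetical names and paths.

Note 3 - explain() on a small DataFrame; later sketches reuse this spark session, df, and the implicits import.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("notes").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1L, "a"), (2L, "b"), (3L, "a")).toDF("id", "label")

val agg = df.filter($"id" > 1L).groupBy($"label").count()
agg.explain()      // physical plan only
agg.explain(true)  // parsed, analyzed, optimized, and physical plans
```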
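
Note 4 - the same data as an untyped DataFrame vs. a typed Dataset; the case class is an assumption.

```scala
case class Record(id: Long, label: String)

val ds = df.as[Record]                        // typed: Dataset[Record]
ds.filter(_.label == "a").map(_.id).show()    // field names/types checked at compile time
df.filter($"label" === "a").show()            // untyped: column names resolved only at runtime
```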
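
Note 6 - the column syntaxes side by side; all five resolve to the same Column, which is an unevaluated expression.

```scala
import org.apache.spark.sql.functions.{col, expr}

df.select(
  df("label"),     // via the DataFrame itself
  col("label"),    // functions.col
  $"label",        // string interpolator (needs spark.implicits._)
  'label,          // Symbol syntax (deprecated in recent Scala/Spark versions)
  expr("label")    // parsed from a SQL string; expr("id + 1") also works
).show()
```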
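
Note 7 - repartition() vs. coalesce(); the partition counts are arbitrary.

```scala
val byLabel = df.repartition(8, $"label")   // full shuffle, hash-partitioned on label
val fewer   = df.coalesce(1)                // merges partitions without a shuffle
println(byLabel.rdd.getNumPartitions)       // 8
```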
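
Note 8 - a broadcast join hint; the dimension table is hypothetical.

```scala
import org.apache.spark.sql.functions.broadcast

val dims = Seq(("a", "alpha"), ("b", "beta")).toDF("label", "name")

// The small table is shipped whole to every executor, so df itself is not shuffled.
df.join(broadcast(dims), "label").explain()   // plan should show BroadcastHashJoin
```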
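
Note 9 - a managed table; the table name is hypothetical. Spark stores its files under spark.sql.warehouse.dir, and DROP TABLE removes them too.

```scala
df.write.mode("overwrite").saveAsTable("notes_demo")
spark.sql("DESCRIBE EXTENDED notes_demo").show(truncate = false)   // look for Type = MANAGED and the Location row
```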
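
Note 11 - sizing partitions as a multiple of the available cores; the factor 3 is an assumption, not a rule from the notes.

```scala
// defaultParallelism is roughly the total number of executor cores.
val cores = spark.sparkContext.defaultParallelism
val tuned = df.repartition(cores * 3)   // every core gets work in each task wave
```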
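
Note 14 - speculation settings; the thresholds are illustrative. In a real job these must be set before the first session is created, since getOrCreate() reuses an existing one.

```scala
import org.apache.spark.sql.SparkSession

val speculative = SparkSession.builder()
  .appName("speculation-demo")
  .config("spark.speculation", "true")             // relaunch suspiciously slow tasks
  .config("spark.speculation.multiplier", "1.5")   // "slow" = 1.5x the median task duration
  .config("spark.speculation.quantile", "0.75")    // only after 75% of tasks have finished
  .getOrCreate()
```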
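
Note 15 - bounded alternatives to collect(); the output path is hypothetical.

```scala
val preview = df.take(20)                           // bounded: at most 20 rows reach the driver
df.write.mode("overwrite").parquet("/tmp/notes")    // distributed: executors write directly
// df.collect()                                     // unbounded: can OOM the driver on big data
```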
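
Note 16 - enabling Kryo and registering a class; Event is hypothetical.

```scala
import org.apache.spark.SparkConf

case class Event(id: Long, payload: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array[Class[_]](classOf[Event]))   // avoids writing full class names into every record
```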
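
Note 17 - closure capture in a UDF; Enricher and its oversized cache are hypothetical.

```scala
import org.apache.spark.sql.functions.udf

class Enricher(val mapping: Map[String, Int], val hugeCache: Array[Byte])
val enricher = new Enricher(Map("a" -> 1, "b" -> 2), new Array[Byte](64 << 20))

// Bad: the closure captures all of `enricher`, shipping hugeCache to every
// task (and failing outright here, since Enricher is not Serializable):
//   val bad = udf((k: String) => enricher.mapping.getOrElse(k, 0))

// Good: copy only the required field into a local val before building the UDF.
val mapping = enricher.mapping
val good = udf((k: String) => mapping.getOrElse(k, 0))
df.select(good($"label")).show()
```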
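
Note 18 - turning on executor GC logging; -verbose:gc is a standard JVM flag, and the rest matches the earlier sketches.

```scala
import org.apache.spark.sql.SparkSession

val withGcLogs = SparkSession.builder()
  .config("spark.executor.extraJavaOptions", "-verbose:gc")   // GC events land in the executor stderr logs
  .getOrCreate()
```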
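
Note 19 - a file-source stream with an explicit schema and a checkpoint; paths are hypothetical. (File-source schema inference requires spark.sql.streaming.schemaInference=true.)

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", DoubleType)))

val stream = spark.readStream.schema(schema).json("/tmp/in")

val query = stream.groupBy($"id").count()
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/chk")   // state stored here lets the query recover and continue after failure
  .format("console")
  .start()
// query.awaitTermination()
```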