Driver OOM - potential causes: collect() on large datasets (the result does not fit in driver memory), converting data between languages (Python/Scala), corrupted data (a real NULL is not the same as “”, “null”, “empty”, or “n.a.”)
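Sketch for the corrupted-data point: a minimal plain-Python example (no Spark needed) that maps null-like placeholder strings to a real None before further processing. The NULL_LIKE set, the normalize_missing helper and the sample row are made up for illustration; the real placeholders depend on the data source.

```python
# Placeholder strings this sketch treats as missing values (an assumption;
# adjust the set to whatever the data source actually emits).
NULL_LIKE = {"", "null", "empty", "n.a."}


def normalize_missing(value):
    """Return None for null-like placeholders, otherwise the value unchanged."""
    if value is None:
        return None
    # Compare case-insensitively and ignore surrounding whitespace.
    if isinstance(value, str) and value.strip().lower() in NULL_LIKE:
        return None
    return value


# Hypothetical input row where missing values are encoded in different ways.
row = {"id": "7", "city": "n.a.", "name": "", "note": "NULL", "zip": "10115"}
cleaned = {key: normalize_missing(val) for key, val in row.items()}
```

After this step every missing value is a real None, so null checks behave the same way for all columns.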
Kryo serializer - more efficient than the default Java serialization, allows registering classes
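Sketch: one possible way to switch to Kryo is through spark-submit `--conf` flags. The small helper below only builds the argument list. The configuration keys (spark.serializer, spark.kryo.classesToRegister, spark.kryo.registrationRequired) are Spark settings; the helper name kryo_submit_args and the com.example.* class names are hypothetical.

```python
def kryo_submit_args(classes, registration_required=False):
    """Build spark-submit --conf arguments that enable Kryo serialization."""
    conf = {
        # Replace the default Java serializer with Kryo.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Comma-separated list of classes to register with Kryo.
        "spark.kryo.classesToRegister": ",".join(classes),
        # If "true", serializing an unregistered class raises an error.
        "spark.kryo.registrationRequired": str(registration_required).lower(),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical application classes, used only for illustration.
args = kryo_submit_args(["com.example.Event", "com.example.User"])
```

The resulting list can be appended to a spark-submit command line; the same keys can also go into spark-defaults.conf.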
Serialization - watch out for it in RDDs and Datasets; in UDFs, serialize only the required fields, not whole objects
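Sketch: why referencing a single field is cheaper than dragging the whole object along. This uses Python's pickle as a stand-in for whatever serializer ships the UDF closure to the executors, so the byte counts are only illustrative. The customer record is hypothetical.

```python
import pickle

# Hypothetical record: one small field a UDF needs, plus a large payload it does not.
customer = {
    "customer_id": 1,
    "country": "DE",
    "history": list(range(10_000)),
}

# Referencing `customer` inside a UDF would force the whole object to be serialized.
whole_object_bytes = len(pickle.dumps(customer))

# Copying the required field into a local variable first means that only
# this small value has to be serialized with the function.
country = customer["country"]
only_field_bytes = len(pickle.dumps(country))
```

Comparing whole_object_bytes with only_field_bytes shows the gap; the same habit (copy the needed field into a local variable, then reference only that variable in the UDF) applies in Scala closures.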
Garbage collector (GC) - its logs are useful for diagnosing memory issues
Structured Streaming - transformations with a single action = stream:
-> allowed sources/inputs and sinks
-> schema inference - must be enabled in configuration (spark.sql.streaming.schemaInference)
-> checkpoint - to recover easily from failure and continue processing
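
Sketch of the checkpoint idea in plain Python. This is an analogy, not Spark's implementation: progress is persisted after each record, so a restarted run continues where the failed one stopped. In Spark itself this corresponds to setting the checkpointLocation option on the streaming write. The function process_stream, the fail_at parameter and the file name offsets.json are made up for illustration.

```python
import json
import os
import tempfile


def process_stream(records, checkpoint_path, fail_at=None):
    """Process records in order, persisting the next offset after each one."""
    start = 0
    # On startup, resume from the last persisted offset if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]

    processed = []
    for offset in range(start, len(records)):
        if offset == fail_at:
            raise RuntimeError("simulated failure")
        processed.append(records[offset].upper())  # the "transformation"
        # Record progress only after the record has been handled.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_offset": offset + 1}, f)
    return processed


records = ["a", "b", "c", "d"]
checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")

try:
    # First run handles offsets 0 and 1, then fails at offset 2.
    process_stream(records, checkpoint, fail_at=2)
except RuntimeError:
    pass

# Second run reads the checkpoint and continues with offsets 2 and 3 only.
resumed = process_stream(records, checkpoint)
```

Without the checkpoint file the second run would start again from offset 0 and reprocess everything.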