Quick notes on Spark basics, based on materials from BerkeleyX CS190.1x (Scalable Machine Learning) on edX.
Why we may want to use Spark:
Resilient Distributed Datasets (RDDs)
- RDDs are immutable: they cannot be changed after they are constructed
- They can be created by transformations applied to existing RDDs
- They enable parallel operations on collections of distributed data
- They track lineage information, so lost data can be efficiently recomputed
Some basic Spark concepts: Transformations
- Transformations are lazy: they are not computed right away
- Transformations are not vulnerable to machine failures, since lineage lets lost results be recomputed
- A transformation is like a recipe for creating a result
Some basic Spark concepts: Actions
Properties of Spark Actions:
- They cause Spark to execute the recipe to transform the source data
- They are the primary mechanism for getting results out of Spark
- The results are returned to the driver
About Caching RDDs
If you plan to reuse an RDD, you should cache it, so that it does not have to be read from disk again and again; instead it can be served from RAM.
Spark Programming Life-cycle
Some key points about Spark Program Lifecycle:
- RDDs that are reused may be cached
- Transformations lazily create new RDDs
- Transformations create recipes for performing parallel computation on datasets
- Actions cause parallel computation to be immediately executed
A few words about Key-Value Transformations
PySpark Shared Variables:
In iterative or repeated computations, broadcast variables avoid repeatedly shipping the same data to workers: the data is sent to each worker once, instead of being re-sent automatically inside every task's closure.
Accumulators can only be written by workers and read by the driver program.