Spark Essentials

by Fuyang

Quick notes on Spark basics, based on material from BerkeleyX CS190.1x: Scalable Machine Learning on edX.

Why we may want to use Spark:

(motivation figure from the course slides)

Resilient Distributed Dataset (RDD)

  • RDDs cannot be changed after they are constructed
  • They can be created by transformations applied to existing RDDs
  • They enable parallel operations on collections of distributed data
  • They track lineage information to enable efficient recomputation of lost data
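
As a minimal sketch of these points (assuming a SparkContext is already available as sc, as in the PySpark shell): an RDD can be created from a local collection, and a transformation produces a new RDD instead of modifying the original.

    # Assumes sc is an existing SparkContext (e.g. from the PySpark shell).
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)          # distribute the collection across workers
    doubled = rdd.map(lambda x: x * 2)  # a NEW RDD; rdd itself is unchanged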

 

A basic Spark concept: Transformations

Figures 1-2: transformation examples

  • Transformations are not computed right away
  • Transformations are not vulnerable to machine failures
  • Transformations are like a recipe for creating a result
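
For example (again assuming sc is a SparkContext), none of the lines below computes anything; each one only records another step of the recipe:

    rdd = sc.parallelize(range(10))
    evens = rdd.filter(lambda x: x % 2 == 0)  # recipe: keep even numbers
    squares = evens.map(lambda x: x * x)      # recipe: square each element
    # No job has run yet; Spark has only recorded the lineage of squares.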

 

A basic Spark concept: Actions

Figures 3-5: an action, plus two simple Spark action examples

Properties of Spark Actions:

  • They cause Spark to execute the recipe to transform the source data
  • They are the primary mechanism for getting results out of Spark
  • The results are returned to the driver
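
A small sketch of common actions (collect, count, take), assuming sc is a SparkContext:

    rdd = sc.parallelize([1, 2, 3, 4])
    squares = rdd.map(lambda x: x * x)  # transformation: nothing runs yet
    print(squares.collect())  # action: runs the recipe, returns [1, 4, 9, 16]
    print(squares.count())    # action: returns 4 to the driver
    print(squares.take(2))    # action: returns the first two elements, [1, 4]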

 

About Caching RDDs

If you plan to reuse an RDD, you should cache it, so that its data doesn't have to be read from disk again and again; instead it can be served from RAM.

Figures 6-7: Spark action examples 3 and 4 (reading the data twice vs. reading it once with cache, which is fast)
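
A minimal sketch of the difference (the file name data.txt is a hypothetical example):

    lines = sc.textFile("data.txt")  # hypothetical input file
    lines.cache()                    # mark the RDD to be kept in RAM
    print(lines.count())  # first action: reads from disk, then caches
    print(lines.count())  # second action: served from RAM, much faster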

 

Spark Program Lifecycle

Figure 8: the Spark program lifecycle

Some key points about Spark Program Lifecycle:

  • RDDs that are reused may be cached
  • Transformations lazily create new RDDs
  • Transformations create recipes for performing parallel computation on datasets
  • Actions cause parallel computation to be immediately executed
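
Putting the lifecycle together in one sketch (log.txt is a hypothetical file):

    # 1. Create an RDD from external data.
    lines = sc.textFile("log.txt")
    # 2. Lazily transform it into new RDDs (recipes only).
    errors = lines.filter(lambda line: "ERROR" in line)
    # 3. Cache the RDD that will be reused.
    errors.cache()
    # 4. Actions trigger the parallel computation.
    print(errors.count())  # computes errors and caches it
    print(errors.take(5))  # reuses the cached data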

 

A few words about Key-Value Transformations

Figures 9-11: Spark key-value operations, with two examples
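
As an illustrative sketch of the common key-value transformations (reduceByKey, groupByKey, sortByKey) on a small pair RDD; the ordering of collect() results may vary:

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)]
    print(pairs.sortByKey().collect())                      # [('a', 1), ('a', 3), ('b', 2)]
    print(pairs.groupByKey().mapValues(list).collect())     # [('a', [1, 3]), ('b', [2])]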

 

PySpark Shared Variables:

In iterative or repeated computations, broadcast variables avoid the problem of repeatedly sending the same data to workers: a broadcast variable is sent to each worker once, whereas data referenced in a closure is automatically shipped with every task that uses it.
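
A minimal sketch: the small lookup table below is shipped to each worker once via sc.broadcast, rather than once per task inside the closure:

    lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})  # sent to each worker once
    rdd = sc.parallelize(["a", "b", "c", "a"])
    print(rdd.map(lambda k: lookup.value[k]).collect())  # [1, 2, 3, 1]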

Accumulators can only be written by workers and read by the driver program.
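
For example, workers call add on an accumulator inside an action, and only the driver reads its value:

    total = sc.accumulator(0)
    rdd = sc.parallelize([1, 2, 3, 4])
    rdd.foreach(lambda x: total.add(x))  # workers write to the accumulator
    print(total.value)                   # driver reads the result: 10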