Apache Spark Basics – RDDs and Operation Types
When starting with Apache Spark, a “lightning-fast cluster computing” engine, it is important to understand how Spark fits into the Hadoop ecosystem. This article provides a brief overview of Spark’s distinctive features and its ties to Hadoop.
Hadoop has been around for about 12 years and has dominated the Big Data space by providing reliable distributed processing of large data sets. It was inspired by, and provides a framework for, a programming pattern called MapReduce, invented at Google. To guarantee reliable processing, Hadoop comes with its own flavour of a distributed file system, known as HDFS. Arguably one of the most valuable features of HDFS is automatic fault detection and quick recovery from errors. There are other parts to Hadoop’s architecture, but let’s leave them out for now.
Reliable and robust as it is, Hadoop is not designed for speed. That’s where Spark comes into play. Since disk operations cause a slowdown, Spark’s architecture revolves around in-memory processing in the form of Resilient Distributed Datasets (RDDs). When it comes to persistence, Spark cleverly skips a solution of its own and instead exposes a high-level API to HDFS and other storage systems.
That’s enough of an introduction. Let me quickly run through Spark’s core features and then show some code to make it all look a little less abstract.
RDDs in a Nutshell
- are fault-tolerant, as in zero data loss
- are operated on in parallel in a cluster of workers
- wrap around the actual data (textual, numeric or custom objects), as sketched below
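To make that a little more concrete, here is a minimal sketch of creating RDDs in the interactive spark-shell, where a SparkContext named sc is already in scope; the file path is a placeholder of my own, not part of the article’s example.

// An RDD wrapping numeric data, partitioned across the workers in the cluster
val numbers = sc.parallelize(1 to 1000)

// An RDD wrapping textual data, one element per line of the (placeholder) input file
val lines = sc.textFile("hdfs:///path/to/input.txt")

// Custom objects work too, as long as they are serializable
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ada", 36), Person("Alan", 41)))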
There are two types of operations which can be performed on RDDs: transformations and actions.
Transformations
- create new RDDs according to transformation rules
- are lazy, i.e. nothing happens until an action is called (see the sketch after this list)
- represent the Map part of the MapReduce paradigm
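A quick sketch of that laziness, again assuming the spark-shell’s sc and a placeholder input path: every line below merely declares a new RDD, and no data is read or processed yet.

// Transformations only build up a lineage of RDDs -- nothing runs here
val lines    = sc.textFile("hdfs:///path/to/input.txt")  // placeholder path
val words    = lines.flatMap(_.split("\\s+"))            // split lines into words
val nonEmpty = words.filter(_.nonEmpty)                  // drop empty strings
val lower    = nonEmpty.map(_.toLowerCase)               // normalise the case
// Still no work has been done; Spark waits for an action to be called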
Actions
- trigger a transformation workflow
- return aggregated results back to the driver program (beware memory consumption; see the sketch below)
- represent the Reduce part of the MapReduce paradigm
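Continuing the sketch above, calling an action is what finally sets the workers in motion; collect in particular ships the entire data set to the driver, hence the warning about memory consumption.

val total      = lower.count()    // triggers the whole transformation chain, returns a Long
val sample     = lower.take(10)   // cheap: only a handful of elements reach the driver
val everything = lower.collect()  // beware: pulls every element into the driver's memory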
Code time! A word count example is the hello world of the Big Data realm. The example parses a file containing random text (lorem ipsum) and produces a lexically ordered set of word counts, such as this one:
(a,9) (ac,13) (accumsan,1) (ad,1) (adipiscing,1) (aenean,6) (aliquam,7) .. (lorem,5) (luctus,5) (maecenas,1) (magna,8) ..
As Spark leverages MapReduce, the corresponding code only takes a few lines (of Scala):
https://gist.github.com/zezutom/edbdc5b1dd75e4b990a5
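If you would rather not follow the link, a sketch along the same lines might look like the snippet below; the input and output paths and the exact chain of calls are my own approximation, not a copy of the gist.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

    sc.textFile("lorem-ipsum.txt")        // placeholder input file
      .flatMap(_.split("\\W+"))           // split the text into words
      .map(_.toLowerCase)                 // normalise the case
      .filter(_.nonEmpty)                 // drop empty tokens
      .map(word => (word, 1))             // pair each word with a count of one
      .reduceByKey(_ + _)                 // sum up the occurrences of each word
      .sortByKey()                        // lexical order
      .saveAsTextFile("word-counts")      // the only action: persists the result

    sc.stop()
  }
}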
Spark’s transformations take up most of the gist’s ten lines of Scala code: flatMap, map and filter manipulate the text, while reduceByKey sums up the occurrences of each and every word and sortByKey puts the final set in lexical order.
Although reduceByKey and sortByKey feel like aggregation steps, they are still transformations: they are processed in parallel across the cluster and return yet another RDD rather than pulling data back to the driver. To get to the actual results, a sorted set of word count pairs, we need an action. Here it is saveAsTextFile, which triggers the whole workflow and, through Spark’s high-level storage API, persists the output into HDFS.
There is much more to grasp when beginning with Apache Spark, and I hope this brief write-up gives you an idea of the basics. The source code, along with detailed instructions, can be found on GitHub as part of a project called Spark by Example.