Apache Spark Basics – Accumulators and Broadcast Variables

In my previous post I talked about RDDs as an abstraction for parallel data processing. Today, I’d like to briefly discuss accumulators and broadcast variables and walk through an example of each.

Accumulators

  • counters or sums that can be reliably updated in parallel processing
  • native support for numeric types; extensions are possible via the API
  • workers can add to an accumulator, but cannot read its contents
  • only the driver program can read the accumulated value (see the sketch below)
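
To make these constraints concrete, here is a minimal sketch in Scala, assuming the spark-shell (where a SparkContext named sc is already available) and the Spark 1.x API, in which accumulators are created with sc.accumulator (newer releases use sc.longAccumulator instead). The example data is made up purely for illustration:

// Workers may only add to the accumulator, they cannot read it
val blankLines = sc.accumulator(0, "blank lines")

sc.parallelize(Seq("first line", "", "third line", ""))
  .foreach(line => if (line.trim.isEmpty) blankLines += 1)

// Only the driver can read the accumulated value
println(s"blank lines: ${blankLines.value}")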

Broadcast Variables

  • allow for efficient sharing of potentially large data sets
  • intended as read-only reference data for workers
  • cached on each node, transported via an efficient broadcast protocol, serialized and deserialized as needed (see the sketch below)
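
Broadcast variables follow a similar pattern. A minimal spark-shell sketch, with a made-up lookup table standing in for the shared reference data:

// Shipped to each worker node once and cached there, rather than once per task
val countryNames = sc.broadcast(Map("cz" -> "Czech Republic", "fr" -> "France"))

// Workers read the shared data through .value
sc.parallelize(Seq("cz", "fr", "cz"))
  .map(code => countryNames.value.getOrElse(code, "unknown"))
  .collect()
  .foreach(println)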

Suppose we are tasked with writing a text analyser (GitHub). For instance, we could upload a larger piece of text (English only, for simplicity), such as an e-book, and collect some basic facts: the total number of characters and words, as well as a list of the most frequent words.

As the text is broken into smaller chunks processed in parallel, the counting part naturally lends itself to the use of accumulators.

https://gist.github.com/zezutom/f4d214e92e8867a8814b
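
The gist above contains the actual implementation; in essence, the counting boils down to something like the following sketch. The accumulator names and the input path are mine, made up for illustration, and may differ from the project’s code:

// One accumulator per metric; each partition adds its share
val chars = sc.accumulator(0, "characters")
val words = sc.accumulator(0, "words")

sc.textFile("path/to/book.txt").foreach { line =>
  chars += line.length
  words += line.split("\\s+").count(_.nonEmpty)
}

// Read back on the driver once the action has completed
println(s"characters: ${chars.value}, words: ${words.value}")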

Trying to “describe” the book by extracting its most frequent words is surely more exciting than just counting characters. One of the first challenges is dealing with words that are heavily overused but carry little information, such as articles, prepositions, pronouns and even some verbs. Our application maintains a list of these common words and filters them out when parsing the book content.

This is where broadcast variables come into play. The list of common words could grow fairly long, and it would be inefficient to ship a separate copy to each and every task. A broadcast variable improves performance by caching the list on each worker and reducing network traffic thanks to a specialized broadcast protocol.

https://gist.github.com/zezutom/24c87900224969edc9d3
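
As before, the gist holds the real code; a simplified sketch of the idea, with a deliberately abbreviated stop-word list and a made-up input path, might look like this:

// The full list of common words is much longer; broadcast it once
val stopWords = sc.broadcast(Set("the", "a", "an", "and", "of", "to", "in"))

sc.textFile("path/to/book.txt")
  .flatMap(_.toLowerCase.split("[^a-z]+"))
  .filter(word => word.nonEmpty && !stopWords.value.contains(word))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)   // most frequent first
  .take(5)
  .foreach(println)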

Finally, we can have some fun and reach for a genuine masterpiece, such as 20,000 Leagues Under the Sea by Jules Verne, courtesy of textfiles.com. Here is what the text analyser concluded about this remarkable book:

characters: 568889, words: 101838, the most frequent words:
(captain,564)
(nautilus,493)
(nemo,334)
(ned,283)
(sea,273)

The source code, along with detailed instructions, can be found on GitHub as part of a project called Spark by Example.
