Are you considering learning Scala? Why should you use this programming language?
Data Engineer David Lyon tells us why Scala is the perfect language for your distributed computing project!
David Lyon is an alumnus of the Insight Data Engineering program in Silicon Valley, now working at Facebook as a Data Engineer. In this post, he tells us why Scala and Spark are such a perfect combination.
In this guide, I will make the case for why Scala’s features make it the ideal language to use for your next distributed computing project.
The most important Scala features:
- Scala is functional
- Scala is strongly typed
- Scala uses the Java Virtual Machine
- Spark is written in Scala
- Scala is the highest paying language of 2017
The question many new Insight Fellows ask is:
“Should I learn Scala and Data Engineering together?”
For Fellows with a strong math background, functional programming will make perfect sense and you’ll take to it like a Californian to avocado toast. However, I’d like to warn those of you without a math background who are also new to both distributed computing and functional programming: learning them together will be like learning to ride a unicycle before learning to ride a bicycle.
Scala is Functional
The key distinction between functional and imperative programming is the concept of a pure function.
A pure function’s output only depends on its input parameters, which are immutable. It will always produce exactly the same output given the same input parameters. This output consists entirely of its return value. Any other output is called a “side effect” and renders the function impure.
Example of a Pure Function
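A minimal sketch in Scala (the function name and inputs are illustrative):

```scala
// A pure function: the result depends only on the immutable inputs,
// and the only output is the return value. No state is read or written.
def add(x: Int, y: Int): Int = x + y

// The same inputs always produce the same output.
assert(add(2, 3) == 5)
assert(add(2, 3) == add(2, 3))
```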
Contrast this with imperative programming, in which procedures mutate their input parameters or otherwise modify variables outside of their scope. Imperative procedures often return None, Null, or a similar empty type.
A procedure with side effects
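For contrast, a sketch of an impure procedure in Scala (names are illustrative). It returns `Unit`, Scala's counterpart to the empty types mentioned above, and does all of its work through side effects:

```scala
import scala.collection.mutable.ArrayBuffer

// External state that the procedure mutates.
val visits = ArrayBuffer.empty[String]

// An impure procedure: it returns Unit and modifies a variable
// outside of its own scope.
def recordVisit(user: String): Unit = {
  visits += user // side effect: mutates external state
}

recordVisit("alice")
assert(visits.toList == List("alice"))
```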
Why do pure functions matter? There are two types of parallelization: task and data. Many programmers first meet the concept through task parallelization. The most common example is multi-threaded programming on a multi-core CPU where each core has access to the same global memory. In that case, imperative programming, which may depend on internal or global state, is not so dangerous, because the program will still produce repeatable results as long as every thread sees the same data on every run.
Data Engineering, however, is about data parallelization.
A Data Engineer (DE) may need to process data equal to a thousand times the memory capacity of a single CPU. In that case, the data will have to be sharded into a thousand pieces, and each instance of the procedure will see only a small fraction of the total data. Reasoning about the interaction of thousands or millions of procedures, each with their own independent side effects, becomes impossible. The ways that race conditions or lost data can occur due to interacting side effects are too numerous to list.
Another functional programming advantage of Scala is partial function application.
A traditional function of 5 parameters needs to assemble all 5 parameters at the same place and time before it can operate. However, in Scala, a function of 5 parameters can, for example, accept 3 parameters and return a function of 2 parameters.
A common use case of partial function application could be combining continuously arriving streaming user data with daily aggregated data that arrives once per day, after midnight. The 5 parameter function could be continuously transformed into its 2 parameter version all day by evaluating 3 parameters from streaming user data. The 2 parameter functions could be sent to a new location to await midnight. At midnight, an entire day’s worth of 2 parameter functions could finish their calculations by applying the last two parameters from the aggregated data to the function.
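A sketch of that idea in Scala, using placeholder syntax for partial application. The 5-parameter function and its parameter names are hypothetical, purely for illustration:

```scala
// Hypothetical 5-parameter scoring function; names are illustrative.
def score(clicks: Int, views: Int, purchases: Int,
          dailyAvg: Double, dailyMax: Double): Double =
  (clicks + views + purchases) * dailyAvg / dailyMax

// During the day: bind the three streaming parameters as events arrive,
// producing a function of the remaining two parameters.
val awaitingAggregates: (Double, Double) => Double = score(12, 340, 2, _, _)

// After midnight: supply the two aggregated parameters to finish.
val result = awaitingAggregates(4.5, 9.0)
assert(result == 177.0)
```

The partially applied function can be stored or shipped elsewhere until the remaining arguments become available.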
The special case where parameters are evaluated one at a time is called currying. The simplest case of currying, two parameters being evaluated sequentially, is shown below.
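For instance, a curried version of a two-parameter multiply (names are illustrative):

```scala
// A curried function: each parameter list is applied in turn.
def multiply(x: Int)(y: Int): Int = x * y

// Applying the first parameter yields a function awaiting the second.
val double: Int => Int = multiply(2)(_)

assert(double(21) == 42)
assert(multiply(3)(4) == 12)
```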
Lastly, Scala allows lazy evaluation. Loosely speaking, lazy evaluation means passing unevaluated expressions forward as functions, deferring computation until a value is actually required, for example for I/O.
Let me give an example. You have a billion rows of scanned user data and you want to find 10 rows with “Steve” in the first name field. However, the data quality is dubious: some rows contain names like “SteVe” and “?ste!ve”.
First you need to strip non-alphanumeric characters, then convert the name to lowercase, and finally compare the result with “steve”. Without lazy evaluation, your program will first strip non-alphanumerics from 1 billion rows, then convert 1 billion rows to lowercase, and only then discover that the first 10 rows were “steve” all along. With lazy evaluation, Scala will stream individual rows through the 3-function pipeline until it finds the 10 results it needs, and then halt. Whew, roughly 999,999,990 rows of work saved by being lazy!
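A sketch of that pipeline using a lazy collection view (the sample rows are made up, and far smaller than a billion):

```scala
// Dirty sample data; in practice this would be a billion rows.
val rows: Seq[String] =
  Seq("SteVe", "?ste!ve", "Alice", "Steve") ++ Seq.fill(1000)("Bob")

val steves = rows.view                // lazy: nothing runs yet
  .map(_.filter(_.isLetterOrDigit))   // 1. strip non-alphanumerics
  .map(_.toLowerCase)                 // 2. convert to lowercase
  .filter(_ == "steve")               // 3. compare with "steve"
  .take(10)                           // stop after 10 matches
  .toList                             // force evaluation

assert(steves == List("steve", "steve", "steve"))
```

Because the view is lazy, each row flows through the whole pipeline individually, and once `take(10)` is satisfied the remaining rows are never touched.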
Scala uses the JVM
Scala is Strongly Typed
Spark is Written in Scala
Who should use Scala for their next Spark project? People who are already familiar with DE or Spark, with functional programming but not Scala, or who have a strong mathematical background. If you’re unfamiliar with DE and already know Python or Java well, learn Spark first using Python or Java. Scala will be waiting for you. By the way, Scala is the highest paying language of 2017, in case you had forgotten.
For structured learning, the courses in EPFL’s Functional Programming in Scala specialization on Coursera are a great place to start:
- Functional Programming Principles in Scala
- Functional Program Design in Scala
- Parallel programming
- Big Data Analysis with Scala and Spark
- Functional Programming in Scala Capstone
A great book for learning both Scala and functional programming is Functional Programming in Scala, which teaches the most important functional programming concepts using concise Scala exercises, assuming no prior knowledge of either.