A Newbie’s Guide to Scala and Why It’s Used for Distributed Computing by David Lyon

Are you considering learning Scala? Why should you use this programming language?

Data Scientist David Lyon tells us why Scala is the perfect language for your distributed computing project!

 

'David Lyon is an alum from the Insight Data Engineering program in Silicon Valley, now working at Facebook as a Data Engineer. In this post, he tells us why Scala and Spark are such a perfect combination.

In this guide, I will make the case for why Scala’s features make it the ideal language to use for your next distributed computing project.

The most important Scala features:

  1. Scala is functional
  2. Scala is strongly typed
  3. Scala uses the Java Virtual Machine
  4. Spark is written in Scala
  5. Scala is the highest paying language of 2017

The question many new Insight Fellows ask is:

“Should I learn Scala and Data Engineering together?”

For Fellows with a strong math background, functional programming will make perfect sense and you’ll take to it like a Californian to avocado toast. However, I’d like to warn those of you without a math background who are also new to both distributed computing and functional programming: learning them together will be like learning to ride a unicycle before learning to ride a bicycle.

Scala is Functional

The key distinction between functional and imperative programming is the concept of a pure function.

A pure function’s output only depends on its input parameters, which are immutable. It will always produce exactly the same output given the same input parameters. This output consists entirely of its return value. Any other output is called a “side effect” and renders the function impure.



Example of a Pure Function
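
As a minimal sketch of what a pure function can look like in Scala (the names `PureExample`, `add`, and `doubled` are illustrative assumptions, not taken from the original post):

```scala
object PureExample {
  // Pure: the result depends only on the two arguments,
  // and nothing outside the function is read or changed.
  def add(a: Int, b: Int): Int = a + b

  // Also pure: returns a new list and leaves the input list untouched.
  def doubled(xs: List[Int]): List[Int] = xs.map(_ * 2)
}
```

Calling `PureExample.add(2, 3)` gives 5 every time, no matter what else the program has done before.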

Contrast this with imperative programming, in which procedures mutate their input parameters or otherwise modify variables outside of their scope. Imperative procedures often return None, Null, or a similar empty type.

A procedure with side effects
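
For contrast, a hedged sketch of an impure procedure. The names `ImpureExample`, `callCount`, and `doubleInPlace` are hypothetical, chosen only for illustration; in Scala the "empty" return type is `Unit`.

```scala
import scala.collection.mutable.ArrayBuffer

object ImpureExample {
  // Mutable state that lives outside the procedure.
  var callCount: Int = 0

  // Impure: mutates its argument, mutates the outer counter, and returns Unit.
  def doubleInPlace(xs: ArrayBuffer[Int]): Unit = {
    for (i <- xs.indices) xs(i) = xs(i) * 2
    callCount += 1
  }
}
```

Calling `doubleInPlace` twice on the same buffer gives a different buffer each time, so the outcome depends on the history of the program and not just on the arguments.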

Why do pure functions matter? There are two types of parallelization: task and data. Many programmers first meet the concept through task parallelization. The most common example is multi-threaded programming on a multi-core CPU, where each core has access to the same global memory. In that case, imperative programming, which may depend on internal or global state, is less dangerous: every thread sees the same data on every run, so the program can still produce repeatable results.

Data Engineering, however, is about data parallelization.

 

A Data Engineer (DE) may need to process data equal to a thousand times the memory capacity of a single machine. In that case, the data will have to be sharded into a thousand pieces, and each instance of the procedure will see only a small fraction of the total data. Reasoning about the interaction of thousands or millions of procedures, each with its own independent side effects, becomes impossible. The ways that race conditions or lost data can occur due to interacting side effects are too numerous to list.
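
The sharding idea can be sketched on a single machine, under the assumption that each group produced by `grouped` stands in for one shard on a separate worker. The names `ShardedSum`, `square`, and `sumOfSquares` are hypothetical. Because `square` is pure, each shard can be processed in isolation and only the return values need to be combined.

```scala
object ShardedSum {
  // Pure per-row function: safe to run on any shard, in any order.
  def square(x: Long): Long = x * x

  // Each shard is processed independently; only return values are combined.
  def sumOfSquares(data: Vector[Long], shardSize: Int): Long =
    data
      .grouped(shardSize)                   // split the data into shards
      .map(shard => shard.map(square).sum)  // per-shard work, no shared state
      .sum                                  // combine the partial results
}
```

The result does not depend on how the data is split: `sumOfSquares(Vector(1L, 2L, 3L, 4L), 2)` and the same call with a shard size of 3 both return 30.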

Another functional programming advantage of Scala is partial function application.

A traditional function of 5 parameters needs to assemble all 5 parameters at the same place and time before it can operate. However, in Scala, a function of 5 parameters can, for example, accept 3 parameters and return a function of 2 parameters.

A common use case of partial function application could be combining continuously arriving streaming user data with daily aggregated data that arrives once per day, after midnight. The 5 parameter function could be continuously transformed into its 2 parameter version all day by evaluating 3 parameters from streaming user data. The 2 parameter functions could be sent to a new location to await midnight. At midnight, an entire day’s worth of 2 parameter functions could finish their calculations by applying the last two parameters from the aggregated data to the function.
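
A minimal sketch of that workflow, assuming a made-up scoring formula. The function `score` and its parameter names are hypothetical and stand in for whatever the real 5-parameter calculation would be.

```scala
object PartialApplication {
  // Hypothetical 5-parameter function: the first three arguments stand in for
  // streaming user data, the last two for the daily aggregates.
  def score(clicks: Int, views: Int, purchases: Int, dailyMean: Int, dailyMax: Int): Int =
    (clicks + views + purchases - dailyMean) * dailyMax

  // Supplying three arguments yields a function that still awaits the other two.
  def fromStream(clicks: Int, views: Int, purchases: Int): (Int, Int) => Int =
    score(clicks, views, purchases, _: Int, _: Int)
}
```

During the day, `val pending = PartialApplication.fromStream(3, 10, 1)` captures the streaming values. After midnight, `pending(4, 2)` supplies the aggregates and returns 20, the same value as calling `score(3, 10, 1, 4, 2)` directly.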

The special case where parameters are evaluated one at a time is called currying. In the simplest case, a function of two parameters takes them one after the other.
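
A short sketch of the two-parameter case in Scala, using illustrative names (`CurryingExample`, `add`, `addFive`, `multiply`) that are not from the original post:

```scala
object CurryingExample {
  // Two parameter lists: the arguments are supplied one at a time.
  def add(a: Int)(b: Int): Int = a + b

  // Supplying only the first argument returns a function awaiting the second.
  val addFive: Int => Int = add(5)

  // An ordinary two-argument function can be converted with `.curried`.
  val multiply: (Int, Int) => Int = (a, b) => a * b
  val curriedMultiply: Int => Int => Int = multiply.curried
}
```

Here `CurryingExample.addFive(3)` evaluates to 8, and `CurryingExample.curriedMultiply(3)(4)` evaluates to 12.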

 

Lastly, Scala allows lazy evaluation. Loosely speaking, lazy evaluation means passing unevaluated parameters forward as functions until an answer is required, and then evaluating the parameters when a value is needed for I/O.

Let me give an example. You have a billion rows of scanned user data and you want to find 10 rows with “Steve” in the first name field. However, the data quality is dubious: some rows contain names like “SteVe” and “?ste!ve”.

First you need to strip non-alphanumeric characters, then convert the name to lowercase, and finally compare “steve” with the result. Without lazy evaluation, your program will first strip non-alphanumerics from 1 billion rows, then convert 1 billion rows to lowercase, and finally discover that the first 10 rows were “steve” the whole time. With lazy evaluation, Scala will stream single rows through the 3-function pipeline until it finds the 10 results it needs, and then halt streaming. Whew, 999,999,990 rows of work saved by being lazy!
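
The pipeline can be sketched with a Scala `Iterator`, which is one of the lazy options in the standard library (strict collections such as `List` would process every row at each step). The names `LazySearch`, `strip`, `lower`, and `firstMatches` are hypothetical, and the `examined` counter is a deliberate side effect added only to show how many rows were touched.

```scala
object LazySearch {
  // Step 1: strip non-alphanumeric characters.
  def strip(name: String): String = name.filter(_.isLetterOrDigit)

  // Step 2: convert to lowercase.
  def lower(name: String): String = name.toLowerCase

  // Returns the first `limit` matches plus the number of rows actually examined.
  def firstMatches(rows: Iterator[String], target: String, limit: Int): (List[String], Int) = {
    var examined = 0 // side-effecting counter, kept only to show the early exit
    val matches = rows
      .map { row => examined += 1; strip(row) } // lazy: nothing runs yet
      .map(lower)                               // lazy
      .filter(_ == target)                      // step 3: compare, still lazy
      .take(limit)                              // stop after `limit` matches
      .toList                                   // forces evaluation, row by row
    (matches, examined)
  }
}
```

For example, `LazySearch.firstMatches(Iterator("SteVe", "?ste!ve", "bob", "steve") ++ Iterator.fill(1000)("alice"), "steve", 3)` returns the three matches after examining only the first four rows; the remaining thousand are never cleaned or compared.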

 

Scala uses the JVM

As of the Fall of 2017, Java is still the most popular programming language. Why abandon Java for Scala, a much newer and much less popular language? Java brings two decades of packages and the Java Virtual Machine (JVM), which allows the same compiled code to run on any platform that has a JVM. Fortunately, you don’t have to give up anything! Scala compiles to Java bytecode and can call Java libraries directly. Think of Scala as concise Java, but fully functional.
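
As a small illustration of that interoperability, the sketch below uses two classes from the Java standard library directly from Scala. The wrapper names `JavaInterop`, `javaNames`, and `daysInFeb2017` are assumptions made for the example.

```scala
import java.time.LocalDate
import java.util.{ArrayList => JArrayList}

object JavaInterop {
  // A plain Java collection, created and filled from Scala.
  def javaNames(): JArrayList[String] = {
    val names = new JArrayList[String]()
    names.add("Scala")
    names.add("Java")
    names
  }

  // A Java standard-library class used without any wrapper.
  def daysInFeb2017: Int = LocalDate.of(2017, 2, 1).lengthOfMonth()
}
```

No bridging layer is involved: `javaNames()` returns an ordinary `java.util.ArrayList` that Java code could consume as-is.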

 

Scala is Strongly Typed

Scala has a strong type system, including static types and type variance. Within the last few years, the trend has been towards stronger type systems. For example, Hack was developed by Facebook in 2014 to bring static types to PHP to allow it to scale to thousands of engineers. Similarly, TypeScript, developed by Microsoft in 2012, is the statically typed counterpart to JavaScript and has been replacing its parent language in many large scale projects.

Dynamic type systems, like those used in Python, PHP, and JavaScript, can make prototyping and small projects easier. However, complex dynamically typed code is difficult to reason about. One advantage of statically typed languages is that the IDE or compiler will catch type errors, whereas such errors in dynamically typed languages will only show up at runtime. Similarly, a type variance system allows errors related to higher order types to be caught by the IDE or compiler, rather than at runtime.
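
A hedged sketch of what type variance looks like in practice. All of the types here (`Animal`, `Dog`, `Shelter`, `Printer`) are hypothetical and exist only to show the `+A` and `-A` annotations; the commented-out line is the kind of mistake the compiler rejects before the program ever runs.

```scala
object VarianceExample {
  sealed trait Animal { def name: String }
  final case class Dog(name: String) extends Animal

  // Covariant (+A): a Shelter[Dog] can be used where a Shelter[Animal] is expected.
  final case class Shelter[+A](residents: List[A])

  // Contravariant (-A): a Printer[Animal] can be used where a Printer[Dog] is expected.
  trait Printer[-A] { def show(a: A): String }

  def names(shelter: Shelter[Animal]): List[String] = shelter.residents.map(_.name)

  val animalPrinter: Printer[Animal] = new Printer[Animal] {
    def show(a: Animal): String = s"animal:${a.name}"
  }
  val dogPrinter: Printer[Dog] = animalPrinter    // compiles because of -A

  val dogShelter: Shelter[Dog] = Shelter(List(Dog("Rex")))
  val dogNames: List[String] = names(dogShelter)  // compiles because of +A

  // val bad: Shelter[Dog] = Shelter[Animal](Nil) // rejected at compile time
}
```

Without the `+` on `Shelter`, the call `names(dogShelter)` would be a compile error as well, which is the sense in which variance mistakes surface in the IDE or compiler instead of at runtime.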

 

Spark is Written in Scala

In 2017, Spark passed Hadoop in search popularity. The current generation of big data companies still store their data in the Hadoop Distributed File System (HDFS). However, Hadoop MapReduce, its disk-based big data processing engine, is being replaced by a new generation of memory-based processing frameworks, the most popular of which is Spark. Spark supports Scala, Java, Python, and R. While the latter two grant access to a wide array of Data Science tools, they are not JVM-based languages and are better suited to data exploration, prototyping, and small projects, where results can be collected onto a single machine.

Anyone who aspires to become a Spark guru must be able to read the source code, which is written in Scala. This also means that the Spark Scala API is always the most up-to-date and best documented. In addition, Scala’s functional paradigm and strong typing allow one to reason about large projects at a small granularity without worrying about runtime type mismatches and side effects.
 

Conclusion

Who should use Scala for their next Spark project? People who are already familiar with DE or Spark, who know functional programming but not Scala, or who have a strong mathematical background. If you’re unfamiliar with DE and already know Python or Java well, learn Spark first using Python or Java. Scala will be waiting for you. By the way, Scala is the highest paying language of 2017, in case you had forgotten.

 

Next steps

Want to learn Scala? The canonical source is Coursera’s Functional Programming in Scala Specialization by Martin Odersky, the designer of Scala. The five courses are:
  1. Functional Programming Principles in Scala
  2. Functional Program Design in Scala
  3. Parallel Programming
  4. Big Data Analysis with Scala and Spark
  5. Functional Programming in Scala Capstone

A great book for learning both Scala and Functional Programming is Functional Programming in Scala, which teaches the most important functional programming concepts using concise Scala exercises, assuming no starting knowledge of either.'

 
This article was written by David Lyon and posted originally on Medium.