This article was written by Nate Kupp, who loves building
infrastructure, currently working at Thumbtack,
formerly at Apple.
My thoughts on language choices in Spark.
At Thumbtack, the vast majority of our batch jobs and data
processing pipelines are built on Spark, and specifically, Scala/Spark.
A question I’ve gotten frequently from engineers on the team,
when they first start diving into this codebase and realize that
Spark also supports Python:
“Why do we use Scala? Why not Python?”
Most people (including myself) find Scala hard to learn at first.
It’s a complicated language with a large surface area, and to be
honest, you need to learn an awful lot of the language to really be productive—and especially to understand an existing codebase.
Because Scala is so difficult, many are surprised that we chose it as our supported language for data processing at Thumbtack. I’ve heard “Python is much easier to use than Scala”, “Python has broader adoption”, “Python is easier to debug”, and “Python is faster to iterate”.
Why Scala, then? Presumably we’re not all just suckers for pain.
To answer this question, I want to point to a few of the core priorities we think about in our data infrastructure work. These are:
- Correctness: Ensuring that data is correct and accurate is a P0 for a data engineering team. A bug in a data pipeline often results in subtle errors in downstream systems and analytics, and even egregious bugs are often not caught until weeks or months later. Meanwhile, the organization may have made many decisions based on wrong data.
- Reliability: Every data pipeline that goes into production is one more chance for a late night PagerDuty call. We prefer to sleep soundly, so the more reliable our infrastructure is, the happier (and more productive) our engineering team is.
- Scalability: Rapid growth means that we can’t get away with non-distributed solutions for long. In six months, the Python/pandas or PHP script hacked together in an afternoon is probably going to be the root cause of a prod failure, and it is never fun to drop everything to go and refactor a now production-critical system while under the gun. A good rule of thumb: no matter how much you believe it won’t become production-critical, it always will when it is least convenient.
- Development Speed: We’re still a startup, it’s still day 1, and getting code shipped fast is critical to our success. This means that things like ease of development, discoverability of data, and good tools / automation are critical. There’s a subtle point here: maintaining development speed also means being thoughtful about the balance between tool fragmentation and empowering the team with a diverse range of tools to get things done.
Our systems are all designed around these priorities, and I believe these priorities are largely shared by most data engineers, Thumbtack or otherwise.
So where do we miss each other?
I believe Scala and Python advocates miss each other because each side thinks the other is misprioritizing one or more of the core priorities I enumerated earlier. A Scala developer might see Python as undervaluing correctness and reliability, whereas a Python developer might find working in Scala extremely clunky, slow, and painful.
In reality, Scala vs. Python in Spark is really just the big data special-case of the broader debate on static vs. dynamic typing. I loved this graphic from a recent article on the subject:
You might easily substitute Scala for Haskell and arrive at similar conclusions based on what your initial assumptions are.
Scala/Spark has served us well in building data pipelines that are correct, reliable, and scalable. We plan to continue investing engineering resources in making it extremely simple to author and deploy Scala/Spark data processing jobs. Ideally, in most cases, the actual code that an engineer needs to write should be very simple, including only:
- Configuration to specify data source(s)
- Code to express transformations (map, reduce, join, etc.) of the data
- Configuration to specify data sink(s)
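As a rough sketch of that shape, here is what the three pieces might look like. This is an illustration under stated assumptions, not our actual job framework: every name (`Request`, `CategoryCount`, `JobConfig`, `PipelineSketch`) is hypothetical, and plain Scala collections stand in for Spark Datasets so the block runs without any dependencies.

```scala
// Hypothetical names throughout; plain Scala collections stand in for Spark Datasets.

// Typed records for the input and output of the job.
final case class Request(id: Long, category: String, completed: Boolean)
final case class CategoryCount(category: String, completed: Int)

// Configuration naming the data source and the data sink (placeholder strings).
final case class JobConfig(source: String, sink: String)

object PipelineSketch {
  // The transformation: keep completed requests, group by category, count.
  // Because the records are typed, misspelling a field such as `completed`
  // is a compile error rather than a failure at run time.
  def transform(requests: Seq[Request]): Seq[CategoryCount] =
    requests
      .filter(_.completed)
      .groupBy(_.category)
      .map { case (category, rows) => CategoryCount(category, rows.size) }
      .toSeq
      .sortBy(_.category)

  // Wiring: read from the configured source, transform, write to the configured sink.
  // The reader and writer are passed in so the job body stays independent of storage.
  def run(
      config: JobConfig,
      read: String => Seq[Request],
      write: (String, Seq[CategoryCount]) => Unit
  ): Unit =
    write(config.sink, transform(read(config.source)))

  def main(args: Array[String]): Unit = {
    val config = JobConfig(source = "in-memory://requests", sink = "stdout://counts")
    val sample = Seq(
      Request(1L, "plumbing", completed = true),
      Request(2L, "plumbing", completed = false),
      Request(3L, "painting", completed = true)
    )
    // In-memory reader and a printing writer, purely for demonstration.
    run(config, _ => sample, (sink, rows) => rows.foreach(r => println(s"$sink $r")))
  }
}
```

The point of the sketch is the division of labor: the engineer writes only `transform`, while reading and writing are driven by configuration and could be shared across every job.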
Even once we’ve fully realized this level of simplicity, we recognize that Scala/Spark does not perfectly map to every use case. For example, we’d love to enable non-engineers (e.g., our analytics team) to build their own pipelines without needing engineering support. Towards that end, we’re considering the introduction of PySpark, or even Spark SQL or BigQuery SQL. Our biggest concern with introducing these options is ensuring that we have consistent approaches to loading and storing data across all of the supported options we provide to the team.
As we look towards future iterations of our data infrastructure, the core priorities above are the lens through which we evaluate and prioritize our infrastructure improvements. Whether we introduce new solutions or invest in existing ones will be driven by how well we can fulfill our commitments to those core priorities. We want to continue ensuring that our data is correct and reliable, that our systems are highly scalable, and that the engineering team isn’t slowed down by fighting with less-than-ergonomic software. Ultimately, I think, that’s where we all can find common ground—regardless of our perspectives on typing.