My thoughts on language choices in Spark.
At Thumbtack, the vast majority of our batch jobs and data
processing pipelines are built on Spark, and specifically, on Spark in Scala.
A question I’ve gotten frequently from engineers on the team,
when they first start diving into this codebase and realize that
Spark also supports Python:
“Why do we use Scala? Why not Python?”
Most people (including myself) find Scala hard to learn at first.
It’s a complicated language with a large surface area, and to be
honest, you need to learn an awful lot of the language to really be productive—and especially to understand an existing codebase.
Because Scala is so difficult, it surprises many people that we’d choose it as our supported language for data processing at Thumbtack. I’ve heard “Python is much easier to use than Scala”, “Python has broader adoption”, “Python is easier to debug”, and “Python is faster to iterate with”.
Why Scala, then? Presumably we’re not all just suckers for pain.
To answer this question, I want to point to a few of the core priorities we think about in our data infrastructure work. These are:
- Correctness: Ensuring that data is correct and accurate is a P0 for a data engineering team. A bug in a data pipeline often results in subtle errors in downstream systems and analytics, and even egregious bugs are often not caught until weeks or months later. Meanwhile, the organization may have made many decisions based on wrong data.
- Reliability: Every data pipeline that goes into production is one more chance for a late night PagerDuty call. We prefer to sleep soundly, so the more reliable our infrastructure is, the happier (and more productive) our engineering team is.
- Scalability: Rapid growth means that we can’t get away with non-distributed solutions for long. In six months, the Python/pandas or PHP script hacked together in an afternoon is probably going to be the root cause of a prod failure, and it is never fun to drop everything and refactor a now production-critical system while under the gun. A good rule of thumb: no matter how firmly you believe a script won’t become production-critical, it will, and at the least convenient moment.
- Development Speed: We’re still a startup, it’s still day 1, and getting code shipped fast is critical to our success. This means that things like ease of development, discoverability of data, and good tooling and automation are critical. There’s a subtle point here: maintaining development speed also means being thoughtful about the balance between limiting tool fragmentation and empowering the team with a diverse range of tools to get things done.
Our systems are all designed around these priorities, and I believe most data engineers share them, at Thumbtack or elsewhere.
So where do we miss each other?