Do you know when you should use Apache Spark and when you should use Apache Flink? They are both open-source, distributed processing frameworks, but each one suits different projects.
Check out this blog, where Software Consultant Anmol Sarna looks at the features of each and when you should choose one over the other!
'In our last blog, we discussed the latest version of Spark, i.e. 2.4, and the new features it brings.
While trying to come up with various approaches to improve our performance, we got the chance to explore one of the major contenders in the race, Apache Flink.
Apache Flink is an open-source streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It started as a research project called Stratosphere; a fork of Stratosphere became what we know as Apache Flink. In 2014, it was accepted as an Apache Incubator project, and just a few months later it became a top-level Apache project.
It is a scalable data analytics framework that is fully compatible with Hadoop. Flink can execute both stream processing and batch processing easily.
In this blog, we will try to get some idea about Apache Flink and how it is different when we compare it to Apache Spark.
If you guys want to know more about Apache Spark, you can go through some of our blogs about Spark RDDs and Spark Streaming.
Apache Spark vs Apache Flink
Apache Spark and Apache Flink are both open-source, distributed processing frameworks that were built to reduce the latencies of Hadoop MapReduce in fast data processing.
Both Spark and Flink support in-memory processing, which gives them a distinct speed advantage over other frameworks.
By the time Flink came along, Apache Spark was already the de facto framework for fast, in-memory big data analytics for a number of organizations around the world, which made Flink appear superfluous. But keep in mind that Apache Flink is closing this gap by the minute: more and more projects are choosing Apache Flink as it matures.
When to choose Apache Spark:
- When it comes to real-time processing of incoming data, Flink does not stand up against Spark, though it has the capability to carry out real-time processing tasks.
- If you don’t need bleeding-edge stream processing features and want to stay on the safe side, it may be better to stick with Apache Spark. It is a more mature project with a bigger user base, more training materials, and more third-party libraries.
- If you want to stay with a somewhat “more reliable” technology, choose Spark, as it has a much more active and wider community that is constantly growing.
When to choose Apache Flink:
- When it comes to speed, Flink gets the upper hand, as it can be programmed to process only the data that has changed. The main reason for this is its stream processing model, which processes rows upon rows of data in real time, something Apache Spark’s batch-oriented processing cannot match. This makes Flink faster than Spark for such workloads.
- If you need to do complex stream processing, then using Apache Flink would be highly recommended.
- Moreover, if you like to experiment with the latest technology, you definitely need to give Apache Flink a shot.
Features of Apache Flink
- Stream processing
Flink is a true streaming engine that can process live streams with sub-second latency.
- Easy and understandable Programmable APIs
Flink’s APIs are designed to cover all the common operations, so programmers can use them efficiently.
- Low latency and High Performance
Apache Flink provides high performance and low latency without any heavy configuration. Its pipelined architecture provides a high throughput rate, and because it processes data at lightning-fast speed it is often called the 4G of Big Data.
- Fault Tolerance
The fault tolerance provided by Apache Flink is based on Chandy-Lamport distributed snapshots, a mechanism that provides strong consistency guarantees. A failure of hardware, a node, software, or a process doesn’t bring down the cluster (a small checkpointing sketch follows this feature list).
- Ease of use
Flink’s APIs are easier to use than programming directly against MapReduce, and Flink applications are easier to test than Hadoop jobs.
- Memory Management
Memory management in Apache Flink gives control over how much memory is used in certain runtime operations. Flink works on managed memory, which helps it avoid out-of-memory exceptions.
- Scalable
Flink is highly scalable; as requirements grow, we can scale the Flink cluster.
- Easy Integration
Apache Flink integrates easily with the wider open-source data processing ecosystem: it works with Hadoop, streams data from Kafka, and runs on YARN (a small Kafka example also follows this list).
- Exactly-once Semantics
Another important feature of Apache Flink is that it can maintain custom state during computation, and its checkpointing mechanism (see the sketch after this list) gives exactly-once guarantees for that state even when failures occur.
- Rich set of operators
Flink has lots of pre-defined operators to process data; all the common operations can be done using these operators.
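As a concrete illustration of the fault tolerance and exactly-once points above, here is a minimal sketch of our own (not from the original example) that enables periodic checkpointing on a trivial pipeline; the checkpoint interval, input values, and job name are arbitrary choices:
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // snapshot the job state every 10 seconds using Flink's
    // Chandy-Lamport-style distributed snapshots, with exactly-once guarantees
    env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE)

    // a trivial stateful pipeline, just so there is some state to checkpoint
    env
      .fromElements("flink", "spark", "flink")
      .map((_, 1))
      .keyBy(0)
      .sum(1)
      .print()

    env.execute("CheckpointedJob")
  }
}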
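And as a sketch of the Kafka integration mentioned above (again our own example, assuming the "org.apache.flink" %% "flink-connector-kafka" % "1.7.0" dependency is on the classpath; the broker address, group id, and topic name are placeholders):
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object KafkaReadSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Kafka connection settings (placeholder values)
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "flink-demo")

    // consume a stream of strings from the "words" topic
    val lines: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer[String]("words", new SimpleStringSchema(), properties))

    lines.print()

    env.execute("KafkaReadSketch")
  }
}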
Simple WordCount Program in Scala
The example project contains a 'WordCount' implementation, the “Hello World” of Big Data processing systems. The goal of 'WordCount' is to determine the frequencies of words in a text.
First, we need to add the dependency for the Flink streaming Scala API to build.sbt:
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.7.0"
Note: in our example, we will be using the latest version of the dependency, i.e. 1.7.0.
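For reference, a minimal build.sbt could look like the following; the project name and Scala version are our own assumptions, not part of the original example:
name := "flink-wordcount"

version := "0.1"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.7.0"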
Next, import the required classes into your code:
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
Here, ParameterTool is used because it provides utility methods for reading and parsing program arguments from different sources.
Once we have made all the imports, we require two major things:
- Input Parameter
val params = ParameterTool.fromArgs(args)
- Execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
Once we have the environment and the params, we need to make these parameters available to the job by adding the following:
env.getConfig.setGlobalJobParameters(params)
Now it is time to get the input data for streaming; we can either provide an input file or fall back to a predefined data set.
val text =
  // read the text file from the given input path
  if (params.has("input")) {
    env.readTextFile(params.get("input"))
  } else {
    println("Executing WordCount example with default input data set.")
    println("Use --input to specify file input.")
    // get default test text data
    env.fromElements(WordCountData.WORDS: _*)
  }
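Here, WordCountData is just a helper object holding some sample sentences (the actual sample data is linked at the end of this post). If you want a quick stand-in to compile against, something like the following would do; the sentences are our own placeholders:
object WordCountData {
  // placeholder sample lines; the real data set is linked at the end of this post
  val WORDS: Array[String] = Array(
    "apache flink is a streaming dataflow engine",
    "apache spark is a distributed processing framework"
  )
}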
Once we have the input data, we apply a series of operations on it to obtain the count for each word.
val counts: DataStream[(String, Int)] = text
  // split up the lines in pairs (2-tuples) containing: (word, 1)
  .flatMap(_.toLowerCase.split("\\W+"))
  .filter(_.nonEmpty)
  .map((_, 1))
  // group by the tuple field "0" and sum up tuple field "1"
  .keyBy(0)
  .sum(1)
Now that we have the counts for each word, we can write the output to a file or print it to the console.
if (params.has("output")) {
  counts.writeAsText(params.get("output"))
} else {
  println("Printing result to stdout. Use --output to specify output path.")
  counts.print()
}
Finally, to execute the program, simply call the execute method with a job name:
env.execute(jobName = "WordCount01")
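Putting all the snippets together, the complete program looks roughly like this (the object name is our own choice; WordCountData is the sample-data object referenced above):
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val params = ParameterTool.fromArgs(args)
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.getConfig.setGlobalJobParameters(params)

    // read the text file from the given input path, or fall back to sample data
    val text =
      if (params.has("input")) env.readTextFile(params.get("input"))
      else env.fromElements(WordCountData.WORDS: _*)

    // split the lines into (word, 1) pairs, group by word and sum the counts
    val counts: DataStream[(String, Int)] = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .sum(1)

    // write to a file if --output is given, otherwise print to stdout
    if (params.has("output")) counts.writeAsText(params.get("output"))
    else counts.print()

    env.execute(jobName = "WordCount01")
  }
}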
Now, to test this simple program, simply run the sbt run command from your terminal and you are ready to go.
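If you want to read from a file and write the result to a file instead, the arguments can be passed through sbt as well (the paths below are placeholders):
sbt "run --input /path/to/input.txt --output /path/to/output.txt"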
I hope you guys now have some idea about Apache Flink and how it is different from Apache Spark. We have also tried to cover a simple word count example using Flink.
We have also added the code for this demo as well as the sample data (WordCountData.WORDS) for your reference here.
For more details, you can refer to the official documentation for Apache Flink.
Hope this helps.'
This article was written by Anmol Sarna and posted originally on blog.knoldus.com