Do you want to know more about Kafka?
Scala and Data Engineer Stéphane Derosiaux tells us about his experience with the platform and the feedback he has gathered using it. It is an amazing tool, so let's learn more about it!
'Today, someone asked me this question: “What makes you a Kafka expert?” at the beginning of a job interview for a freelance position. The job was interesting because the company was considering moving from a monolith to a Kafka stack and micro-services. A huge step! I wanted to know more. In the end, the job interview looked more like a training session than an interview. Damn, I didn’t even bill them!
By no means do I consider myself a Kafka expert. I lack a lot of knowledge about Kafka: I can’t answer every question I’m asked. I’m not a committer. I don’t understand all the KIPs. I just have some intuitions when I don’t know something, but that’s not enough. I’m just a regular developer who has been using the Kafka platform for a few years, so I know a thing or two.
I’ll try to transcribe some elements of the long answer I gave them during the interview. I’m not diving into too many technical details, just sharing some feedback and experiences I had. This article can be a bit intimidating, I’m sorry.
Kafka can be a difficult beast to handle at first, but in the end, I love it.
To get more Kafka (or Scala) insights, you can follow me on Twitter: https://twitter.com/sderosiaux
Kafka is a platform
First, Kafka is not a standalone piece of software. A few years ago it was, but not anymore. Kafka is definitely a platform, a streaming platform: it comes with multiple tools and integrations. It’s not mandatory to use them, but they are quite idiomatic when working with Kafka, because they meet our growing needs:
- Kafka Streams: Kafka to Kafka
- Kafka Connect: DB to Kafka or Kafka to DB
- Kafka REST Proxy: For legacy apps only dealing with HTTP(S)
- Kafka Mirror Maker: Cross data-center replication or just cross-clusters
- Confluent Avro Schema Registry
- Confluent KSQL
- Administration: a wide range of monitoring interfaces is available: Confluent Control Center, LinkedIn’s Burrow, LinkedIn’s Cruise Control, Kafka Eagle, KafkaHQ (by a good friend of mine, check it out!)
They all answer specific and necessary needs any project will run into. We don’t have to write a consumer that exposes data through REST, it exists already. We don’t have to consume from Kafka ourselves to populate our Elasticsearch, it exists already.
We could implement all those things ourselves, just by relying on the pure Kafka Consumer/Producer APIs, sure, but we’d probably forget some cases and run into bugs the community has already fixed (or not). Why bother? Learn about the whole platform.
Being up to date
An expert in any domain should know the latest changes in their domain.
Because Kafka has a very active community (Confluent and contributors), new versions come out quite fast. The APIs change, new features are regularly added, new use-cases are unlocked, code can be simplified, software licences can change, and so on.
While we are working on our company project, we also need to be up to date with the Kafka* releases, follow the KIPs, “understand” them, check out Twitter for the great advice and blog posts of the Confluent team, look at StackOverflow’s questions to see if we can answer them, etc. It’s a full-time job!
Covering all of that makes us aware of where Kafka is “moving”, and of what the common use-cases and needs of developers are. We can also see what their difficulties are (ones we could run into one day): always know your enemies!
Kafka has its semantics to learn
When working with Kafka, we must know how to talk about a lot of things:
- topics, partitions, hot spots, immutability, ordering guarantees, keys, partition leadership, replicas, preferred replica, offsets, reset, consumer groups, lag, broker discovery, zero-copy, bi-directional compatibility, retries, idempotence, max in-flight requests, batching strategies, logs, segments, segment indexes, delivery semantics, polling strategies, commit strategies, heartbeats, keyed topics, partitioner, replication factor, compaction, tombstones, compression, retention, advertised listeners, group coordinator, transactions, watermarks…
If you are working with Kafka, can you describe all of them?
And that’s just about Kafka, I’m not even talking about Kafka Connect nor Kafka Streams!
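To make a few of those terms concrete (keys, partitions, ordering guarantees), here is a minimal sketch of how a keyed record ends up in a partition. It is a toy model, not the real client: `partition_for` and `route` are hypothetical helpers, and crc32 is only a stand-in for the murmur2 hash the Java client uses by default.

```python
import zlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition.

    Assumption: crc32 is used here for illustration only; Kafka's Java
    client hashes the key bytes with murmur2 by default.
    """
    return zlib.crc32(key.encode("utf-8")) % partition_count


def route(records, partition_count):
    """Append (key, value) records to per-partition lists, in arrival order."""
    partitions = {p: [] for p in range(partition_count)}
    for key, value in records:
        partitions[partition_for(key, partition_count)].append((key, value))
    return partitions


# Hypothetical records: three events for user-1, one for user-2.
records = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "updated"),
    ("user-1", "deleted"),
]
partitions = route(records, partition_count=3)

# Every record with the same key lands in the same partition,
# so its events keep their relative order.
user_1_partition = partition_for("user-1", 3)
```

Ordering is only guaranteed inside a partition, which is why the choice of key matters so much. It also hints at hot spots: a key that is far more frequent than the others overloads one partition.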
If you’re starting with Kafka, definitely consider hiring a trainer for a few days to start on a solid footing (hello!).
One size does not fit all
Unfortunately, Kafka is not the magic solution to all our problems. Just because some architect in our company told us to use Kafka (because it’s hyped) doesn’t mean we should use it. We should think business first.
Does Kafka fit our use-case? Kafka is a durable, distributed and scalable pubsub system: do we need this power?
- With great power comes great responsibility: are we aware of it?
- Do we have sufficient throughput at least?
- Do we want to enrich our data?
- Are we going to expose our data to other services?
- Can we lose messages, are the data critical?
- Does it make sense to have multiple consumers consuming the same data? Or do we want just a queue with some workers processing the task specified in the message and ack the message atomically (definitely not meant for Kafka!)?
- Do we need data retention?
- Are the data business-oriented? Did we think of format evolution?
- Are we going to need the whole Kafka Platform?
So many questions to ask first, to see if Kafka fits the needs.
Sometimes a more classic RabbitMQ is more than enough. Also, a project can use both RabbitMQ and Kafka; they are not mutually exclusive, since they have different semantics and answer different needs.
One configuration does not fit all
If we know we need Kafka, I’m glad! It means it fits one of our use-cases. But Kafka is far from being “one configuration fits all”.
Each cluster has its own set of configurations according to the constraints of its environment (security, addresses, ports, threading, segments, compactions, replicas, rack management…), and exposes default values inherited by all the topics.
But not all topics are meant to answer the same needs, and they have lots of knobs to alter their behavior and how the broker treats them:
- partition count & replication factor
- keys or not, and which one
- retention & compaction
- log rolling
- unclean leader election (on the broker)
Deep thought should be put into the configuration of the brokers and the topics. The “defaults” are just not good enough for real usage: Kafka doesn’t know our exact use-cases.
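One way to force that thinking is to describe each topic as data and review it before it is created. The sketch below is a minimal example under assumptions: `review_topic` and its rules of thumb are hypothetical, not an official checklist. The config keys (`min.insync.replicas`, `cleanup.policy`, `retention.ms`, `unclean.leader.election.enable`) are real topic-level settings.

```python
def review_topic(name, replication_factor, config):
    """Return a list of warnings for a topic definition.

    The rules below are hypothetical rules of thumb, to adapt per use-case.
    """
    warnings = []
    min_isr = int(config.get("min.insync.replicas", 1))

    if replication_factor < 3:
        warnings.append(f"{name}: replication factor {replication_factor} leaves little room for broker failures")
    if min_isr >= replication_factor:
        # With acks=all, writes are rejected once the in-sync replicas
        # drop below min.insync.replicas.
        warnings.append(f"{name}: losing a single replica blocks acks=all producers")
    if config.get("unclean.leader.election.enable", "false") == "true":
        warnings.append(f"{name}: unclean leader election can lose acknowledged data")
    if config.get("cleanup.policy", "delete") == "delete" and "retention.ms" not in config:
        warnings.append(f"{name}: no explicit retention, the broker default applies")
    return warnings


# Hypothetical topic definitions.
orders_warnings = review_topic(
    "orders",
    replication_factor=3,
    config={"min.insync.replicas": "2", "cleanup.policy": "delete", "retention.ms": "604800000"},
)
scratch_warnings = review_topic(
    "scratch",
    replication_factor=1,
    config={"unclean.leader.election.enable": "true"},
)
```

The first definition passes without warnings, while the second one triggers every rule. The exact thresholds depend on how critical the data are.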
Producers and Consumers
We also have a lot of properties on the consumer and producer sides. Those should be carefully set according to the use-cases and semantics of the data they deal with. Forgetting a property could lead to missing data, duplicates, inconsistencies, or worse, unavailability.
- keys / partitions
- acks & min ISR
- retries & idempotence
- batching strategy
- buffer (when Kafka is not available!)
- consumer group
- session timeout
- auto offset reset
- polling strategy
- commit strategy
- processing time
Again, all of that needs to be thought through. Kafka won’t say anything if we don’t set them, it will just fall back to some defaults, but that may burn us one day.
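The commit strategy is a good example of how one choice turns into missing data or duplicates. Below is a small simulation, not real client code: `run` and `with_restart` are hypothetical helpers, the log is a plain list, and the crash is simulated between the "commit" step and the "process" step.

```python
def run(log, committed, sink, commit_first, crash_at=None):
    """Consume log[committed:] and return the new committed offset.

    `crash_at` simulates a crash at that offset, between the two steps
    (commit and process), whatever their order.
    """
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1      # step 1: commit the offset
        else:
            sink.append(log[offset])    # step 1: process the record
        if offset == crash_at:
            return committed            # crash: step 2 never happens
        if commit_first:
            sink.append(log[offset])    # step 2: process the record
        else:
            committed = offset + 1      # step 2: commit the offset
        offset += 1
    return committed


def with_restart(log, commit_first):
    """Run once with a crash at offset 1, then restart from the committed offset."""
    sink = []
    committed = run(log, 0, sink, commit_first, crash_at=1)
    run(log, committed, sink, commit_first)
    return sink


log = ["a", "b", "c"]
at_most_once = with_restart(log, commit_first=True)    # "b" is lost
at_least_once = with_restart(log, commit_first=False)  # "b" is processed twice
```

Committing before processing gives at-most-once delivery, committing after gives at-least-once. With the real consumer, settings such as `enable.auto.commit` decide which side of that trade-off we end up on, sometimes without us noticing.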
Kafka needs a good operational team
People have often asked me whether I have already installed Kafka; I wonder why. Of course I have, and I don’t want to install it anymore. I’d rather have it managed by a dedicated team or better: a managed service. I think it’s a mistake for a team to handle it if they don’t have the proper resources. It’s not just “I’ll unzip it and be done with it”. No, it isn’t.
Kafka is far from being straightforward to operate. It’s not just something we deploy and restart when it crashes (hello Log Cleaner thread!). One needs to understand what is happening.
Proper monitoring must be set up, not only on CPU/memory/network but also on Kafka internals. Kafka (and its producers/consumers) exposes JMX data about partitions, elections, log sizes, rates, and much more. The team managing it must understand what they’re seeing and how to interpret the values. They should be able to reassign partitions, refresh the preferred replica, do rolling upgrades, add or remove brokers, and more.
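Consumer lag is one of the internal values worth understanding. As a minimal sketch, lag per partition is the gap between the end of the log and the last committed offset. The `consumer_lag` helper and the offsets below are hypothetical; in real life these numbers come from the cluster or from a tool such as Burrow.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset - committed offset.

    Simplification: a partition without a committed offset is treated as
    committed at 0, which ignores retention and the reset policy.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


# Hypothetical offsets for a topic with three partitions.
end_offsets = {0: 100, 1: 250, 2: 40}
committed_offsets = {0: 100, 1: 200}

lag = consumer_lag(end_offsets, committed_offsets)
total_lag = sum(lag.values())
```

A lag that stays stable is usually fine. A lag that keeps growing means the consumers cannot keep up, and that is what an alert should watch.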
As time passes, new topics get created, and old topics die… and are not cleaned up! (especially if you are working with Kafka Streams) Hopefully, we have defined a retention period for all of them, but we still end up with tons of topics in our clusters that are meaningless and confusing.
Also, there is a huge security part to handle: encryption, authentication, and authorization. Kafka can deal with SSL, SASL, and has ACLs. What is the impact of enabling them? (e.g. we want SSL? No more zero-copy for us :( It can mean around 30% less throughput.)
To deal with ACLs at scale, one can use my friend Stéphane Maarek’s great Security Manager: https://github.com/simplesteph/kafka-security-manager
Last but not least, Kafka still depends upon Zookeeper for the time being. Zookeeper needs 3 or 5 servers to play nicely with a Kafka cluster that itself needs 3, 5 or more brokers according to its volume of data.
It’s a lot of machines to deal with in the end (ZK and Kafka should not be co-hosted on the same machine in case of failure) and operators should know when it’s time to add a node (or to remove one if it is under-used). Hello Cloud.
Kafka brings a change of mindset
I think one of the most important things when working with Kafka is the change of mindset.
Nowadays, a lot of teams are still working with batches (constrained by old systems), with fat databases containing the single source of truth and in-place UPDATEs. They are used to calling 10 APIs to enrich some data, and to writing classic back-ends exposing a REST API, etc.
Kafka and Kafka Streams bring a lot of concepts and change how we think about our data and our workflow:
- Our data are immutable
- How do we handle DELETE or UPDATE?
- How do we handle GDPR?
- We use reactive programming
- We think about back-pressure
- In Kafka Streams, we avoid calling APIs (KS only deals with synchronous calls today, it does not offer asynchronous processing) and we prefer to join topics. It’s more idiomatic to Kafka and we don’t depend on third-party SLAs. Moreover, we provide idempotency. What’s the impact?
- We don’t need a REST API everywhere (Kafka Streams without IQ). No need to put Spring everywhere.
- How do we expose our data to other services? or to some Java backend? (hint: Kafka Connect)
- How do we do a SELECT WHERE?
- How do we ensure our data are safe? How can we replicate them?
The model of thinking is very different and generally offers a “better” separation and decoupling than what most developers and architects are used to.
We encapsulate enrichment processes, joins, and business rules in distinct Kafka Streams processors that only subscribe to some topics and publish into other topics, and that’s it. No need for any REST API here. No need to call any third-party APIs. From topics to topics (and with exactly-once processing).
Separation also brings more complexity: we have more small pieces to manage now. More failures can occur, more services need to be deployed. We have more (uncontrolled?) side-effects on our machines and our brokers. We can’t respond to ad-hoc queries as we did before with a good ol’ big Oracle database (except if we sink our topic into it, but it has to be done).
On the bright side, we now have real-time results and aggregations for our use-cases. We also have a better separation of concerns: we can easily evolve the small services without affecting the whole stack. We can have multiple versions of a service running side-by-side. We can scale easily. Finally, we are protected from overload because Kafka acts as a buffer.
And last but not least, we have the ability to “reset” the consumption and restart a consumer from the origin of the topics to reprocess all the events. It’s a common strategy (to fill a database, to send everything to a new third-party, to rebuild some state…)
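To illustrate why immutable events plus a “reset” are so handy, here is a minimal sketch where UPDATE and DELETE are just more events, and the current state is rebuilt by replaying the log from the beginning. The event shapes and the `rebuild_state` helper are hypothetical, chosen only for this example.

```python
def rebuild_state(events):
    """Fold an append-only list of (kind, key, payload) events into the current state."""
    state = {}
    for kind, key, payload in events:
        if kind == "deleted":
            state.pop(key, None)                              # a DELETE is just another event
        else:                                                 # "created" or "updated"
            state[key] = {**state.get(key, {}), **payload}    # merge fields, never edit the log
    return state


# Hypothetical event log; nothing in it is ever modified in place.
events = [
    ("created", "user-1", {"name": "Ann", "city": "Paris"}),
    ("created", "user-2", {"name": "Bob", "city": "Lyon"}),
    ("updated", "user-1", {"city": "Nantes"}),
    ("deleted", "user-2", {}),
]

current = rebuild_state(events)
replayed = rebuild_state(events)  # a "reset": same log, same result
```

Since the log never changes, replaying it always gives the same state. That is what makes it safe to fill a new database or rebuild a broken one from the topics.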
Our classic databases are not the single source of truth: Kafka is.
It’s not always desirable but it can be (not everybody uses Event Sourcing).
Kafka Streams brings its own complexity
Do you get the picture above?
Kafka Streams is quite nice to work with (especially with the Scala wrapper, thanks to better compile-time type checking) and brings nice abstractions, but it has its own constraints and downsides too. Kafka Streams is just a library; it’s not a framework such as Spark or Flink, where the “processors” are deployed on dedicated worker machines and monitored out of the box. There is no such thing with Kafka Streams: no free monitoring.
When writing Kafka Streams, we don’t write our consumers or producers, Kafka Streams deals with them internally. We only focus on the business value we need to add to our data through transformations and enrichments.
Like Kafka, Kafka Streams has its own semantics to learn (KStream, KTable, duality, windowing strategies, joins, statefulness, exactly-once). They must be understood if we don’t want to make mistakes or have uncontrolled side-effects. Just because the program compiles doesn’t mean it does what we think.
With Kafka Streams, we have the same power as databases: we can (inner/left/outer) join topics! … under certain conditions. For instance, we can’t just join 2 infinite KStreams like that, we need to provide a windowing strategy. There is no need for one when we join a KStream with a KTable: it works without it, because the table always holds the latest value per key. The concept of windowing, and how the data are going to be processed (tumbling, hopping, sliding, session, unlimited windows), can be delicate to assimilate.
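Windowing gets less abstract with a bit of arithmetic. The sketch below only assigns timestamps to windows, assuming epoch-aligned windows with an inclusive start and an exclusive end. The helper names are hypothetical and this is not the Kafka Streams API.

```python
from collections import Counter


def tumbling_window(ts_ms, size_ms):
    """Return the single [start, end) window of length size_ms containing ts_ms."""
    start = ts_ms - ts_ms % size_ms
    return (start, start + size_ms)


def hopping_windows(ts_ms, size_ms, advance_ms):
    """Return every [start, end) window containing ts_ms.

    Windows have length size_ms and a new one starts every advance_ms,
    so they overlap when advance_ms < size_ms.
    """
    windows = []
    start = ts_ms - ts_ms % advance_ms            # latest window starting at or before ts_ms
    while start >= 0 and start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= advance_ms
    return sorted(windows)


def count_per_window(timestamps, size_ms):
    """Count events per tumbling window."""
    return Counter(tumbling_window(ts, size_ms) for ts in timestamps)


# Hypothetical event timestamps, in milliseconds.
counts = count_per_window([1_000, 4_000, 12_000], size_ms=10_000)
overlapping = hopping_windows(12_000, size_ms=10_000, advance_ms=5_000)
```

A tumbling window puts each event in exactly one bucket. A hopping window can put the same event in several overlapping buckets, which is why its aggregates cost more state.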
When joining, both topics need the same keys and partition count, we can’t just join on any field (the value is totally abstract for Kafka Streams). If we don’t have the same structure, we’ll need to “copy” one topic to a new repartitioned/rekeyed topic for the join to work. Yes, we’re going to duplicate data for technical reasons (hence more partitions in the brokers).
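Here is a minimal sketch of that “copy to a rekeyed topic, then join” dance, using plain lists and dicts. The `rekey` and `stream_table_join` helpers and the sample data are hypothetical. The table is a static dict here, whereas a real KTable keeps changing over time.

```python
def rekey(records, key_fn):
    """Derive a new key from each value (in real life, a new repartitioned topic)."""
    return [(key_fn(value), value) for _, value in records]


def stream_table_join(stream, table):
    """Inner join: enrich each stream record with the table's value for the same key."""
    return [
        (key, {**value, **table[key]})
        for key, value in stream
        if key in table
    ]


# Hypothetical data: orders are keyed by order id, customers by customer id.
orders = [
    ("order-1", {"customer_id": "c-1", "amount": 30}),
    ("order-2", {"customer_id": "c-9", "amount": 5}),
]
customers = {"c-1": {"name": "Ann"}}

# The keys differ, so the orders must be rekeyed before the join can work.
orders_by_customer = rekey(orders, lambda order: order["customer_id"])
joined = stream_table_join(orders_by_customer, customers)
```

The rekeyed list is a full copy of the orders, which is exactly the duplication described above. With an inner join, the order whose customer is unknown is dropped.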
We must ensure our KTable keyspaces are stable in size, otherwise they will grow forever and our world will collapse. It’s also possible to inject a pseudo-DELETE upstream (a record with the key and a null value, also called a tombstone) to remove the data we want from our KTable. It’s a bit annoying to handle deletions this way, but that’s the way for stateful stream processing.
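The effect of a tombstone is easy to see on a toy compacted log. In this sketch, under simplifying assumptions, only the latest record per key is kept and a key whose latest value is `None` disappears. The `compact` helper is hypothetical. A real broker compacts in the background, skips the active segment, and keeps tombstones around for a while (see `delete.retention.ms`).

```python
def compact(log):
    """Keep the latest record per key; drop keys whose latest value is None (a tombstone)."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)          # later records overwrite earlier ones

    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None                   # the tombstone removes the key entirely
    )
    return [(key, value) for _, key, value in survivors]


# Hypothetical changelog: k2 is deleted by a tombstone.
log = [
    ("k1", "a"),
    ("k2", "b"),
    ("k1", "c"),
    ("k2", None),
    ("k3", "d"),
]
compacted = compact(log)
```

If nobody ever sends tombstones and new keys keep arriving, the compacted log (and the KTable built from it) only grows.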
The Kafka Streams API is great because it abstracts a lot of complexity. Still, we need to understand this complexity, because Kafka Streams creates tons of new topics (changelogs, repartitions) and local databases under the hood to keep some state. The impact on the machines hosting the applications (CPU/memory/IO pressure) but also on the brokers (more partitions) must be planned for.
The deployment strategy is also impacted: now that each of our applications is stateful, we need to handle the state of the application (it’s like a distributed database). When we are dealing with application containerization, we can choose to ignore this state and let Kafka Streams rebuild it from scratch. That will take some time, during which the application won’t be available, unless we have standby replicas.
Finally, the Kafka Streams API can be misleading and cause subtle bugs after refactoring (like referential transparency issues with .transform; no, I don’t read the doc, I just read the types). We also have to deal with new exceptions at runtime that can be very complicated to interpret, understand and fix (hello UNKNOWN_PRODUCER_ID, related to transactions).
Avro is a first-class citizen
My friend Avro. Avro is a good guy, full of resources, and can adapt to everyone, but we have to know how to speak to Avro.
One day, when working with Kafka, we’ll definitely want to use Avro instead of our mighty JSON. Avro is “just” a serialization format, nothing to worry about. It offers a very dense binary serialization, saves tons of space, relieves pressure on the network, is faster to serialize/deserialize, and handles schema compatibility: all great points we all need. Again, Avro comes with its own API, its constraints, subtleties, and tools.
With Kafka, it comes with a Schema Registry: another application that exposes the schemas of the data, because they are not contained in the messages themselves. It’s another application that must be highly available. It becomes a critical part of the architecture. Without it, we are not able to publish or consume Avro data anymore.
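To see why the registry is on the critical path, it helps to look at what a message carries. As far as I know, the Confluent serializers prefix each message with a magic byte (0) and a 4-byte big-endian schema id, followed by the Avro-encoded bytes. The sketch below only splits that header. `split_confluent_message` is a hypothetical helper, and the payload bytes are an opaque placeholder.

```python
import struct

MAGIC_BYTE = 0  # first byte of a message framed by the Confluent serializers


def split_confluent_message(payload: bytes):
    """Return (schema_id, avro_bytes) for a Confluent-framed message."""
    if len(payload) < 5 or payload[0] != MAGIC_BYTE:
        raise ValueError("not a Confluent-framed Avro message")
    (schema_id,) = struct.unpack(">I", payload[1:5])  # 4-byte big-endian schema id
    return schema_id, payload[5:]


# Hypothetical message: schema id 42, followed by placeholder Avro bytes.
message = bytes([MAGIC_BYTE]) + struct.pack(">I", 42) + b"\x02\x06Ann"
schema_id, avro_bytes = split_confluent_message(message)
```

The message only carries an id. To decode the remaining bytes, a consumer has to fetch the matching schema from the registry (and usually caches it), so a registry outage quickly becomes everyone’s outage.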
The Governance around Kafka
A last point I raised was around the future of a platform/company relying upon Kafka: what about its governance?
Is Kafka going to stay private within the service, like an implementation detail, or is this going to be exposed to other services in the company?
If yes, do we have some common enterprise data model everyone agrees on? How are we going to name the topics? Does each country have its own cluster, or will we pool resources instead? How are we going to audit the accesses and deal with ACLs?
There are multiple strategies when dealing with Kafka in a company:
- Either we provide a big fat cluster with a good operational team, providing proper security with ACLs, and let the company services produce/consume to/from it.
- Or we can have several Kafka clusters, each of them managed by the service that needs it, exposing its data through other means (like sinking to a DB and exposing some REST API).
- Finally, we could go with a hybrid solution: a large cluster in the company and tiny clusters managed by each service. Then we can mirror only certain topics from the small clusters to the large secured cluster.
OK, that was a bit longer than I expected! I told you, I know a thing or two. Still, I hope you learned something.
If you’re not already well-versed in Kafka, I hope your mindset has been a bit “updated” to consider all those facts, strategies and impacts Kafka can have in a company. Using Kafka must clearly be thought through ahead of time; it’s not just a roll of the dice. It comes with several tools that must be combined to enjoy the power of Kafka.
Once you understand it and have your mindset bent, then it’s a delicious pleasure to work with it. I wouldn’t go back!
Kafka is a platform, a strategy and a mindset, altogether. '