In this post I will give an introduction to Kafka, covering what it is and what you might use it for, and I'll then explain how I used it to decouple the two sides of a smart metering power data submission system. The goal is for the introduction to Kafka itself to be generally useful, while the specific approach shown will translate to some but not all Kafka uses.
In the previous post we went through the necessary theory and introduced the popular streaming frameworks from the Apache landscape: Storm, Trident, Spark Streaming, Samza and Flink. Today, we're going to dig a little deeper into topics like fault tolerance, state management and performance. In addition, we're going to discuss guidelines for building distributed streaming applications, and I'll give recommendations for particular frameworks.
A couple of months ago we discussed the reasons behind the increasing demand for distributed stream processing. I also noted that there were a number of available frameworks to address it. Now it's time to have a look at them, discuss their similarities and differences, and cover what are, in my opinion, their recommended use cases.
In the previous post we discussed the reasons behind the rising demand for stream processing and gave a theoretical introduction to the area. Today we are going to talk about Apache Spark Streaming. I want to focus on implementation trade-offs, their consequences and interesting issues we may face. Apart from that, we are going to cover its intended use cases, available support and known production deployments. The post is all about Spark Streaming's traits and not-so-obvious properties. No "hello worlds" today.
The demand for stream processing is increasing. Immense amounts of data have to be processed quickly from a rapidly growing set of disparate data sources, which pushes the limits of traditional data processing infrastructures. Stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.