The demand for stream processing is increasing. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
A number of powerful, easy-to-use open-source platforms have emerged to address this. But these platforms may solve the same problem in different ways, target various (and sometimes overlapping) use cases, or use different vocabularies for similar concepts. This can lead to confusion, longer development times, or costly wrong decisions.
In this blog post series I’m going to describe various state-of-the-art open-source distributed streaming frameworks: their similarities and differences, implementation trade-offs, and intended use cases. Apart from that, I’m going to cover Fast Data, the theory of streaming, framework evaluation, and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers pick the right one for their use cases.
In this post, I’ll start with a basic introduction to the area. In the following posts, I’m going to discuss various distributed streaming frameworks like Spark Streaming, Storm, Flink, Samza, Heron, etc. If you have any recommendations or tips, don’t hesitate to drop me an email and I’ll be more than happy to have a look.
Big Data and Fast Data
Big Data is one of the most popular terms nowadays, but Big Data is not only about volume. Much of the data is received in real time and is most valuable at the time of arrival. For example, we want to detect stock market trends as soon as possible; a service operator wants to detect failures from logs within seconds; and a news site may want to train its model to show users the content they are interested in.
Big Data is often characterised as Volume + Velocity + Variety. Volume describes the quantity of data; it is the size of the data which determines its value and potential. Variety is the next aspect of Big Data: it describes the range of data types and sources. Velocity, in this context, refers to how fast the data is generated and how fast it must be processed to meet demand. Sometimes Big Data is also characterised by Variability (a factor which can be a problem for those who analyse the data), Veracity (the accuracy of the analysis) or Complexity (how complex the data is to process), but for our needs we are going to stick with the original 3V definition. This is shown in the picture below.
We are going to focus on stream processing, sometimes referred to as fast data, where velocity is the key.
The Definition of Stream Processing
‘Streaming’ is a commonly used word and can mean many things. In the context of application development, streaming could refer to various implementations of lazy lists - which is definitely a helpful feature - but we’re not interested in that right now. Also, streams, especially these days, often refer to Akka’s implementation of Reactive Streams. At a quick glance, Reactive Streams seem to fulfill the use case we want - unfortunately, with one very significant flaw: they are not, at least for now, distributed, which means we are stuck on a single JVM. The ability to distribute computation across multiple nodes is crucial for us. Apart from that, we’re going to consider open-source solutions only. That was the very last limitation, so let’s start with the theory now.
There are three main approaches to processing data - Batch Processing, Stream Processing and Micro-Batching.
Batch Processing
- the generally familiar concept of processing data en masse
- has access to all data
- might compute something big and complex
- more concerned with throughput than latency
- higher latencies (even in minutes)
- typical example is Hadoop’s MapReduce
Stream Processing
- a one-at-a-time processing model
- data are processed immediately upon arrival
- computations are relatively simple and generally independent
- sub-second latency
- difficult to maintain state efficiently
- typical example is Apache Storm Core
Micro-Batching
- a special case of batch processing with very small batch sizes
- a mix between batching and streaming
- latency in seconds
- easier windowing and stateful computation
- typical example is Spark Streaming
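To make the contrast concrete, here is a toy sketch in plain Python (no framework involved; the function names are mine) that computes a sum over the same events in batch, one-at-a-time, and in micro-batches:

```python
# Toy illustration of the three processing models on the same data.
events = [3, 1, 4, 1, 5, 9, 2, 6]

# Batch: wait for all data, then compute over the whole set at once.
def batch_sum(all_events):
    return sum(all_events)

# Stream: process each event immediately as it arrives, emitting a result per event.
def stream_sums(event_iter):
    total = 0
    for event in event_iter:
        total += event          # simple, independent per-event computation
        yield total             # result available right away

# Micro-batch: group arrivals into small batches and process each batch.
def micro_batch_sums(event_iter, batch_size=3):
    batch = []
    for event in event_iter:
        batch.append(event)
        if len(batch) == batch_size:
            yield sum(batch)
            batch = []
    if batch:                   # flush the final partial batch
        yield sum(batch)

print(batch_sum(events))               # 31, but only once all data has arrived
print(list(stream_sums(events)))       # [3, 4, 8, 9, 14, 23, 25, 31]
print(list(micro_batch_sums(events)))  # [8, 15, 8]
```

The batch result is complete but late; the stream emits partial results immediately; the micro-batch trades a few seconds of latency for simpler per-batch computation.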
Processing high-velocity data covers two different use cases - Distributed Stream Processing (SP), sometimes called Event Stream Processing (ESP), and Distributed Complex Event Processing (CEP). For our needs I’ll define them as follows.
SP/ESP is stateless, straight-through processing of incoming data in a distributed fashion using ‘continuous queries’ (basically queries that forever process arriving data with given parameters). This happens without any I/O or data storage. Data streams can be transformed, joined, filtered, etc. according to a defined topology. The final state is usually stored somewhere for further processing. A typical use case is the ability to continuously calculate mathematical or statistical analytics on the fly within the stream - for example, a notification of an ascending/descending trend when showing stock performance. Stream processing solutions are designed to handle high volume in real time with a scalable, highly available and fault-tolerant architecture. This enables analysis of data in motion.
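The stock trend example can be sketched as a tiny continuous query in plain Python: each incoming price tick is compared to the previous one and a notification is emitted immediately, with no I/O or storage involved. This is a toy of my own, not any particular engine’s API:

```python
# A toy 'continuous query': for each arriving price tick, emit a trend
# notification by comparing it to the previous tick. Runs forever on a
# real stream; here we feed it a finite list for illustration.
def trend_notifications(price_stream):
    previous = None
    for price in price_stream:
        if previous is not None:
            if price > previous:
                yield (price, "ascending")
            elif price < previous:
                yield (price, "descending")
            else:
                yield (price, "flat")
        previous = price

ticks = [100.0, 101.5, 101.5, 99.0]
print(list(trend_notifications(ticks)))
# [(101.5, 'ascending'), (101.5, 'flat'), (99.0, 'descending')]
```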
CEP is stateful, micro- or mini-batched complex processing where state maintenance is involved. State is persisted - not necessarily to disk, it could be held in memory or somewhere else. CEP engines are optimised to process discrete events in defined topologies. The CEP outcome could be literally anything - feeding another system with calculated data, persisting valuable information, or just storing data for further processing. A classic example is Twitter’s trending tweets for a defined hashtag within a given time window.
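The trending-hashtags example shows the stateful, windowed nature of CEP nicely. Here is a deliberately simplified sketch (my own toy, with event time in seconds and tumbling windows that close when a newer event arrives) that keeps per-window counts as state and emits the most frequent tag per window:

```python
from collections import Counter

# CEP-style stateful computation: count hashtags per tumbling time window
# and emit the 'trending' (most frequent) tag for each window.
def trending_per_window(tweets, window_seconds=60):
    """tweets: iterable of (timestamp_seconds, hashtag) pairs, in time order."""
    window_start = None
    counts = Counter()          # the state we maintain across events
    for ts, tag in tweets:
        if window_start is None:
            window_start = ts
        if ts - window_start >= window_seconds:
            yield counts.most_common(1)[0]   # close the window, emit its winner
            counts = Counter()
            window_start = ts
        counts[tag] += 1
    if counts:                  # emit the final, still-open window
        yield counts.most_common(1)[0]

stream = [(0, "#bigdata"), (10, "#fastdata"), (20, "#bigdata"),
          (70, "#streams"), (80, "#streams")]
print(list(trending_per_window(stream)))
# [('#bigdata', 2), ('#streams', 2)]
```

A real CEP engine would persist this state (in memory or elsewhere) so it survives failures, which is exactly the hard part that frameworks exist to solve.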
We cannot treat these approaches as mutually exclusive. A CEP system can be built on top of an ESP system, and in some areas the terms are even used interchangeably or defined a little differently. Even with our definition it is very easy to find systems which fulfill both: Apache Storm Core is ideal for ESP, but with the Trident abstraction it also fulfills CEP.
The Eight Requirements for Stream Processing
This very interesting paper outlines eight requirements that a system should meet to excel at a variety of real-time stream processing applications. Its goal is to provide high-level guidance on what to look for when evaluating stream processing solutions. If you don’t want to read the whole paper, here is a summary of its core idea - the requirements for stream processing:
Keep the Data Moving
The first requirement for a real-time stream processing system is to process messages “in-stream”, without any requirement to store them to perform any operation or sequence of operations. Ideally the system should also use an active (i.e., non-polling) processing model.
Query using SQL on Streams
The second requirement is to support a high-level “StreamSQL” language with built-in extensible stream oriented primitives and operators.
Handle Stream Imperfections (Delayed, Missing and Out-of-Order Data)
The third requirement is to have built-in mechanisms to provide resiliency against stream "imperfections", including missing and out-of-order data, which are commonly present in real-world data streams.
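One common way engines handle out-of-order data is to buffer events briefly and release them in timestamp order once a "watermark" guarantees nothing older can still arrive. The sketch below is my own simplification of that idea (the `max_delay` parameter is mine, not from any framework):

```python
import heapq

# Buffer incoming events in a min-heap keyed by event time, and only release
# an event once we have seen a timestamp at least `max_delay` newer - a crude
# watermark. Assumes no event arrives more than max_delay late.
def reorder(events, max_delay):
    """events: iterable of (event_time, payload); yields them in time order."""
    buffer = []
    newest = None
    for ts, payload in events:
        heapq.heappush(buffer, (ts, payload))
        newest = ts if newest is None else max(newest, ts)
        # release everything old enough to be safely ordered
        while buffer and buffer[0][0] <= newest - max_delay:
            yield heapq.heappop(buffer)
    while buffer:               # input ended: drain whatever remains
        yield heapq.heappop(buffer)

arrived = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
print(list(reorder(arrived, max_delay=2)))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

The trade-off is visible even in the toy: a larger `max_delay` tolerates later arrivals but adds latency, since every event waits in the buffer that much longer.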
Generate Predictable Outcomes
The fourth requirement is that a stream processing engine must guarantee predictable and repeatable outcomes.
Integrate Stored and Streaming Data
The fifth requirement is to have the capability to efficiently store, access, and modify state information, and combine it with live streaming data. For seamless integration, the system should use a uniform language when dealing with either type of data.
Guarantee Data Safety and Availability
The sixth requirement is to ensure that the applications are up and available, and the integrity of the data maintained at all times, despite failures.
Partition and Scale Applications Automatically
The seventh requirement is to have the capability to distribute processing across multiple processors and machines to achieve incremental scalability. Ideally, the distribution should be automatic and transparent.
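The usual mechanism behind transparent distribution is key-based partitioning: each event is routed to a worker by hashing its key, so all events for the same key land on the same node. A toy sketch (the hash here is deliberately naive and mine; real systems use a stable hash like murmur, and Python's built-in `hash` is not stable across processes):

```python
# Route each event to a partition by hashing its key, so events with the
# same key are always processed by the same worker.
def partition_for(key, num_partitions):
    return sum(key.encode()) % num_partitions   # stable toy hash

events = [("user42", "click"), ("user7", "view"), ("user42", "buy")]
for key, action in events:
    print(key, action, "-> partition", partition_for(key, 4))
# Every event for 'user42' goes to the same partition.
```

This is what lets a system scale incrementally: adding nodes means adding partitions, while per-key ordering and state stay local to one worker.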
Process and Respond Instantaneously
The eighth requirement is that a stream processing system must have a highly-optimized, minimal overhead execution engine to deliver real-time response for high-volume applications.
These rules serve to illustrate the features required of any system that will be used for high-volume, low-latency stream processing applications.
That’s all for today, folks. We went through the necessary theory, and in the next post we will have a look at the first of many available frameworks - Apache Spark Streaming. I hope you find it interesting - stay tuned.
Stream processing has a great future and will become very important for most companies. Big Data and Internet of Things are huge drivers of this change.