In the previous post we discussed the reasons behind the rising demand for stream processing and gave a theoretical introduction to the area. Today we are going to talk about Apache Spark Streaming. I want to focus on implementation trade-offs, their consequences and the interesting issues we may face. Apart from that, we are going to cover its intended use cases, available support and known production deployments. The post is all about Spark Streaming's traits and its less obvious properties. No "hello worlds" today.
Graph data and graph processing have been getting more and more attention lately in various fields. It has become apparent that a large number of real-world problems can be described in terms of graphs, for instance the Web graph, the social network graph, the train network graph and the language graph. Often these graphs are exceptionally large. Take the Web graph, for example: it is estimated that the number of web pages may have exceeded 30 billion. We need a system that is able to process the graphs created by modern applications.
Many algorithms, especially those with high computational complexity or those working with large amounts of data, may take a long time to complete. There are many ways to express algorithms in different environments: single-threaded, parallel and concurrent, and distributed. In this blog post I will focus on the relationship between them and on the advantages and disadvantages of the distributed environment. The main focus will be on Apache Spark and the optimisation techniques it applies to computations that its users define in a distributed environment.
The demand for stream processing is increasing. Immense amounts of data from a rapidly growing set of disparate data sources have to be processed quickly. This pushes the limits of traditional data processing infrastructures. Stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many others.
The machine learning pipelining API for Apache Spark was released in December 2014 in version 1.2. The available resources only present the same simple examples. But how does it work in practice, what are its strengths and weaknesses, and is it ready for production use? This blog post will try to answer these questions.