Many algorithms, especially those with high computational complexity or those working with large amounts of data, may take a long time to complete. Algorithms can be expressed in several kinds of environments: single-threaded, parallel, concurrent, and distributed. In this blog post I will focus on the relationship between them and on the advantages and disadvantages of the distributed environment. The main focus will be on Apache Spark and the optimisation techniques it applies to computations that its users define in a distributed environment.
The demand for stream processing is increasing. Immense amounts of data from a rapidly growing set of disparate sources have to be processed quickly, which pushes the limits of traditional data processing infrastructures. Stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many others.
The machine learning pipelining API for Apache Spark was released in December 2014 in version 1.2. The available resources only present the same simple examples. But how does it work in practice, what are its strengths and weaknesses, and is it ready for production use? This blog post will try to answer these questions.