Martin Zapletal

Recent Posts

Posted by Martin Zapletal
Sun, Nov 20, 2016

Introduction

In this series of posts I will discuss the evolution of machine learning algorithms with regard to scaling and performance. We will start with a naive implementation and progress through more advanced solutions, finally reaching state-of-the-art implementations similar to what companies like Google, Netflix and others use for their data pipelines, recommendation systems or machine learning. A variety of topics will be discussed, from the basics of ML, different programming models and the impact of a distributed environment to the specifics of machine learning algorithms as compared to common business applications, and much more. For those not particularly interested in machine learning, the concepts discussed were chosen carefully to apply to a wide range of applications, with ML itself serving as a good example.

In my previous blog post we looked into neural networks and their training, and investigated a trivial single-threaded object-oriented implementation. The result was a working example that was, however, not useful in many real-world scenarios because of its poor performance. With large amounts of data such an approach is extremely wasteful, and we can achieve vastly better performance through parallelization.
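As a rough illustration of the kind of parallelization meant here: the per-example work in gradient-based training is independent, so it can be partitioned across threads and combined at the end. The sketch below is a minimal, hypothetical example in plain Scala using Futures — the `Example` type and `gradient` function are illustrative, not code from the post:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Illustrative squared-error gradient for a linear model: the gradient
// contribution of each training example can be computed independently.
case class Example(features: Vector[Double], label: Double)

def gradient(w: Vector[Double], e: Example): Vector[Double] = {
  val err = w.indices.map(i => w(i) * e.features(i)).sum - e.label
  e.features.map(_ * err)
}

def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
  a.indices.map(i => a(i) + b(i)).toVector

// Single-threaded: one sequential pass over all examples.
def sumGradients(w: Vector[Double], data: Seq[Example]): Vector[Double] =
  data.map(gradient(w, _)).reduce(add)

// Data-parallel: partition the examples, compute partial sums
// concurrently, then combine the partial results.
def sumGradientsPar(w: Vector[Double], data: Seq[Example], partitions: Int): Vector[Double] = {
  val chunks   = data.grouped(math.max(1, data.size / partitions)).toSeq
  val partials = chunks.map(c => Future(c.map(gradient(w, _)).reduce(add)))
  Await.result(Future.sequence(partials), 1.minute).reduce(add)
}
```

The map/reduce shape of `sumGradientsPar` — independent per-partition work combined by an associative operation — is the same structure that the distributed implementations discussed later in the series scale out to a cluster.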

Posted by Martin Zapletal
Mon, Nov 14, 2016

Welcome to a new edition of #ThisWeekInScala!

This blog aims to keep you up to date with the latest news from the world of Scala and Reactive programming.

Posted by Martin Zapletal
Sat, Oct 1, 2016

Introduction

In this series of posts I will discuss the evolution of machine learning algorithms with regard to scaling and performance. We will start with a naive implementation and progress through more advanced solutions, finally reaching state-of-the-art implementations similar to what companies like Google, Netflix and others use for their data pipelines, recommendation systems or machine learning. A variety of topics will be discussed, from the basics of ML, different programming models and the impact of a distributed environment to the specifics of machine learning algorithms as compared to common business applications, and much more. For those not particularly interested in machine learning, the concepts discussed were chosen carefully to apply to a wide range of applications, with ML itself serving as a good example.

Although big data analytics and machine learning are very old concepts, their importance is steadily increasing. One of the reasons is the improving accessibility of tools and decreasing prices, and therefore the ability to access, store, process and use large amounts of data. And data are key to many use cases, from optimizing standard business processes, to finding and opening new business opportunities, to completely transforming businesses.

Throughout this series of blog posts we will touch on many topics, from machine learning, functional programming and parallel programming to distributed systems theory. I will start with a brief introduction to the different programming models, followed by an abstract description of single-machine, parallel and distributed computation, common data processing architectures, pipelines and technology stacks, before getting to the actual focus of the blog post. Feel free to skip ahead to the chapter Perceptron if you want.

Posted by Martin Zapletal
Sun, Sep 27, 2015

Posted by Martin Zapletal
Tue, Sep 1, 2015

Welcome to a new edition of #ThisWeekInScala!

This blog aims to keep you up to date with the latest news from the world of Scala and Reactive programming.

Posted by Martin Zapletal
Sun, Jul 19, 2015

Many algorithms, especially those with high computational complexity or those working with large amounts of data, may take a long time to complete. Algorithms can be expressed in many different environments: single-threaded, parallel and concurrent, and distributed. In this blog post I will focus on the relationship between them and on the advantages and disadvantages that the distributed environment provides. The main focus will be on Apache Spark and the optimisation techniques it applies to computations defined by its users in a distributed environment.

Posted by Martin Zapletal
Wed, Jul 1, 2015

The machine learning pipelining API for Apache Spark was released in December 2014 in version 1.2 [1]. The available resources [2], [3], [4] and [5] only present the same simple examples. But how does it work in practice, what are its strengths and weaknesses, and is it ready for production use? This blog post will try to answer these questions.

Posted by Martin Zapletal
Fri, Apr 3, 2015

Posted by Martin Zapletal
Sun, Mar 8, 2015

Concepts such as event sourcing and CQRS allow an application to store all events that happen in the system using a persistence mechanism. The events cannot be mutated, and the current state of the system at any point in history can be reconstructed by replaying all the events up to that point. For performance reasons, the state can obviously be cached using a snapshot. But the indisputable advantage of this approach is that the whole history of events (including user actions, behaviour or system messages - anything we decide to store) is available to us rather than just the current state. Event sourcing was thoroughly discussed before, for instance in [1] or [2], and CQRS in [3], [4] or [5].
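The core idea — immutable events, state as a fold over the event log, snapshots as cached prefixes — can be sketched in a few lines of plain Scala. The account domain below is purely illustrative, not taken from any of the cited references:

```scala
// Illustrative event-sourced account: events are immutable facts and
// are only ever appended to the log, never mutated.
sealed trait Event
case class Deposited(amount: BigDecimal) extends Event
case class Withdrawn(amount: BigDecimal) extends Event

case class Account(balance: BigDecimal) {
  def applyEvent(e: Event): Account = e match {
    case Deposited(a) => copy(balance = balance + a)
    case Withdrawn(a) => copy(balance = balance - a)
  }
}

// The current state is a left fold over the full event log ...
def replay(log: Seq[Event]): Account =
  log.foldLeft(Account(BigDecimal(0)))(_ applyEvent _)

// ... and the state at any point in history is a fold over a prefix.
def replayUntil(log: Seq[Event], n: Int): Account =
  replay(log.take(n))

// A snapshot caches the fold up to some point in the log, so that only
// the tail of the log has to be replayed afterwards.
def fromSnapshot(snapshot: Account, tail: Seq[Event]): Account =
  tail.foldLeft(snapshot)(_ applyEvent _)
```

Because `applyEvent` is a pure function of state and event, replaying from the beginning and replaying from a snapshot are guaranteed to agree — which is what makes snapshots a safe, purely performance-motivated optimisation.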

In this post we will discuss how we can store and further use these data by connecting Akka, Cassandra and Spark, focusing mostly on the configuration, Akka serialization and the Akka-analytics project. Later I will follow up with another blog post building on top of this one, with an example of using machine learning techniques to obtain insights that help optimize future decisions and application workflow.

Posted by Martin Zapletal
Sun, Feb 8, 2015

In one of my previous blog posts I introduced MLlib, Apache Spark's machine learning library. It discussed the basics of MLlib's API, machine learning vocabulary and linear regression (http://www.cakesolutions.net/teamblogs/spark-mllib-linear-regression-example-and-vocabulary). Today I will take a slightly deeper look at Spark's internals and its programming model - the options it provides to a programmer for implementing and parallelising algorithms. I will demonstrate it on an implementation of a parallel pool adjacent violators solution to isotonic regression. The code was submitted as a pull request to Spark and should be included in Spark 1.3 when it is released.
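For readers unfamiliar with the algorithm, a sequential, unweighted pool adjacent violators sketch in plain Scala looks roughly like this (illustrative only - the actual Spark implementation is weighted and parallelised):

```scala
import scala.collection.mutable.ArrayBuffer

// Pool adjacent violators: fits the closest (least-squares)
// non-decreasing sequence to the input. Each block holds the pooled
// sum and count of the points merged into it; whenever the last two
// blocks violate monotonicity, they are merged into their mean.
def pava(y: Seq[Double]): Seq[Double] = {
  case class Block(sum: Double, count: Int) { def mean: Double = sum / count }
  val blocks = ArrayBuffer.empty[Block]
  for (v <- y) {
    blocks += Block(v, 1)
    // Merge backwards while the last two blocks are out of order.
    while (blocks.length > 1 && blocks(blocks.length - 2).mean > blocks.last.mean) {
      val b = blocks.remove(blocks.length - 1)
      val a = blocks.remove(blocks.length - 1)
      blocks += Block(a.sum + b.sum, a.count + b.count)
    }
  }
  // Expand each block back to one fitted value per input point.
  blocks.flatMap(b => Seq.fill(b.count)(b.mean)).toSeq
}
```

The parallel variant roughly follows the same idea at two levels: pool adjacent violators is run independently on each partition of the data, and then once more over the concatenated partial results.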
