JOSÉ RAMÓN HERVÁS | 08/02/2019
Most of us who work in Big Data came to the field after hearing about the famous Vs. There were originally three: volume, variety and velocity; these became four with the addition of value, driven by Machine Learning and Artificial Intelligence. And in some places they are already talking about as many as 10, what a challenge! Don't worry, I'm not going to explain them again.
Of the three initial Vs, which gave rise to this new trend or technology, the one which has changed the most over the last few years is velocity. On the first Apache Hadoop platforms, velocity meant that, when processing had to be accelerated (oh, those map and reduce functions, good times!), you increased the number of nodes to spread the load and so process the information in less time.
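To make the map and reduce idea concrete, here is a minimal pure-Python sketch of a MapReduce-style word count. It is only an illustration of the programming model, not Hadoop's API: each chunk stands in for the slice of data one node would process, so spreading the load means giving each node its own chunk.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one node's chunk of the input."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each chunk stands in for the data one node would process in parallel;
# adding nodes means adding chunks processed at the same time.
chunks = ["big data big", "data velocity"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'velocity': 1}
```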
Different projects emerged on top of Hadoop, in what is known as its ecosystem, to try to meet this type of requirement. One example is HBase, a distributed, column-oriented NoSQL database built on Hadoop's HDFS that made it possible to read and write data in real time thanks to its low latency. It is true that this solved certain problems, such as checking metrics or KPIs in real time that could then be shown in a scorecard. However, it wasn't, and isn't, enough.
Thus, large corporations began internal projects that would eventually be made public and become leading Apache projects. One example, from 2011, is LinkedIn's Apache Kafka: a low-latency storage system which can ingest any type of information while guaranteeing data availability, fault tolerance and platform scalability. And that was just its early days, because Kafka now has a whole ecosystem around it: Kafka Streams, Schema Registry... (I'm loving it).
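The core idea behind Kafka can be sketched as an append-only log with consumer offsets. The `MiniLog` class below is a made-up, in-memory toy that mimics a single topic partition; a real deployment would use a Kafka client library against a running broker.

```python
class MiniLog:
    """Toy sketch of a single Kafka topic partition: an append-only log
    where records are never overwritten and consumers track an offset."""
    def __init__(self):
        self.records = []

    def produce(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the offset assigned to this record

    def consume(self, offset, max_records=10):
        """Read records starting at a consumer's current offset."""
        return self.records[offset:offset + max_records]

log = MiniLog()
log.produce({"sensor": "s1", "value": 21.5})
log.produce({"sensor": "s2", "value": 19.8})
print(log.consume(offset=0))  # both records, in arrival order
```

Because nothing is deleted on read, many independent consumers can read the same log at their own pace, which is what gives Kafka its fault tolerance and scalability story.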
At the same time, other projects emerged from research. European research projects such as Stratosphere set out to develop the next generation of analytical and stream-processing tools. In 2014 this project would become the technology that we know today as Apache Flink.
In this way, with tools of both kinds available, what is known as the Lambda architecture was defined toward the end of 2013. This architecture combines batch processing with a speed layer to help solve the problems related to this requirement.
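A minimal sketch of the Lambda idea, with invented functions and data: a batch layer recomputes a view from the full history, a speed layer covers events the last batch run has not seen, and a serving layer merges the two. Note that the batch and speed layers must implement the same aggregation logic twice, which is precisely the maintenance problem discussed next.

```python
def batch_view(events):
    """Batch layer: recompute totals from the full (immutable) master data set."""
    totals = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return totals

def merge_views(batch, realtime):
    """Serving layer: combine the batch view with the speed layer's
    incremental view covering events since the last batch run."""
    merged = dict(batch)
    for user, amount in realtime.items():
        merged[user] = merged.get(user, 0) + amount
    return merged

history = [("ana", 10), ("ana", 5), ("luis", 7)]
recent = {"ana": 2}  # produced by the speed layer since the last batch run
print(merge_views(batch_view(history), recent))  # {'ana': 17, 'luis': 7}
```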
However, this architecture has several issues, the main one being: is it really necessary to maintain, with everything that entails, the code of two complex distributed systems that must produce the same result?
Why not improve the system as a whole and process all information as a data stream? That is how the Kappa architecture emerged around the year 2014.
It comprises, in the first instance, a storage layer, Apache Kafka, which as well as continuously gathering data is flexible about loading data sets, which may then be reprocessed as many times as necessary.
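Reprocessing in Kappa amounts to replaying the retained log from the beginning with a new version of the job. A small sketch with invented data and job functions: because the log is durable, `job_v2` can be run over exactly the same records that `job_v1` already consumed.

```python
# The retained records of a topic: because the log is durable, the same
# data can be read again from offset 0 by each new version of a job.
log = [{"temp_c": 20.0}, {"temp_c": 23.5}, {"temp_c": 21.0}]

def job_v1(records):
    """First version of the job: report the maximum in Celsius."""
    return max(r["temp_c"] for r in records)

def job_v2(records):
    """A later, corrected job: report the maximum in Fahrenheit instead,
    computed by replaying the very same retained records."""
    return max(r["temp_c"] * 9 / 5 + 32 for r in records)

print(job_v1(log))  # 23.5
print(job_v2(log))  # 74.3
```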
There is a second, analytical or stream-processing layer, for example Apache Flink, which supports the handling of asynchronous information; in other words, it distinguishes between the moment the information is generated (event time), the moment it is received by our systems (ingestion time), and the moment we process it (processing time).
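The difference between event time and processing time can be illustrated with a simplified pure-Python simulation of 60-second tumbling windows (in Flink itself this would be done with its windowing API and watermarks, which this sketch deliberately omits): an event that arrives late is still assigned to the window of the moment it was generated.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def event_time_windows(events):
    """Group events into 60-second tumbling windows by the timestamp
    recorded at the source (event time), not by arrival order."""
    windows = defaultdict(list)
    for event in events:
        window_start = event["event_time"] - event["event_time"] % WINDOW_SECONDS
        windows[window_start].append(event["value"])
    return dict(windows)

# The second event was generated first but arrived late: with event-time
# semantics it still lands in the earlier window.
events = [
    {"event_time": 70, "value": "b"},  # generated at t=70, arrived first
    {"event_time": 10, "value": "a"},  # generated at t=10, arrived late
]
print(event_time_windows(events))  # {60: ['b'], 0: ['a']}
```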
And lastly, there is a serving layer which exposes the results, or processed information, as well as the original raw data. Here there is greater freedom when choosing a technology or tool, and in fact several may coexist, each addressing a specific need. To study the relationships among our customers, for example, we would choose a graph NoSQL database. In contrast, to track our goods or stock, where we must keep the history of their location and/or hierarchy, we might recommend a document-oriented NoSQL database. And to measure and retrieve business KPIs, we would most likely choose a key-value database, preferably in-memory (no, don't worry, not HBase; Redis, for example).
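For the KPI case, a toy in-memory store mirroring the semantics of the Redis commands INCRBY and GET gives the flavour; `KpiStore` and the key name are invented for illustration, and a real serving layer would talk to Redis through a client library instead.

```python
class KpiStore:
    """Toy in-memory key-value store mirroring the Redis commands INCRBY
    and GET; a real serving layer would use a Redis client instead."""
    def __init__(self):
        self.data = {}

    def incrby(self, key, amount=1):
        """Atomically-ish increment a counter, creating it at 0 if absent."""
        self.data[key] = self.data.get(key, 0) + amount
        return self.data[key]

    def get(self, key):
        return self.data.get(key)

kpis = KpiStore()
kpis.incrby("orders:2019-02-08", 3)
kpis.incrby("orders:2019-02-08")
print(kpis.get("orders:2019-02-08"))  # 4
```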
At Treelogic we have been committed to this type of architecture almost from the very start, since mid-2015, in our research projects, so that we can then transfer that knowledge to our customers. We are currently focused on streamlining our deployments with container images (Docker) and the emerging orchestrators (Kubernetes).
TREELOGIC BIG DATA ARCHITECTURES
The millions of pieces of data generated every day in the digital age would be of no use without systems to channel all that information. Big Data is the name given to the group of technologies that enables this data to be processed at scale.
THE TREELOGIC APPROACH: WE DEAL WITH DATA
One of Treelogic’s main objectives, in all of our projects, is to help the client discover how data can add value to their business. Identifying and exploiting the competitive advantage within any sector is fundamental in order to achieve the best market position.
INDUSTRY 4.0, THE LATEST REVOLUTION
Big Data, Artificial Intelligence (AI), Machine Learning, Deep Learning, computer vision and automation are trending terms that form part of the latest socioeconomic movement of our time, the fourth industrial revolution. A change that is already transforming production processes and that affects our daily lives.