JOSÉ RAMÓN HERVÁS | 08/02/2019
Most of us who work in Big Data came to the field after hearing about the famous Vs. There were originally three: volume, variety and velocity; these became four with the addition of value, driven by Machine Learning and Artificial Intelligence. And in some places they are already talking about as many as 10, what a challenge! Don't worry, I'm not going to explain them again.
Of the three initial Vs, which gave rise to this new trend or technology, the one which has changed the most over the last few years is velocity. On the first Apache Hadoop platforms, velocity meant that, when processing had to be accelerated (oh, those map and reduce functions, good times!), you increased the number of nodes to spread the load and so process the information in less time.
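To make the map and reduce idea concrete, here is a minimal pure-Python sketch of a MapReduce-style word count. It is only an illustration of the programming model, not Hadoop's API: each chunk stands in for the slice of data one node would process, so spreading the load means giving each node its own chunk.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one node's chunk of the input."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each chunk stands in for the data one node would process in parallel;
# adding nodes means adding chunks processed at the same time.
chunks = ["big data big", "data velocity"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'velocity': 1}
```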
Different projects emerged on top of Hadoop, in what is known as its ecosystem, to try to meet this type of requirement. One example is HBase, a distributed, column-oriented NoSQL database built on Hadoop's HDFS that made it possible to read and write data in real time thanks to its low latency. It is true that this solved certain problems, such as checking metrics or KPIs in real time that could then be shown in a scorecard. However, it wasn't, and isn't, enough.
Thus, large corporations began internal projects that would eventually be made public and become leading Apache projects. One example, from 2011, is LinkedIn's Apache Kafka: a low-latency storage system which can ingest any type of information while guaranteeing data availability, fault tolerance and platform scalability. And that was just its early days, because Kafka now has a whole ecosystem around it: Kafka Streams, Schema Registry... (I'm loving it).
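The core idea behind Kafka can be sketched as an append-only log with consumer offsets. The `MiniLog` class below is a made-up, in-memory toy that mimics a single topic partition; a real deployment would use a Kafka client library against a running broker.

```python
class MiniLog:
    """Toy sketch of a single Kafka topic partition: an append-only log
    where records are never overwritten and consumers track an offset."""
    def __init__(self):
        self.records = []

    def produce(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the offset assigned to this record

    def consume(self, offset, max_records=10):
        """Read records starting at a consumer's current offset."""
        return self.records[offset:offset + max_records]

log = MiniLog()
log.produce({"sensor": "s1", "value": 21.5})
log.produce({"sensor": "s2", "value": 19.8})
print(log.consume(offset=0))  # both records, in arrival order
```

Because nothing is deleted on read, many independent consumers can read the same log at their own pace, which is what gives Kafka its fault tolerance and scalability story.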
At the same time, other projects emerged from research. European research projects such as Stratosphere set out to develop the next generation of analytical and stream-processing tools. In 2014 this project would become the technology that we know today as Apache Flink.
In this way, with tools of both kinds available, what is known as the Lambda architecture was defined toward the end of 2013. This architecture combines batch processing with a speed layer to help solve the problems related to this requirement.
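A minimal sketch of the Lambda idea, with invented functions and data: a batch layer recomputes a view from the full history, a speed layer covers events the last batch run has not seen, and a serving layer merges the two. Note that the batch and speed layers must implement the same aggregation logic twice, which is precisely the maintenance problem discussed next.

```python
def batch_view(events):
    """Batch layer: recompute totals from the full (immutable) master data set."""
    totals = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return totals

def merge_views(batch, realtime):
    """Serving layer: combine the batch view with the speed layer's
    incremental view covering events since the last batch run."""
    merged = dict(batch)
    for user, amount in realtime.items():
        merged[user] = merged.get(user, 0) + amount
    return merged

history = [("ana", 10), ("ana", 5), ("luis", 7)]
recent = {"ana": 2}  # produced by the speed layer since the last batch run
print(merge_views(batch_view(history), recent))  # {'ana': 17, 'luis': 7}
```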
However, this architecture has several issues, the main one being: is it really necessary to maintain, with everything that entails, the code of two complex distributed systems that must produce the same result?
Why not improve the system as a whole and process all information as a data stream? That is how the Kappa architecture emerged around the year 2014.
It comprises, in the first instance, a storage layer, Apache Kafka, which as well as continuously gathering data is flexible about loading data sets, which may then be reprocessed as many times as necessary.
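Reprocessing in Kappa amounts to replaying the retained log from the beginning with a new version of the job. A small sketch with invented data and job functions: because the log is durable, `job_v2` can be run over exactly the same records that `job_v1` already consumed.

```python
# The retained records of a topic: because the log is durable, the same
# data can be read again from offset 0 by each new version of a job.
log = [{"temp_c": 20.0}, {"temp_c": 23.5}, {"temp_c": 21.0}]

def job_v1(records):
    """First version of the job: report the maximum in Celsius."""
    return max(r["temp_c"] for r in records)

def job_v2(records):
    """A later, corrected job: report the maximum in Fahrenheit instead,
    computed by replaying the very same retained records."""
    return max(r["temp_c"] * 9 / 5 + 32 for r in records)

print(job_v1(log))  # 23.5
print(job_v2(log))  # 74.3
```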
There is a second, analytical or stream-processing layer, for example Apache Flink, which supports the handling of asynchronous information; in other words, it distinguishes between the moment the information is generated (event time), the moment it is received by our systems (ingestion time), and the moment we process it (processing time).
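The difference between event time and processing time can be illustrated with a simplified pure-Python simulation of 60-second tumbling windows (in Flink itself this would be done with its windowing API and watermarks, which this sketch deliberately omits): an event that arrives late is still assigned to the window of the moment it was generated.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def event_time_windows(events):
    """Group events into 60-second tumbling windows by the timestamp
    recorded at the source (event time), not by arrival order."""
    windows = defaultdict(list)
    for event in events:
        window_start = event["event_time"] - event["event_time"] % WINDOW_SECONDS
        windows[window_start].append(event["value"])
    return dict(windows)

# The second event was generated first but arrived late: with event-time
# semantics it still lands in the earlier window.
events = [
    {"event_time": 70, "value": "b"},  # generated at t=70, arrived first
    {"event_time": 10, "value": "a"},  # generated at t=10, arrived late
]
print(event_time_windows(events))  # {60: ['b'], 0: ['a']}
```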
And lastly, there is a serving layer which exposes the results, or processed information, as well as the original raw data. Here there is greater freedom when choosing a technology or tool, and in fact several may coexist, each addressing a specific need. To study the relationships among our customers, for example, we would choose a graph NoSQL database. In contrast, to track our goods or stock, where we must keep the history of their location and/or hierarchy, we might recommend a document-oriented NoSQL database. And to measure and retrieve business KPIs, we would most likely choose a key-value database, preferably in-memory (no, don't worry, not HBase; Redis, for example).
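For the KPI case, a toy in-memory store mirroring the semantics of the Redis commands INCRBY and GET gives the flavour; `KpiStore` and the key name are invented for illustration, and a real serving layer would talk to Redis through a client library instead.

```python
class KpiStore:
    """Toy in-memory key-value store mirroring the Redis commands INCRBY
    and GET; a real serving layer would use a Redis client instead."""
    def __init__(self):
        self.data = {}

    def incrby(self, key, amount=1):
        """Atomically-ish increment a counter, creating it at 0 if absent."""
        self.data[key] = self.data.get(key, 0) + amount
        return self.data[key]

    def get(self, key):
        return self.data.get(key)

kpis = KpiStore()
kpis.incrby("orders:2019-02-08", 3)
kpis.incrby("orders:2019-02-08")
print(kpis.get("orders:2019-02-08"))  # 4
```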
At Treelogic we have been committed to this type of architecture almost from the very start, since mid-2015, in our research projects, so that we can then transfer that knowledge to our customers. We are currently focused on streamlining our deployments with container images (Docker) and the emerging orchestrators (Kubernetes).
TREELOGIC BIG DATA ARCHITECTURES
The millions of pieces of data generated every day in the digital age would be of no use without systems to channel all that information. Big Data is the name given to the group of technologies that enables this data to be processed at scale.
THE TREELOGIC APPROACH: WE DEAL WITH DATA
One of Treelogic’s main objectives, in all of our projects, is to help the client discover how data can add value to their business. Identifying and exploiting the competitive advantage within any sector is fundamental in order to achieve the best market position.
INDUSTRY 4.0, THE LATEST REVOLUTION
Big Data, Artificial Intelligence (AI), Machine Learning, Deep Learning, computer vision and automation are trending terms that form part of the latest socioeconomic movement of our time, the fourth industrial revolution. A change that is already transforming production processes and that affects our daily lives.