A Decade into Big Data

December 14, 2017
sholt

2016 marked the 10-year anniversary of Hadoop, a name closely associated with “Big Data.” Before the advent of Big Data, companies invested in solutions that were not forward-looking and could address only the immediate needs of the business. These traditional solutions were expensive, especially considering their limited capabilities.

The data landscape then was quite different from what it is today. Significant upfront investments were required to handle just a few dozen terabytes. Scaling was an issue, as most solutions incorporated specialised hardware and were built with a scale-up rather than a scale-out approach. Things started changing with the emergence of multi-core processors, distributed storage and the rise of social media. Organisations that had been driven purely by use cases now started looking at things from the other end: “the Data.”

Big Data Era – 2008

The first major step towards this data-oriented approach came in the form of Hadoop, the Data-Hungry Big Yellow Elephant created by Doug Cutting. Born out of Google's research papers, Hadoop introduced a new way of looking at data. Though the initiative started around 2006, it was not until 2008 that it was officially launched as an open-source Apache project. Yahoo, which hired Cutting, rolled out its two main components – Hadoop Distributed File System (HDFS), a cheaper alternative for distributed storage, and Map-Reduce (MR), an efficient way to parallelise computations.
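
To make the Map-Reduce idea concrete, here is a minimal single-machine sketch in Python of the classic word-count example. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would run the map and reduce steps in parallel on many machines over data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (a cluster would parallelise map and reduce)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big elephant"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what makes the model suitable for scale-out hardware.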

So now we had an affordable tool that could solve some of the common technology problems like Search, Join, Merge, etc. Many organisations were struggling with these problems, and they could see immediate value in Hadoop.

Banks and financial institutions that ran daily and monthly batch jobs experienced major improvements, as they were able to reduce execution times to an hourly basis. Also, with social networking and digital marketing gaining traction, companies wanted to use social data for more effective online campaigns, targeting and even recruitment, forming a strong business case for data sharing.

Business Intelligence teams started deriving better insights with Hadoop-based tools. With SQL remaining, even today, a popular choice among data and business analysts, Hadoop added an SQL abstraction called Hive. This was a major development for accelerating Hadoop adoption, as organisations did not have to invest in specialised skills and could rely on tried and tested techniques like SQL. Facebook was a major advocate of SQL on Hadoop and contributed to the Apache Hive project. Despite all this, however, processing was still taking place in batch windows and execution times were in the order of tens of minutes.

The next major development during this period was the coming together of developers and statisticians. Data scientists were struggling to make their process-intensive workloads run on a single machine. The Apache Hadoop community developed Apache Mahout, a solution that could parallelise these algorithms. Mahout was essentially the data science driver, the one who tamed the data-elephant called Hadoop.

However, Mahout required Java (a language used mainly by developers), whereas data scientists mostly used languages like R and Python. It was for this reason that data scientists started working together with developers to ensure their algorithms would migrate to Hadoop.

Although more and more companies began to realise the value of Hadoop, they were not ready to make the big shift due to its high dependency on the open-source community. Companies needed the kind of support and training that only a strong vendor could provide. It was then that Big Data vendors like Cloudera and Hortonworks entered the market. Founded mainly by developers of and contributors to Apache Hadoop, these vendors provided companies with the support and tools required to deploy and run a Hadoop platform.

Cloudera was founded in 2008 (Cutting joined in 2009), offering a Hadoop-based platform incorporating open standards. In 2009, ex-Google employees started MapR with a vision of an enterprise-grade platform that would enable seamless Hadoop access and data management capabilities through a secure and reliable environment. Finally, in 2011, Hortonworks came to life as an open-source solution funded by Yahoo-led venture capital.

In parallel to this commercial evolution within the Big Data space, the Yahoo team (mostly consisting of the yet-to-be-Hortonworks) introduced the next big change to Hadoop, YARN (Yet Another Resource Negotiator). Until then, Hadoop was limited to problems that could fit into a discrete Map-Reduce paradigm, yet there were a lot of real-life problems that would not fit into this paradigm. Hadoop 2 was developed to move beyond the original Map-Reduce-only design, with YARN as its key component. This enabled running different types of workloads and use cases on Hadoop, no longer limited by Map-Reduce. The fundamental idea of YARN was to split the functions of resource management and job scheduling/monitoring into separate daemons. This allowed multiple data processing engines, such as interactive SQL, real-time streaming, data science and batch processing, to handle data stored in a single platform, unlocking an entirely new approach to analytics.
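
The split described above can be illustrated with a toy Python sketch. This is not the YARN API: the ResourceManager and ApplicationMaster classes below borrow YARN's terminology but are hypothetical, simplified stand-ins. One global object only hands out capacity ("containers"), while each application brings its own master that schedules and monitors its own tasks.

```python
class ResourceManager:
    """Toy global resource manager: it only tracks free cluster capacity."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        """Grant as many containers as are free, up to the requested number."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        """Return containers to the shared pool."""
        self.free += count


class ApplicationMaster:
    """Toy per-application master: schedules and monitors one job's tasks."""

    def __init__(self, name, tasks, resource_manager):
        self.name = name
        self.pending = list(tasks)  # callables that have not run yet
        self.done = []              # results of completed tasks
        self.rm = resource_manager

    def run_round(self, wanted):
        """Ask for containers, run one task per granted container, then release them."""
        granted = self.rm.allocate(min(wanted, len(self.pending)))
        batch, self.pending = self.pending[:granted], self.pending[granted:]
        self.done.extend(task() for task in batch)
        self.rm.release(granted)
        return granted

    def finished(self):
        return not self.pending


if __name__ == "__main__":
    rm = ResourceManager(total_containers=3)
    # Two different kinds of workload share the same cluster capacity.
    batch_job = ApplicationMaster("batch", [lambda i=i: i * i for i in range(4)], rm)
    sql_job = ApplicationMaster("sql", [lambda: "row"], rm)
    while not (batch_job.finished() and sql_job.finished()):
        batch_job.run_round(wanted=2)
        sql_job.run_round(wanted=1)
    print(batch_job.done, sql_job.done)
```

The point of the design is that the resource manager knows nothing about what a task is, so any engine that can supply its own application master can share the cluster.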

Another important contributor to the Big Data evolution has been Apache Spark, originally developed at UC Berkeley in 2009 to address the performance limitations and increased costs associated with Hadoop’s disk-based processing approach. Spark is a powerful open-source processing engine that keeps working data in memory. With the release of Hadoop 2, Spark was able to run on Hadoop, greatly increasing the usability of both by allowing them to handle any type of workload – batch, real-time, and data science.

The next logical step was to take everything to the cloud and enhance affordability, scalability and flexibility. Amazon was already providing the capability to run Hadoop in the cloud with its Elastic MapReduce (EMR) offering, whereas Microsoft joined the race with Azure HDInsight, a cloud distribution of Hadoop components powered by the Hortonworks Data Platform (HDP).

Data Lake Era – 2010

The term Data Lake was coined by James Dixon in 2010, and it refers to a single dumping ground for all the required data in its original form. Before the Hadoop disruption, the most popular option for data analysis was the Data Warehouse, which employed ETL (extract, transform, load) techniques. The data would be extracted, quality-checked and cleansed by data stewards, transformed and aggregated by data analysts, and finally loaded into the system for the reporting needs of Business Intelligence (BI) teams. These pipelines work efficiently as long as they all adhere to a single unified data model. This is the prerequisite for a typical Data Warehouse solution.

In the Hadoop-based approach, the data is stored in its original raw format, without the need for a single unified schema. As we discussed earlier, Hadoop had already driven down storage costs, so organisations could store a lot more data in raw format, rather than just their processed data (an approach difficult to achieve with Data Warehouse solutions due to their high storage costs).

These data, stored in their raw format, can be processed when they are actually needed. This makes the data far more reusable, as the storage is no longer limited to the use case that is being considered. Use cases can keep evolving and new insights can be derived at a later stage, without changing the data storage.

With the Data Lake, the approach changed from Schema-On-Write to Schema-On-Read, and pipelines moved from ETL to ELT. The decision about how to structure the data is postponed until the data is about to be used, rather than being made when it is stored. This is a big gain for data science workloads, as data scientists prefer to experiment with data in its original form rather than being constrained by a strict schema.
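
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only: the sample records, the column names and the two helper functions (load_schema_on_write, read_with_schema) are invented for the example and do not correspond to any particular warehouse or Hadoop tool.

```python
import json

# Raw events as they arrive; note that the two records do not share the same fields.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "channel": "web"}',
    '{"user": "bob", "amount": "7", "device": "mobile"}',
]


def load_schema_on_write(raw_lines, columns):
    """Schema-on-write: shape every record at load time.

    Fields outside the schema are discarded, and a record missing a
    required column raises KeyError, so it never reaches storage.
    """
    table = []
    for line in raw_lines:
        record = json.loads(line)
        table.append({col: cast(record[col]) for col, cast in columns.items()})
    return table


def read_with_schema(raw_lines, columns):
    """Schema-on-read: keep the raw lines as they are and apply a schema per query."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            col: cast(record[col]) if col in record else None
            for col, cast in columns.items()
        }


if __name__ == "__main__":
    # The same raw data serves two different use cases, each with its own schema.
    print(list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float})))
    print(list(read_with_schema(RAW_EVENTS, {"user": str, "device": str})))
```

With schema-on-write, the "device" field would have been thrown away at load time unless someone had anticipated it; with schema-on-read it is still there when a new use case asks for it.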

Data Warehouse vendors recognised this shift and started porting their platforms to Hadoop. Today, we find most ETL tools, like those from Informatica or Talend, running on Hadoop. Organisations have a choice whether to move away from a Data Warehouse or use it together with a Hadoop platform. The most common pattern is to use a Data Warehouse for the hot data that has just arrived, and later move it into a Hadoop-based Data Lake for long-term usage.

As SQL became a de facto standard in Hadoop, many vendors built more powerful and mature tools on it, like Spark SQL, Apache Phoenix and Apache Drill. The platform itself became more robust and secure with additional components added to the stack, like Ranger, Knox, Sentry and Atlas. Between them, these provided governance, auditing, authorisation and access control for the Hadoop platform. These new tools and components, powered by the Hadoop community, could potentially make Data Warehouse solutions obsolete.

Data Fabric – 2016

Data Fabric has become the latest buzzword, with most companies looking into a unified hardware/software solution. To make things clear right from the beginning, Data Fabric is not an application or a piece of software. It is a strategic approach towards data and storage. It is focused on how to store, manage, transfer and maintain data. This covers a much wider spectrum including but not limited to on-premise systems, offsite cloud hosted systems, data backups and archival, and other silos.

The pre-Data Fabric approach treated each data management platform as a closed environment with rules applied only within it. For organisations with multiple clusters, both on-premise and in the cloud, this approach is not fit for purpose. A Data Fabric spans not only traditional, virtual, hybrid and cloud environments, but also different management platforms. Organisations can better plan and manage their data, without being limited to a single-cluster view.

One of the better ways to manage data across these environments is to classify it into newly arrived data (Hot), data that is a few days or weeks old (Warm), and old archived data (Cold). These data can flow between systems and environments, transparently managed using global data rules that control the flow. Big Data clusters are now becoming mature enough to provide a multi-cluster and multi-environment view from a centralised access point.
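
As a rough sketch of how such a classification and a global placement rule might look, consider the Python below. Everything specific in it is an assumption made for illustration: the age thresholds, the target system names and the function names are hypothetical, and a real Data Fabric product would express these rules in its own policy language.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the right values depend on the organisation.
HOT_MAX_AGE = timedelta(days=1)
WARM_MAX_AGE = timedelta(days=30)

# Hypothetical global rule: which environment should hold each tier.
PLACEMENT_RULES = {
    "hot": "data-warehouse",
    "warm": "on-premise-cluster",
    "cold": "cloud-archive",
}


def classify(arrived_at, now):
    """Return the tier ('hot', 'warm' or 'cold') based on the age of the data."""
    age = now - arrived_at
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"


def place(datasets, now):
    """Apply the global placement rule to every dataset, whichever cluster it sits on."""
    return {
        name: PLACEMENT_RULES[classify(arrived_at, now)]
        for name, arrived_at in datasets.items()
    }


if __name__ == "__main__":
    now = datetime(2017, 12, 14)
    datasets = {
        "clickstream": now - timedelta(hours=3),
        "monthly_sales": now - timedelta(days=10),
        "audit_2015": now - timedelta(days=800),
    }
    print(place(datasets, now))
```

Because the rule is defined once and applied everywhere, data can move between environments as it ages without each cluster needing its own policy.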

In this data-centric world, where organisations discover more ways to connect and make use of data, Data Privacy becomes a key concern, and the Data Fabric approach addresses it. It is very difficult to keep track of how data flows into and between different data platforms. This is where a centralised security solution and data lineage are needed, to understand precisely how and where the data is being used. Financial and healthcare organisations can make use of the Data Fabric approach to ensure they are complying with data governance requirements and standards.

What Next?

As we’ve moved through the various stages of big data — from the early Hadoop era to the data lake and data fabric eras — we find ourselves wondering what will come next. Here are a few predictions:

Industrial appliances will see big growth with the adoption of Industry 4.0, utilising Sensors, Big Data, Augmented Reality (AR), Virtual Reality (VR), and Cloud Computing.
A big shift from batch to real-time has happened, with companies adopting event-driven approaches. Spark Streaming and Storm are being widely used for near real-time and event processing. As sensors and related technologies evolve, the data is being streamed into Big Data clusters for real-time analysis.
Data Science and Artificial Intelligence top the chart, with many companies coming forward to build predictive models and scoring engines to understand their business better. Apache Spark's machine learning library and Google TensorFlow are commonly used for predictive modelling and deep learning.

About the author: Raju Ramakrishna is a Big Data Architect working with WHISHWORKS Ltd in London, UK. In his 15 years of consulting and product development experience focusing on start-ups, he has helped many organisations exploit the power of Data. He is involved in many different areas of the Big Data landscape, with special interest in the Internet of Things (IoT), Data Science and Integration.
