Hadoop: what is it and how can you learn to use it?


For several decades, companies mainly stored their data in Relational Database Management Systems (RDBMS) in order to keep it and run queries against it. However, this type of database cannot store unstructured data and is not suited to the vast volumes of Big Data.

Indeed, digitalization and the rise of technologies such as IoT and smartphones have caused an explosion in the volume of raw data. Faced with this revolution, new technologies are needed for storing and processing data. The Hadoop software framework responds to these new constraints.

What is Hadoop?

Hadoop is a software framework dedicated to the storage and processing of large volumes of data. It is an open-source project, sponsored by the Apache Software Foundation.

It is not a single product but a framework for distributed data storage and processing. Various software vendors have built commercial Big Data management products on top of Hadoop.

Hadoop's data systems are not limited in scale: more hardware and nodes can be added to a cluster to support a heavier load, without reconfiguring the system or purchasing expensive software licenses.
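In practice, applications interact with Hadoop's distributed storage layer, HDFS, through a simple file-oriented Java API. Below is a minimal sketch, assuming a running cluster and the standard Hadoop Java client; the NameNode address and the file path are placeholder assumptions, not values from this article.

```java
// Minimal sketch: write a file to HDFS and read it back through Hadoop's
// Java FileSystem API. Assumes a running cluster and the hadoop-client
// dependency; the NameNode address and path below are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode (address is an assumption).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; HDFS splits it into blocks and replicates them
        // across DataNodes transparently.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello from Hadoop");
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```

The point of the sketch is that the application sees ordinary file operations, while replication and distribution across the cluster happen behind the API.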

The history of Hadoop

The origin of Hadoop is closely linked to the exponential growth of the World Wide Web. As the web grew to include several billion pages, it became difficult to search for information efficiently.

We had entered the era of Big Data. It became more complex to store information efficiently for easy retrieval, and to process the data once stored.

To address this problem, many open-source projects were developed with the goal of returning web search results faster. One of the solutions adopted was to distribute the data and the computations across a cluster of servers to allow simultaneous processing.

This is how Hadoop was born. In 2002, Doug Cutting and Mike Cafarella were working on Apache Nutch, an open-source web crawler project. They faced difficulties storing its data, and the costs were extremely high.

In 2003, Google introduced the Google File System (GFS), a distributed file system designed to provide efficient access to data. In 2004, the company published a white paper on MapReduce, an algorithm for simplifying data processing on large clusters. These Google publications strongly influenced the genesis of Hadoop.
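To give a concrete sense of the model that the MapReduce paper describes, here is the canonical word-count example written with Hadoop's MapReduce Java API: the map phase emits a count of 1 for every word it sees, and the reduce phase sums those counts per word. This is a minimal sketch assuming a configured cluster, with input and output paths supplied on the command line.

```java
// Classic word count with the Hadoop MapReduce Java API:
// map emits (word, 1) pairs, reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework takes care of splitting the input across the cluster, running mappers in parallel, shuffling intermediate pairs by key, and feeding them to the reducers, which is exactly the simplification the paper proposed.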

Then, in 2005, Cutting and Cafarella unveiled their new file system, NDFS (Nutch Distributed File System), which also included MapReduce. When Doug Cutting joined Yahoo in 2006, he built on the Nutch project to launch Hadoop (whose name was inspired by his son's stuffed elephant) and its HDFS file system. Version 0.1.0 was released that year.

Hadoop then kept growing. In 2008, it became the fastest system to sort a terabyte of data, doing so in only 209 seconds on a cluster of 900 nodes. Version 2.2 was released in 2013, and version 3.0 in 2017.

In addition to its performance on Big Data workloads, the framework has brought many unexpected benefits. In particular, it has driven down the cost of server deployment.