Hadoop – A Leader in Big Data Analytics
What is Hadoop
- Hadoop is an open source framework for writing and running distributed applications that process large amounts of data
- Key characteristics of Hadoop are:
- Accessible : Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2)
- Robust : Because it is intended to run on commodity hardware, Hadoop is designed to gracefully handle frequent hardware failures
- Scalable : Hadoop scales linearly to handle larger data by adding more nodes to the cluster
- Simple : Hadoop allows users to quickly write efficient parallel code
- The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft.
- People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth
- They needed to go through terabytes and petabytes of data to figure out
- which websites were popular, what books were in demand, and what kinds of ads appealed to people
- Existing tools were becoming inadequate to process such large data sets
Evolution of Hadoop
- Google was the first to publicize MapReduce—a system they had used to scale their data processing needs
- Doug Cutting later led an open source implementation of this MapReduce system, called Hadoop, so that organizations would not each have to reinvent a proprietary tool of their own
- Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter
Hadoop – A distributed system
- Hadoop ties many low-end/commodity machines together into a single functional distributed system, rather than relying on one ever-bigger server; a bigger server is not necessarily the best solution to a large-scale problem
- Consider the price performance of current I/O technology. A high-end machine with four I/O channels, each with a throughput of 100 MB/sec, would require 3 hours to read a 4 TB data set. With Hadoop, the same data set is divided into smaller (typically 64 MB) blocks that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS), and the same read, performed in parallel, takes only a few minutes.
- Hadoop avoids repeated transmission of data between client and server; instead, it focuses on moving code to the data
- The clients send only the MapReduce programs to be executed
- The move-code-to-data philosophy applies within the Hadoop cluster itself
- Data is broken up and distributed across the cluster, and as much as possible, computation on a piece of data takes place on the same machine where that piece of data resides
- The programs to run (“code”) are orders of magnitude smaller than the data and are easier to move around
- It takes more time to move data across a network than to apply the computation to it
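The read-time comparison above reduces to simple arithmetic; a quick sketch (the 200-node cluster size is an assumed figure for illustration, not from the text):

```python
# Back-of-envelope check: one fast machine vs. a cluster of commodity nodes.
TB = 10**12  # decimal units keep the arithmetic simple
MB = 10**6

data = 4 * TB
single_bw = 4 * 100 * MB               # four channels at 100 MB/sec
single_time_s = data / single_bw       # ~10,000 s, i.e. about 3 hours

nodes = 200                            # assumed cluster size
cluster_bw = nodes * 100 * MB          # each node reads its own local blocks
cluster_time_s = data / cluster_bw     # ~200 s, i.e. about 3 minutes

print(round(single_time_s / 3600, 1), "hours vs",
      round(cluster_time_s / 60, 1), "minutes")
```

The aggregate bandwidth, and hence the speedup, grows linearly with the number of nodes, which is the scale-out argument in miniature.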
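The move-code-to-data philosophy above can be sketched as a toy simulation (node names, data, and the word-count function are purely illustrative):

```python
# Each "node" holds its own block of data; the client ships only a small
# function, and each node sends back a small local result.
nodes = {
    "node1": ["apple banana", "banana cherry"],
    "node2": ["apple apple", "cherry"],
}

def local_word_count(lines):
    # This function is the "code" that moves to the data.
    return sum(len(line.split()) for line in lines)

# Computation happens where the data lives; only tiny counts travel back.
partials = {name: local_word_count(lines) for name, lines in nodes.items()}
total = sum(partials.values())
print(total)  # 7 words in all, computed without moving the data
```

The function is a few dozen bytes; each data block would be tens of megabytes, which is why shipping the code is the cheaper direction.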
Hadoop vs SQL Database
SQL (structured query language) is by design targeted at structured data, whereas many of Hadoop’s initial applications deal with unstructured data such as text, so Hadoop provides a more general paradigm than SQL
- Scale-out instead of Scale-up
- To run a bigger database we need a bigger machine, and at some point no machine big enough for the data set will be available
- Hadoop is designed to be a scale-out architecture operating on a cluster of commodity PC machines. We can add as many machines as needed. Hadoop clusters with ten to hundreds of machines are standard.
- Key/Value pairs instead of relational tables
- In a SQL database, data resides in tables whose relational structure is defined by a schema. The relational model has great formal properties, but many data types, such as text documents, images, and XML files, don't fit well into it. Also, large data sets are often unstructured or semi-structured.
- Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with less-structured data types. Data can originate in any form, and Hadoop transforms it into key/value pairs for the processing functions to work on
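A minimal sketch of the idea (the log format and keying rule are invented for illustration): any record, however unstructured, can be turned into a key/value pair with no table schema required:

```python
# Semi-structured log lines become (key, value) pairs, Hadoop's basic
# data unit, without any predefined schema.
lines = [
    "2014-01-01 ERROR disk full",
    "2014-01-01 INFO started",
    "2014-01-02 ERROR timeout",
]

# Any rule for deriving a key works; here the key is the log level.
pairs = [(line.split()[1], line) for line in lines]
print(pairs)
```

A different job could key the very same lines by date instead; nothing about the data itself has to change.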
- Functional programming (MapReduce) instead of declarative queries (SQL)
- SQL is fundamentally a high-level declarative language: we state the result we want, and the database engine figures out how to derive it when the query runs. Under SQL we write query statements.
- Under MapReduce we specify the actual steps in processing the data. Under MapReduce we have scripts and code.
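The contrast can be made concrete with a word count. The single declarative statement `SELECT word, COUNT(*) FROM docs GROUP BY word` becomes three explicit steps under MapReduce; a toy, single-process sketch:

```python
from collections import defaultdict

docs = ["see spot run", "run spot run"]

# Map: emit a (word, 1) pair for every word.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the counts for each word.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'see': 1, 'spot': 2, 'run': 3}
```

In real Hadoop the map and reduce steps run on different machines and the framework performs the shuffle, but the programmer writes essentially these two functions.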
- Offline batch processing instead of online transactions
- In online transaction processing, the type of load is the random reading and writing of a few records
- Hadoop is designed for offline processing and analysis of large-scale data. Hadoop is best used as a write-once, read-many-times type of data store.
What is MapReduce
- MapReduce is a data processing model that can easily scale the processing of a large data set over multiple computing nodes
- Under the MapReduce model, the data processing primitives are called mappers and reducers.
- One of its great advantages is that scaling an application written in MapReduce form to run over hundreds, thousands, or even tens of thousands of machines in a cluster requires merely a configuration change
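A toy sketch of that scaling claim (the partitioning scheme is illustrative): the mapper and reducer code stays the same whether the input is split one way or ten ways, so only the configuration changes:

```python
from collections import Counter

def mapper(chunk):
    # Count words within one input split.
    return Counter(word for line in chunk for word in line.split())

def reducer(partial_counts):
    # Merge the per-split counts into the final result.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

lines = ["a b", "b c", "c c"] * 10

def run(num_splits):
    # Only this partitioning "configuration" varies between runs.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    return reducer(mapper(split) for split in splits)

print(run(1) == run(10))  # True: same code, same answer, more "nodes"
```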
Building blocks of Hadoop
On a fully configured cluster, “running Hadoop” means running a set of daemons, or resident programs, on the different servers in the network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include:
- Hadoop employs a master/slave architecture for both distributed storage and distributed computation
- The distributed storage system is called the Hadoop Distributed File System, or HDFS
- NameNode
- The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks
- The NameNode is the bookkeeper of HDFS; it keeps track of how files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem
- DataNode
- Each slave machine in the cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem
- It reads and writes HDFS blocks to actual files on the local filesystem
- A DataNode may communicate with other DataNodes to replicate its data blocks for redundancy
- DataNodes are constantly reporting to the NameNode
- Upon initialization, each of the DataNodes informs the NameNode of the blocks it’s currently storing
- Once the mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk
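The bookkeeping described above can be sketched as a simple inversion of the DataNodes' block reports (node and block names are hypothetical):

```python
# Each DataNode reports the blocks it stores; the NameNode inverts the
# reports into a block -> locations map used to serve client reads.
reports = {
    "datanode1": ["blk_1", "blk_2"],
    "datanode2": ["blk_2", "blk_3"],
    "datanode3": ["blk_1", "blk_3"],
}

block_locations = {}
for node, blocks in reports.items():
    for block in blocks:
        block_locations.setdefault(block, []).append(node)

# Every block lives on two nodes here, reflecting replication.
print(sorted(block_locations["blk_2"]))  # ['datanode1', 'datanode2']
```

When a client opens a file, the NameNode consults exactly this kind of map to tell it which DataNodes hold each of the file's blocks.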
- Secondary NameNode
- The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS
- It communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration
- The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data
- JobTracker
- Once the code is submitted to the cluster, the JobTracker determines the execution plan: it decides which files to process, assigns nodes to different tasks, and monitors all tasks as they're running
- If a task fails, the JobTracker will automatically relaunch it, possibly on a different node, up to a predefined limit of retries
- TaskTracker
- Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns
- Though there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel
- If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster
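A toy sketch of that failure-detection loop (the timeout value and task names are made up; real heartbeat intervals are set by cluster configuration):

```python
TIMEOUT_S = 600  # assume the JobTracker waits 10 minutes

# Timestamp (seconds) of the last heartbeat seen from each TaskTracker,
# and the tasks currently assigned to it.
last_heartbeat = {"tracker1": 1000.0, "tracker2": 200.0}
assigned_tasks = {"tracker1": ["task_3"], "tracker2": ["task_7", "task_9"]}

def tasks_to_resubmit(now):
    # Any tracker silent for longer than TIMEOUT_S is presumed crashed;
    # its tasks go back into the pool for other nodes to pick up.
    dead = [t for t, seen in last_heartbeat.items() if now - seen > TIMEOUT_S]
    return [task for t in dead for task in assigned_tasks[t]]

print(tasks_to_resubmit(now=1100.0))  # ['task_7', 'task_9']
```

Because the data for those tasks is replicated on several DataNodes, the resubmitted tasks can usually still run next to a local copy of their input.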
- Hadoop avoids the costly transmission step when working with large data sets by using distributed storage and transferring code instead of data
- When using Hadoop's MapReduce framework, we don't need to worry about partitioning the data, determining which nodes will perform which tasks, or handling communication between nodes