BRIEF ON HADOOP
Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
The Apache Hadoop framework is composed of the following modules:
- Hadoop Common - contains the libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
- Hadoop MapReduce - a programming model for large-scale data processing (a minimal word-count sketch follows this list).
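To illustrate the MapReduce programming model, here is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. The class name and the command-line input/output paths are placeholders, not part of any particular distribution.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is typically packaged as a JAR and submitted with the hadoop jar command, with HDFS directories as input and output.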
All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines or racks of machines) are common and should therefore be handled automatically in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is now commonly considered to consist of a number of related projects as well: Apache Pig, Apache Hive, Apache HBase, and others.
Hadoop consists of the Hadoop Common package, which provides filesystem- and OS-level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java Archive (JAR) files and scripts needed to start Hadoop. The package also provides source code, documentation and a contribution section that includes projects from the Hadoop community.
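To show what the filesystem abstraction in Hadoop Common looks like from application code, here is a small sketch that writes and reads a file through the org.apache.hadoop.fs.FileSystem API. The NameNode address and file path are hypothetical; in practice the default filesystem comes from the cluster's core-site.xml rather than being set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
  public static void main(String[] args) throws Exception {
    // Configuration normally picks up core-site.xml / hdfs-site.xml from the classpath;
    // setting fs.defaultFS explicitly here is only for illustration.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");  // hypothetical NameNode address

    FileSystem fs = FileSystem.get(conf);

    // Write a small file: the client obtains block placement from the NameNode
    // and streams the data to DataNodes.
    Path path = new Path("/tmp/hello.txt");
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello, hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back through the same abstraction.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}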
For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is located. Hadoop applications can use this information to run work on the node where the data resides and, failing that, on the same rack/switch, reducing backbone traffic. HDFS uses this method when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur, the data may still be readable.
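The following is a simplified, illustrative sketch of the rack-aware placement idea, not Hadoop's actual placement code: given the rack of each candidate node, it prefers to put replicas on distinct racks so that a single rack failure cannot take out every copy. The node and rack names are hypothetical.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RackAwarePlacementSketch {

  // Choose up to 'replicas' nodes, preferring one node per rack first,
  // then filling any remainder from whatever nodes are left.
  static List<String> chooseReplicaNodes(Map<String, String> nodeToRack, int replicas) {
    List<String> chosen = new ArrayList<>();
    Set<String> usedRacks = new HashSet<>();

    // First pass: at most one replica per distinct rack.
    for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
      if (chosen.size() == replicas) break;
      if (usedRacks.add(e.getValue())) {
        chosen.add(e.getKey());
      }
    }
    // Second pass: if there are fewer racks than replicas, reuse racks.
    for (String node : nodeToRack.keySet()) {
      if (chosen.size() == replicas) break;
      if (!chosen.contains(node)) {
        chosen.add(node);
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    // Hypothetical topology: node -> rack (i.e. network switch).
    Map<String, String> topology = new LinkedHashMap<>();
    topology.put("node1", "/rack1");
    topology.put("node2", "/rack1");
    topology.put("node3", "/rack2");
    topology.put("node4", "/rack3");

    // With 3 replicas the copies land on three different racks.
    System.out.println(chooseReplicaNodes(topology, 3));  // [node1, node3, node4]
  }
}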
A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes; these are normally used only in non-standard applications. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard start-up and shutdown scripts require Secure Shell (SSH) to be set up between nodes in the cluster.
In a larger cluster, HDFS is managed through a dedicated NameNode server that hosts the file-system index, and a secondary NameNode that can generate snapshots of the NameNode's memory structures, thus preventing file-system corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate file system, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the file-system-specific equivalent.
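As a rough sketch of what an alternate file system looks like from the client side, the same FileSystem abstraction can be pointed at a different store simply by URI scheme. The s3a bucket below is hypothetical, and the matching connector (for example hadoop-aws) would have to be on the classpath.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AlternateFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The scheme selects the implementation: hdfs:// talks to a NameNode,
    // while other schemes (s3a://, file://, ...) use their own metadata services.
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);  // hypothetical bucket

    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}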