What is HDFS ?
The Hadoop distributed file system
(HDFS) is a distributed, scalable, and portable file-system written in Java for
the Hadoop framework. Each node in a Hadoop instance typically has a single
namenode; a cluster of datanodes form the HDFS cluster. The situation is
typical because each node does not require a datanode to be present. Each
datanode serves up blocks of data over the network using a block protocol
specific to HDFS. The file system uses the TCP/IP layer for communication.
Clients use Remote procedure call (RPC) to communicate between each other.
HDFS stores large files (typically in
the range of gigabytes to terabytes) across multiple machines. It achieves
reliability by replicating the data across multiple hosts, and hence does not
require RAID storage on hosts. With the default replication value, 3, data is
stored on three nodes: two on the same rack, and one on a different rack. Data
nodes can talk to each other to rebalance data, to move copies around, and to
keep the replication of data high. HDFS is not fully POSIX-compliant, because
the requirements for a POSIX file-system differ from the target goals for a
Hadoop application. The trade-off of not having a fully POSIX-compliant
file-system is increased performance for data throughput and support for
non-POSIX operations such as Append.
The HDFS file system includes a
so-called secondary namenode, which misleads some people into thinking that
when the primary namenode goes offline, the secondary namenode takes over. In
fact, the secondary namenode regularly connects with the primary namenode and
builds snapshots of the primary namenode’ s directory information, which the
system then saves to local or remote directories. These check pointed images
can be used to restart a failed primary namenode without having to replay the
entire journal of file-system actions, then to edit the log to create an
up-to-date directory structure. Because the namenode is the single point for
storage and management of metadata, it can become a bottleneck for supporting a
huge number of files, especially a large number of small files. HDFS
Federation, a new addition, aims to tackle this problem to a certain extent by
allowing multiple name-spaces served by separate namenodes.
An advantage of using HDFS is data
awareness between the job tracker and task tracker. The job tracker schedules
map or reduce jobs to task trackers with an awareness of the data location.
LIMITATIONS:-
HDFS was designed for mostly immutable files and
may not be suitable for systems requiring concurrent write-operations.
Another limitation of HDFS is that it cannot be
mounted directly by an existing operating system. Getting data into and out of
the HDFS file system, an action that often needs to be performed before and
after executing a job, can be inconvenient.