Tuesday, 20 May 2014

My Experience of Preparing and Clearing Cloudera Hadoop Developer Certification CCD-410

I feel ecstatic about completing this certification with a decent score on the first attempt. It was like those struggling days when we used to study hard for our final exams in college...

My preparation started when I joined a new company and landed in a Hadoop project. I had zero idea about Hadoop and was worried about how I could excel in this field.
As the project demanded Hadoop MapReduce, I first started developing basic MR code.
Having basic experience in Java helped me. Slowly and steadily, by solving assignments and reading about the Hadoop ecosystem, I started liking the subject.
Whatever I read or understood, I would draw the concepts on paper and publish them on my blog.
Blogging gave me a great boost, and I developed more interest in learning by creating and reading blogs.
After a good six months of learning, I decided to crack CCD-410.

So, some much-needed tips:


  •  Have basic experience in Java and OOPs concepts; an ETL background is optional.
  •  Study Hadoop: The Definitive Guide completely. It is the bible of Hadoop.
  •  Get a clear understanding of all Hadoop concepts, both practically and theoretically. Try to write as much MR code as you can, developing complex key/value pairs (a small sketch follows this list).
  •  The Yahoo MR Tutorials are good for beginners.
  •  The Hadoop ecosystem is huge. Try to get a glimpse of sub-projects like HBase, Hive, Pig, Flume, Oozie, and Sqoop; study what they do and where they can be useful.
  •  Try to blog! Blogs are very important. Read as many Hadoop blogs as you can; best of all, create your own blog!
  •  If you are completely new to Hadoop and require some training, many institutes provide it; I feel Nixus Technologies is coming up well.
  •  The exam follows the pattern discussed in the Study Guide. Read the guide and understand which modules carry the most weight.
  •  If you feel weak in MR coding, you can download a simulator from Hadoop Developer Simulator and start solving. It is very good software, and the exam questions are quite similar to the ones in it.
  •  Believe and trust in yourself and crack it! All the very best!
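To give an idea of the kind of practice I mean, here is a minimal MapReduce sketch that builds a composite key. The class and field names (UserDateCountJob, the "user,date,action" input layout, the paths) are only illustrative, not from any particular project.

// A minimal practice sketch: count records per (user, date) pair
// by building a composite Text key in the mapper.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserDateCountJob {

  // Mapper: input lines are assumed to look like "user,date,action";
  // the composite key "user\tdate" groups all actions for a user on a day.
  public static class UserDateMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length < 2) {
        return;                        // skip malformed records
      }
      compositeKey.set(fields[0] + "\t" + fields[1]);
      context.write(compositeKey, ONE);
    }
  }

  // Reducer: sums the counts for each (user, date) composite key.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "user-date-count");
    job.setJarByClass(UserDateCountJob.class);
    job.setMapperClass(UserDateMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}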


                           

Wednesday, 2 April 2014

Hadoop Distributed File System Architecture





What is HDFS?

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode plus a cluster of datanodes that form the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to talk to the namenode and datanodes.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the target goals for a Hadoop application. The trade-off of not having a fully POSIX-compliant file-system is increased performance for data throughput and support for non-POSIX operations such as Append.
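As a rough illustration of how a client interacts with HDFS, here is a small sketch using the Java FileSystem API to write a file and then read back its replication factor from the namenode's metadata. The namenode address and paths are placeholders, and the exact configuration key names vary between Hadoop versions.

// Sketch: writing a file to HDFS and checking its replication factor.
// The fs.default.name URI and paths below are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:8020");  // placeholder address

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt");

    // Create the file; HDFS replicates its blocks (3 copies by default).
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello hdfs");
    out.close();

    // Read back the replication factor recorded in the namenode's metadata.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication());

    fs.close();
  }
}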

The HDFS file system includes a so-called secondary namenode, a name that misleads some people into thinking that when the primary namenode goes offline, the secondary namenode takes over. In fact, the secondary namenode regularly connects to the primary namenode and builds snapshots (checkpoints) of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions; only the recent edit log needs to be applied on top of the checkpoint to produce an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files. HDFS Federation, a newer addition, aims to tackle this problem to a certain extent by allowing multiple namespaces to be served by separate namenodes.

An advantage of using HDFS is data awareness between the job tracker and task trackers: the job tracker schedules map and reduce tasks to task trackers with an awareness of the data location, so computation can run close to the data.





Limitations:
HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write-operations.
Another limitation of HDFS is that it cannot be mounted directly by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient.
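For getting data in and out, the hadoop fs shell commands are the usual route; the same can be done from Java, as in this rough sketch (the paths and namenode address are again placeholders).

// Sketch: moving data into and out of HDFS with the FileSystem API
// (the programmatic equivalent of "hadoop fs -put / -get").
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:8020");  // placeholder address
    FileSystem fs = FileSystem.get(conf);

    // Stage input data into HDFS before running a job...
    fs.copyFromLocalFile(new Path("/tmp/input.csv"),
                         new Path("/user/demo/input/input.csv"));

    // ...and pull results back out afterwards.
    fs.copyToLocalFile(new Path("/user/demo/output/part-r-00000"),
                       new Path("/tmp/part-r-00000"));

    fs.close();
  }
}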


Tuesday, 21 January 2014


Rack Awareness in Hadoop


Rack Awareness

·        For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.
·        For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.
·        HDFS can be made rack-aware by means of a script that allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used (a Java alternative is sketched after this list), the default implementation lets you provide an executable script which returns the "rack address" of each of a list of IP addresses.
·        The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.
·        To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to the script as several separate command-line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key.
·        Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.
·        Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).
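As one example of an alternate strategy, the mapping can also be supplied as a Java class instead of a script. The sketch below assumes a Hadoop-1.x-era setup, where org.apache.hadoop.net.DNSToSwitchMapping declares a single resolve method and the class is wired in through the topology.node.switch.mapping.impl key (newer releases rename the key and extend the interface); the rack-naming rule itself is invented for illustration.

// Sketch of a rack mapper plugged in via topology.node.switch.mapping.impl
// (an alternative to topology.script.file.name). The IP-to-rack rule below
// is made up: it derives the rack id from the third octet of the address.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

public class ThirdOctetRackMapping implements DNSToSwitchMapping {

  // Receives host names or IP addresses; must return one rack id per input,
  // in the same order (the same contract as the topology script's stdout).
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String[] octets = name.split("\\.");
      if (octets.length == 4) {
        racks.add("/dc1/rack-" + octets[2]);   // e.g. 10.1.42.7 -> /dc1/rack-42
      } else {
        racks.add("/default-rack");            // fall back for unparsed names
      }
    }
    return racks;
  }
}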


Fair Scheduler in Hadoop


Fair Scheduler
·       The core idea behind the fair share scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources. The result is that jobs that require less time are able to access the CPU and finish intermixed with the execution of jobs that require more time to execute. This behaviour allows for some interactivity among Hadoop jobs and permits greater responsiveness of the Hadoop cluster to the variety of job types submitted. The fair scheduler was developed by Facebook.

·       The Hadoop implementation creates a set of pools into which jobs are placed for selection by the scheduler.
Each pool can be assigned a number of shares to balance resources across the jobs in the pools (more shares means a larger portion of the cluster's resources for that pool's jobs). By default, all pools have equal shares, but they can be configured with more or fewer shares depending upon the job type.
The number of jobs active at one time can also be constrained, if desired, to minimize congestion and allow work to finish in a timely manner.

·       To ensure fairness, each user is assigned to a pool. In this way, if one user submits many jobs, he or she can receive the same share of cluster resources as all other users (independent of the work they have submitted). Regardless of the shares assigned to pools, if the system is not loaded, jobs receive the shares that would otherwise go unused (split among the available jobs).





·       The scheduler implementation keeps track of the compute time each job has received. Periodically, it computes the difference between the compute time a job actually received and the time it would have received under an ideal fair scheduler. That difference is the job's deficit, and the scheduler ensures that the job with the highest deficit is scheduled next.
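The deficit idea is easier to see in a toy example. The following is not Hadoop's actual scheduler code, just a sketch of the bookkeeping: each job's deficit is the time an ideal fair scheduler would have given it minus the time it actually received, and the neediest job is picked next.

// Toy illustration of deficit-based fair scheduling (not Hadoop's real code).
import java.util.ArrayList;
import java.util.List;

public class FairShareDeficitDemo {

  static class JobInfo {
    final String name;
    double receivedTime;   // compute time actually given to this job so far
    double idealTime;      // time an ideal fair scheduler would have given it

    JobInfo(String name) { this.name = name; }

    double deficit() { return idealTime - receivedTime; }
  }

  // Advance the clock by 'elapsed' seconds: every job's ideal share grows by
  // an equal split, while only the job that ran accumulates real compute time.
  static void tick(List<JobInfo> jobs, JobInfo ranJob, double elapsed) {
    for (JobInfo job : jobs) {
      job.idealTime += elapsed / jobs.size();
    }
    ranJob.receivedTime += elapsed;
  }

  // Pick the job with the highest deficit to schedule next.
  static JobInfo pickNext(List<JobInfo> jobs) {
    JobInfo best = jobs.get(0);
    for (JobInfo job : jobs) {
      if (job.deficit() > best.deficit()) {
        best = job;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<JobInfo> jobs = new ArrayList<JobInfo>();
    jobs.add(new JobInfo("etl-job"));
    jobs.add(new JobInfo("adhoc-query"));

    JobInfo next = jobs.get(0);
    for (int step = 0; step < 4; step++) {
      tick(jobs, next, 10.0);                 // the chosen job runs for 10 seconds
      next = pickNext(jobs);
      System.out.println("schedule next: " + next.name);
    }
  }
}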



Capacity Scheduling in Hadoop


Capacity Scheduler
·        The capacity scheduler shares some of the principles of the fair scheduler but has distinct differences, too.

·        First, capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications. For this reason, capacity scheduling provides greater control as well as the ability to provide a minimum capacity guarantee and share excess capacity among users. The capacity scheduler was developed by Yahoo!.

·        In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity).


·        Queues are monitored; if a queue is not consuming its allocated capacity, this excess capacity can be temporarily allocated to other queues. Given that queues can represent a person or larger organization, any available capacity is redistributed for use by other users.

·        Another difference from fair scheduling is the ability to prioritize jobs within a queue. Generally, jobs with a higher priority have access to resources sooner than lower-priority jobs. The Hadoop road map includes a desire to support pre-emption (where a low-priority job could be temporarily swapped out to allow a higher-priority job to execute), but this functionality has not yet been implemented.


·        Another difference is the presence of strict access controls on queues (given that queues are tied to a person or organization). These access controls are defined on a per-queue basis. They restrict the ability to submit jobs to queues and the ability to view and modify jobs in queues.
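To make the queue idea concrete, a job is pointed at a queue from the client side when it is submitted. This is only a rough sketch using the classic (MR1) JobConf API, where the queue is chosen via setQueueName (the mapred.job.queue.name property); the queue name "analytics" and the paths are made up.

// Sketch: submitting a classic (MR1) MapReduce job to a capacity-scheduler queue.
// The queue name "analytics" and the I/O paths are placeholders.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class QueueSubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(QueueSubmitExample.class);
    conf.setJobName("report-aggregation");

    // Submit into the "analytics" queue; the queue's access controls decide
    // whether this user may actually submit here.
    conf.setQueueName("analytics");          // same as setting mapred.job.queue.name

    FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

    // With no mapper/reducer set, the identity classes run; the point here is
    // only to show where the queue is chosen.
    JobClient.runJob(conf);
  }
}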