Tuesday, 21 January 2014

Shuffle and Sort in MapReduce

Rack Awareness in Hadoop


Rack Awareness

·        For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.
·        For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.
·        HDFS can be made rack-aware by the use of a script which allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used, the default implementation allows you to provide an executable script which returns the "rack address" of each of a list of IP addresses.
·        The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.
·        To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to this script as several separate command-line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key. (A minimal example script is sketched after this list.)
·        Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.
·        Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).
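
To make the script idea concrete, here is a minimal sketch of a topology script. The IP ranges and rack names below are made up; map them to your own network. Hadoop calls the script with one or more node addresses and expects one rack id per address, printed in the same order.

#!/bin/bash
# Minimal topology script sketch: print one rack id per address passed in.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/switch1/rack1" ;;
    10.1.2.*) echo "/switch1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done

Make the file executable and point topology.script.file.name at it, as described above.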


Fair Scheduler in Hadoop


Fair Scheduler
·       The core idea behind the fair share scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources. The result is that jobs that require less time are able to access the CPU and finish intermixed with the execution of jobs that require more time to execute. This behaviour allows for some interactivity among Hadoop jobs and permits greater responsiveness of the Hadoop cluster to the variety of job types submitted. The fair scheduler was developed by Facebook.

·       The Hadoop implementation creates a set of pools into which jobs are placed for selection by the scheduler. Each pool can be assigned a number of shares to balance resources across the jobs in the pools (more shares means a greater portion of the resources on which jobs are executed). By default, all pools have equal shares, but they can be configured to provide more or fewer shares depending upon the job type. The number of jobs active at one time can also be constrained, if desired, to minimize congestion and allow work to finish in a timely manner. (A configuration sketch follows this list.)

·       To ensure fairness, each user is assigned to a pool. In this way, if one user submits many jobs, he or she can receive the same share of cluster resources as all other users (independent of the work they have submitted). Regardless of the shares assigned to pools, if the system is not loaded, jobs receive the shares that would otherwise go unused (split among the available jobs).





·       The scheduler implementation keeps track of the compute time for each job in the system. Periodically, the scheduler inspects jobs to compute the difference between the compute time each job actually received and the time it should have received in an ideal scheduler. The result determines the deficit for the job; the scheduler then ensures that a task from the job with the highest deficit is scheduled next.
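
To make this concrete, here is a rough sketch of how the fair scheduler was configured on 1.x-era clusters. The property names are from memory and the pool names and limits are invented, so treat it as an outline rather than a recipe.

# Enable the fair scheduler in conf/mapred-site.xml (1.x-era property names):
#   mapred.jobtracker.taskScheduler      -> org.apache.hadoop.mapred.FairScheduler
#   mapred.fairscheduler.allocation.file -> /path/to/conf/fair-scheduler.xml
# Then describe the pools in the allocation file (names and values are examples only):
cat > conf/fair-scheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="reporting">
    <weight>2.0</weight>                  <!-- twice the share of a default pool -->
    <maxRunningJobs>10</maxRunningJobs>   <!-- cap on concurrently running jobs -->
  </pool>
  <pool name="adhoc">
    <weight>1.0</weight>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>
EOF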



Capacity Scheduling in Hadoop


Capacity Scheduler
·        The capacity scheduler shares some of the principles of the fair scheduler but has distinct differences, too.

·        First, capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications. For this reason, capacity scheduling provides greater control as well as the ability to provide a minimum capacity guarantee and share excess capacity among users. The capacity scheduler was developed by Yahoo!.

·        In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity); see the configuration sketch after this list.


·        Queues are monitored; if a queue is not consuming its allocated capacity, this excess capacity can be temporarily allocated to other queues. Given that queues can represent a person or larger organization, any available capacity is redistributed for use by other users.

·        Another difference from fair scheduling is the ability to prioritize jobs within a queue. Generally, jobs with a higher priority have access to resources sooner than lower-priority jobs. The Hadoop road map includes a desire to support pre-emption (where a low-priority job could be temporarily swapped out to allow a higher-priority job to execute), but this functionality has not yet been implemented.


·        Another difference is the presence of strict access controls on queues (given that queues are tied to a person or organization). These access controls are defined on a per-queue basis. They restrict the ability to submit jobs to queues and the ability to view and modify jobs in queues.
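
Again as a rough sketch only (1.x-era property names from memory; the queue names and percentages are invented), the capacity scheduler is switched on in mapred-site.xml and each queue's guaranteed share is set in capacity-scheduler.xml:

# Enable the capacity scheduler in conf/mapred-site.xml:
#   mapred.jobtracker.taskScheduler -> org.apache.hadoop.mapred.CapacityTaskScheduler
#   mapred.queue.names              -> research,marketing
# Per-queue guaranteed capacities (percent of cluster slots) go in conf/capacity-scheduler.xml:
cat > conf/capacity-scheduler.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.capacity-scheduler.queue.research.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.marketing.capacity</name>
    <value>40</value>
  </property>
</configuration>
EOF
# Jobs are then submitted to a named queue (assuming the driver uses ToolRunner):
hadoop jar myjob.jar MyDriver -Dmapred.job.queue.name=research input output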




How Does a Single File Get Stored in HDFS?

Here is a neat diagram showing how a file is split and stored in HDFS.
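
You can also see the split from the command line. A quick sketch (the file name and sizes are hypothetical): with a 64 MB block size, a roughly 300 MB file is cut into five blocks, and each block is replicated three times by default.

hadoop dfs -put bigfile.log /user/hduser1/bigfile.log
# List each block of the file and the DataNodes that hold its replicas:
hadoop fsck /user/hduser1/bigfile.log -files -blocks -locations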


MapReduce Architecture !

Hello,

I feel that, rather than reading books, images speak louder!
Just have a complete look, and then study from the Hadoop Definitive Guide.


Hadoop Basic Commands

Hadoop Commands
1. cat: hadoop dfs -cat <path>
   Prints the file contents.
2. chgrp: hadoop dfs -chgrp [-R] GROUP URI [URI …]
   Changes the group association of files. With -R, makes the change recursively through the directory structure. The user must be the owner of the files, or else a super-user.
3. chmod: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI
   Changes the permissions of files. With -R, makes the change recursively. The user must be the owner of the file, or else a super-user.
4. chown: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]
   Changes the owner (and optionally the group) of files. With -R, makes the change recursively. The user must be a super-user.
5. copyFromLocal: hadoop dfs -copyFromLocal <localsrc> URI
   Similar to put, except that the source is restricted to a local file reference.
6. copyToLocal: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
   Similar to get, except that the destination is restricted to a local file reference.
7. count: hadoop dfs -count [-q] <paths>
   Counts the number of directories, files and bytes under the given paths. With -q, quotas are also reported.
8. cp: hadoop dfs -cp URI [URI …] <dest>
   Copies files from source to destination. When multiple sources are given, the destination must be a directory.
9. du: hadoop dfs -du URI [URI …]
   Gets the size of each file in the directory.
10. dus: hadoop dfs -dus URI [URI …]
    Gets the total (summary) size of the path.
11. expunge: hadoop dfs -expunge
    Empties the trash.
12. get: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
    Copies files from HDFS to the local file system.
13. getmerge: hadoop dfs -getmerge <src> <localdst> [addnl]
    Takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally, addnl can be set to add a newline character at the end of each file.
14. ls: hadoop dfs -ls <path>
    Lists the files and directories under the path.
15. lsr: hadoop dfs -lsr <args>
    Recursive version of ls. Similar to Unix ls -R.
16. mkdir: hadoop dfs -mkdir <paths>
    Creates directories, including any missing parent directories (like Unix mkdir -p).
17. moveFromLocal: hadoop dfs -moveFromLocal <localsrc> <dst>
    Similar to put, except that the local source is deleted after it is copied.
18. moveToLocal: hadoop dfs -moveToLocal [-crc] <src> <dst>
    Intended to move files to the local file system (older releases simply print a "Not implemented yet" message).
19. mv: hadoop dfs -mv URI [URI …] <dest>
    Moves files within HDFS. When multiple sources are given, the destination must be a directory.
20. put: hadoop dfs -put <localsrc> ... <dst>
    Copies files from the local file system into HDFS.
21. rm: hadoop dfs -rm [-skipTrash] URI [URI …]
    Deletes the specified files. Use rmr for recursive deletes.
22. rmr: hadoop dfs -rmr [-skipTrash] URI [URI …]
    Recursive version of rm; deletes directories and their contents.
23. setrep: hadoop dfs -setrep [-R] [-w] <rep> <path>
    Changes the replication factor of a file. The -R option recursively changes the replication factor of the files within a directory.
24. stat: hadoop dfs -stat URI [URI …]
    Returns stat information (such as the modification time) on the path.
25. tail: hadoop dfs -tail [-f] URI
    Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
26. test: hadoop dfs -test -[ezd] URI
    -e checks whether the file exists, -z checks whether the file is zero length, -d checks whether the path is a directory. Returns 0 if true.
27. text: hadoop dfs -text <src>
    Takes a source file and outputs the file in text format.
28. touchz: hadoop dfs -touchz URI [URI …]
    Creates a file of zero length.
29. jar: hadoop jar <jar> [mainClass] [args]
    Runs the MapReduce job packaged in the given JAR file.
30. fsck: hadoop fsck <path> [options]
    HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks.
31. job: hadoop job -list
    Lists the MapReduce jobs which are not yet completed.
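
Putting a few of these commands together, here is a small made-up session (the paths and file names are just examples):

hadoop dfs -mkdir /user/hduser1/demo
hadoop dfs -copyFromLocal access.log /user/hduser1/demo/
hadoop dfs -ls /user/hduser1/demo
hadoop dfs -du /user/hduser1/demo
hadoop dfs -setrep -w 2 /user/hduser1/demo/access.log
hadoop dfs -cat /user/hduser1/demo/access.log
hadoop dfs -rmr /user/hduser1/demo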


Basic Linux Commands

Commands in Linux

ls: lists the items in a directory
ll: lists the items vertically (long listing, one per line)
cat: reads and prints the file mentioned in the command
cd ..: moves one directory back (up)
diff: prints the line differences between files, not the entire contents
grep: grep 'abc' /chitrank/txt searches for the text abc in the txt file
grep -v: grep -v 'abc' /chitrank/txt is the NOT condition of the above grep
grep -i: ignores case
grep -w: searches for the given whole word
cat /chitrank/input/* | grep -i 'a': grep combined with a pipe
cat /chitrank/input/* | grep -l 'a': lists the files containing the given word
grep --color 'a': shows the matches in colour
mkdir: makes a directory
cd /folder/file: moves to that directory
awk: finds or replaces text (pattern scanning and processing)
chmod: changes the access permissions of files or directories // full access: chmod ug+rwx file.txt // revoke: chmod g-rwx file.txt // recursive: chmod -R ug+rwx file.txt
chown: changes user and group ownership
adduser: adds a new user
echo: displays a message on the console
cmp: compares two files
mv: moves a file
passwd: modifies a user password
pwd: prints the working directory
tar: creates or extracts tar archives
useradd: adds a new user account
nano: opens an editor to make and edit files
sort: sorts the lines of a file
unzip: unzips a file
top: shows the running programs and their resource usage
kill: kills a task (process)
man: displays the manual page of a command
tail: prints the last lines of a file
less: views output one page at a time
su: switches to another user; root can switch without a password
yum: installs packages (for example Apache) using yum
rpm: installs packages (for example Apache) using rpm
date: prints the date
get: gets a file from another location
put: puts a file to another location
find: finds files by file name
ssh: remote access to another machine
cut: cuts some part (fields or columns) out of lines of a file
tr: replaces a pattern and changes the file output format // tr -s '|' ',' < Ret_MediaTP | cut -f1-28 -d, > MediaTp__Ret.txt

To find file names:
find . -type f -printf "%f\n" > /home/hduser1/chitrank/freshseq
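
Several of these commands are usually combined with pipes. A small sketch reusing the file names from the examples above:

# Count the lines that mention 'error', ignoring case:
cat /chitrank/input/* | grep -i 'error' | wc -l
# Take the second |-delimited field and keep the unique sorted values:
cut -d'|' -f2 Ret_MediaTP | sort -u > fields.txt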