Tuesday, 21 January 2014

Shuffle and Sort in MapReduce

Rack Awareness in Hadoop


Rack Awareness

·        For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.
·        For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.
·        HDFS can be made rack-aware by the use of a script which allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used, the default implementation allows you to provide an executable script which returns the "rack address" of each of a list of IP addresses.
·        The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.
·        To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to this script as several separate command-line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key. (A minimal example script is sketched after this list.)
·        Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.
·        Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).
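
To make the script idea concrete, here is a minimal sketch of a topology script. The IP ranges and rack names below are made up; map them to your own network. Hadoop calls the script with one or more node addresses and expects one rack id per address, printed in the same order.

#!/bin/bash
# Minimal topology script sketch: print one rack id per address passed in.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/switch1/rack1" ;;
    10.1.2.*) echo "/switch1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done

Make the file executable and point topology.script.file.name at it, as described above.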


Fair Scheduler in Hadoop


Fair Scheduler
·       The core idea behind the fair share scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources. The result is that jobs that require less time are able to access the CPU and finish intermixed with the execution of jobs that require more time to execute. This behaviour allows for some interactivity among Hadoop jobs and permits greater responsiveness of the Hadoop cluster to the variety of job types submitted. The fair scheduler was developed by Facebook.

·       The Hadoop implementation creates a set of pools into which jobs are placed for selection by the scheduler. Each pool can be assigned a number of shares to balance resources across the jobs in the pools (more shares means a greater portion of the resources on which jobs are executed). By default, all pools have equal shares, but they can be configured to provide more or fewer shares depending upon the job type. The number of jobs active at one time can also be constrained, if desired, to minimize congestion and allow work to finish in a timely manner. (A configuration sketch follows this list.)

·       To ensure fairness, each user is assigned to a pool. In this way, if one user submits many jobs, he or she can receive the same share of cluster resources as all other users (independent of the work they have submitted). Regardless of the shares assigned to pools, if the system is not loaded, jobs receive the shares that would otherwise go unused (split among the available jobs).





·       The scheduler implementation keeps track of the compute time for each job in the system. Periodically, the scheduler inspects jobs to compute the difference between the compute time each job actually received and the time it should have received in an ideal scheduler. The result determines the deficit for the job; the scheduler then ensures that a task from the job with the highest deficit is scheduled next.
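
To make this concrete, here is a rough sketch of how the fair scheduler was configured on 1.x-era clusters. The property names are from memory and the pool names and limits are invented, so treat it as an outline rather than a recipe.

# Enable the fair scheduler in conf/mapred-site.xml (1.x-era property names):
#   mapred.jobtracker.taskScheduler      -> org.apache.hadoop.mapred.FairScheduler
#   mapred.fairscheduler.allocation.file -> /path/to/conf/fair-scheduler.xml
# Then describe the pools in the allocation file (names and values are examples only):
cat > conf/fair-scheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="reporting">
    <weight>2.0</weight>                  <!-- twice the share of a default pool -->
    <maxRunningJobs>10</maxRunningJobs>   <!-- cap on concurrently running jobs -->
  </pool>
  <pool name="adhoc">
    <weight>1.0</weight>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>
EOF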



Capacity Scheduling in Hadoop


Capacity Scheduler
·        The capacity scheduler shares some of the principles of the fair scheduler but has distinct differences, too.

·        First, capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications. For this reason, capacity scheduling provides greater control as well as the ability to provide a minimum capacity guarantee and share excess capacity among users. The capacity scheduler was developed by Yahoo!.

·        In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity); see the configuration sketch after this list.


·        Queues are monitored; if a queue is not consuming its allocated capacity, this excess capacity can be temporarily allocated to other queues. Given that queues can represent a person or larger organization, any available capacity is redistributed for use by other users.

·        Another difference from fair scheduling is the ability to prioritize jobs within a queue. Generally, jobs with a higher priority have access to resources sooner than lower-priority jobs. The Hadoop road map includes a desire to support pre-emption (where a low-priority job could be temporarily swapped out to allow a higher-priority job to execute), but this functionality has not yet been implemented.


·        Another difference is the presence of strict access controls on queues (given that queues are tied to a person or organization). These access controls are defined on a per-queue basis. They restrict the ability to submit jobs to queues and the ability to view and modify jobs in queues.
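
Again as a rough sketch only (1.x-era property names from memory; the queue names and percentages are invented), the capacity scheduler is switched on in mapred-site.xml and each queue's guaranteed share is set in capacity-scheduler.xml:

# Enable the capacity scheduler in conf/mapred-site.xml:
#   mapred.jobtracker.taskScheduler -> org.apache.hadoop.mapred.CapacityTaskScheduler
#   mapred.queue.names              -> research,marketing
# Per-queue guaranteed capacities (percent of cluster slots) go in conf/capacity-scheduler.xml:
cat > conf/capacity-scheduler.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.capacity-scheduler.queue.research.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.marketing.capacity</name>
    <value>40</value>
  </property>
</configuration>
EOF
# Jobs are then submitted to a named queue (assuming the driver uses ToolRunner):
hadoop jar myjob.jar MyDriver -Dmapred.job.queue.name=research input output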




How Does a Single File Get Stored in HDFS?

Here is a neat diagram showing how a file is split and stored in HDFS.
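
You can also see the split from the command line. A quick sketch (the file name and sizes are hypothetical): with a 64 MB block size, a roughly 300 MB file is cut into five blocks, and each block is replicated three times by default.

hadoop dfs -put bigfile.log /user/hduser1/bigfile.log
# List each block of the file and the DataNodes that hold its replicas:
hadoop fsck /user/hduser1/bigfile.log -files -blocks -locations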


MapReduce Architecture !

Hello,

I feel that, rather than reading books, images speak louder!
Just have a complete look, and then study from the Hadoop Definitive Guide.


Hadoop Basic Commands

Hadoop Commands
1. cat: hadoop dfs -cat <path>
   Prints the file contents.
2. chgrp: hadoop dfs -chgrp [-R] GROUP URI [URI …]
   Changes the group association of files. With -R, makes the change recursively through the directory structure. The user must be the owner of the files, or else a super-user.
3. chmod: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI
   Changes the permissions of files. With -R, makes the change recursively. The user must be the owner of the file, or else a super-user.
4. chown: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]
   Changes the owner (and optionally the group) of files. With -R, makes the change recursively. The user must be a super-user.
5. copyFromLocal: hadoop dfs -copyFromLocal <localsrc> URI
   Similar to put, except that the source is restricted to a local file reference.
6. copyToLocal: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
   Similar to get, except that the destination is restricted to a local file reference.
7. count: hadoop dfs -count [-q] <paths>
   Counts the number of directories, files and bytes under the given paths. With -q, quotas are also reported.
8. cp: hadoop dfs -cp URI [URI …] <dest>
   Copies files from source to destination. When multiple sources are given, the destination must be a directory.
9. du: hadoop dfs -du URI [URI …]
   Gets the size of each file in the directory.
10. dus: hadoop dfs -dus URI [URI …]
    Gets the total (summary) size of the path.
11. expunge: hadoop dfs -expunge
    Empties the trash.
12. get: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
    Copies files from HDFS to the local file system.
13. getmerge: hadoop dfs -getmerge <src> <localdst> [addnl]
    Takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally, addnl can be set to add a newline character at the end of each file.
14. ls: hadoop dfs -ls <path>
    Lists the files and directories under the path.
15. lsr: hadoop dfs -lsr <args>
    Recursive version of ls. Similar to Unix ls -R.
16. mkdir: hadoop dfs -mkdir <paths>
    Creates directories, including any missing parent directories (like Unix mkdir -p).
17. moveFromLocal: hadoop dfs -moveFromLocal <localsrc> <dst>
    Similar to put, except that the local source is deleted after it is copied.
18. moveToLocal: hadoop dfs -moveToLocal [-crc] <src> <dst>
    Intended to move files to the local file system (older releases simply print a "Not implemented yet" message).
19. mv: hadoop dfs -mv URI [URI …] <dest>
    Moves files within HDFS. When multiple sources are given, the destination must be a directory.
20. put: hadoop dfs -put <localsrc> ... <dst>
    Copies files from the local file system into HDFS.
21. rm: hadoop dfs -rm [-skipTrash] URI [URI …]
    Deletes the specified files. Use rmr for recursive deletes.
22. rmr: hadoop dfs -rmr [-skipTrash] URI [URI …]
    Recursive version of rm; deletes directories and their contents.
23. setrep: hadoop dfs -setrep [-R] [-w] <rep> <path>
    Changes the replication factor of a file. The -R option recursively changes the replication factor of the files within a directory.
24. stat: hadoop dfs -stat URI [URI …]
    Returns stat information (such as the modification time) on the path.
25. tail: hadoop dfs -tail [-f] URI
    Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
26. test: hadoop dfs -test -[ezd] URI
    -e checks whether the file exists, -z checks whether the file is zero length, -d checks whether the path is a directory. Returns 0 if true.
27. text: hadoop dfs -text <src>
    Takes a source file and outputs the file in text format.
28. touchz: hadoop dfs -touchz URI [URI …]
    Creates a file of zero length.
29. jar: hadoop jar <jar> [mainClass] [args]
    Runs the MapReduce job packaged in the given JAR file.
30. fsck: hadoop fsck <path> [options]
    HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks.
31. job: hadoop job -list
    Lists the MapReduce jobs which are not yet completed.
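
Putting a few of these commands together, here is a small made-up session (the paths and file names are just examples):

hadoop dfs -mkdir /user/hduser1/demo
hadoop dfs -copyFromLocal access.log /user/hduser1/demo/
hadoop dfs -ls /user/hduser1/demo
hadoop dfs -du /user/hduser1/demo
hadoop dfs -setrep -w 2 /user/hduser1/demo/access.log
hadoop dfs -cat /user/hduser1/demo/access.log
hadoop dfs -rmr /user/hduser1/demo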


Basic Linux Commands

Commands in Linux

ls: lists the items in a directory
ll: lists the items vertically (long listing, one per line)
cat: reads and prints the file mentioned in the command
cd ..: moves one directory back (up)
diff: prints the line differences between files, not the entire contents
grep: grep 'abc' /chitrank/txt searches for the text abc in the txt file
grep -v: grep -v 'abc' /chitrank/txt is the NOT condition of the above grep
grep -i: ignores case
grep -w: searches for the given whole word
cat /chitrank/input/* | grep -i 'a': grep combined with a pipe
cat /chitrank/input/* | grep -l 'a': lists the files containing the given word
grep --color 'a': shows the matches in colour
mkdir: makes a directory
cd /folder/file: moves to that directory
awk: finds or replaces text (pattern scanning and processing)
chmod: changes the access permissions of files or directories // full access: chmod ug+rwx file.txt // revoke: chmod g-rwx file.txt // recursive: chmod -R ug+rwx file.txt
chown: changes user and group ownership
adduser: adds a new user
echo: displays a message on the console
cmp: compares two files
mv: moves a file
passwd: modifies a user password
pwd: prints the working directory
tar: creates or extracts tar archives
useradd: adds a new user account
nano: opens an editor to make and edit files
sort: sorts the lines of a file
unzip: unzips a file
top: shows the running programs and their resource usage
kill: kills a task (process)
man: displays the manual page of a command
tail: prints the last lines of a file
less: views output one page at a time
su: switches to another user; root can switch without a password
yum: installs packages (for example Apache) using yum
rpm: installs packages (for example Apache) using rpm
date: prints the date
get: gets a file from another location
put: puts a file to another location
find: finds files by file name
ssh: remote access to another machine
cut: cuts some part (fields or columns) out of lines of a file
tr: replaces a pattern and changes the file output format // tr -s '|' ',' < Ret_MediaTP | cut -f1-28 -d, > MediaTp__Ret.txt

To find file names:
find . -type f -printf "%f\n" > /home/hduser1/chitrank/freshseq
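
Several of these commands are usually combined with pipes. A small sketch reusing the file names from the examples above:

# Count the lines that mention 'error', ignoring case:
cat /chitrank/input/* | grep -i 'error' | wc -l
# Take the second |-delimited field and keep the unique sorted values:
cut -d'|' -f2 Ret_MediaTP | sort -u > fields.txt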