Tuesday, 20 May 2014

My Experience of Preparing and Clearing Cloudera Hadoop Developer Certification CCD-410

I feel ecstatic to have cleared this certification with a decent score on the first attempt. It brought back those days of studying hard for our final exams in college...

My preparation started when I joined a new company and landed in a Hadoop project. I had zero idea about Hadoop and was worried about how I could excel in this field.
As the project demanded Hadoop MapReduce, I first started developing basic MR code.
Having basic experience in Java helped; slowly and steadily, by solving assignments and reading about the Hadoop ecosystem, I started liking the subject.
Whatever I read or understood, I would draw the concepts on paper and publish them on my blog.
Blogging gave me a great boost, and I developed more interest in learning by writing and reading blogs.
After a good six months of learning, I decided to attempt CCD-410.

So, here are some much-needed tips:


  • Have basic experience in Java and OOP concepts; an ETL background is optional.
  • Study the Hadoop Definitive Guide completely. It is the bible of Hadoop.
  • Build a clear understanding of all Hadoop concepts, both practically and theoretically. Try to write as much MR code as you can, experimenting with complex key/value pairs (see the sketch after this list).
  • Yahoo MR Tutorials: good for beginners.
  • The Hadoop ecosystem is huge. Try to get a glimpse of sub-projects like HBase, Hive, Pig, Flume, Oozie and Sqoop: study what they do and where they can be useful.
  • Try to blog! Blogs are very important. Read as many Hadoop blogs as you can; best of all, create your own blog!
  • If you are completely new to Hadoop and need some training, many institutes provide it; I feel Nixus Technologies is coming up well.
  • The exam follows the pattern discussed in the Study Guide. Read the guide and understand which modules carry the most weight.
  • If you feel weak in MR coding, you can download the Hadoop Developer Simulator and start solving its questions. It is very good software, and the exam questions are quite similar to the ones in it.
  • Believe and trust in yourself and crack it! All the very best!
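
To give you a feel for basic MR code, here is a minimal WordCount sketch against Hadoop's org.apache.hadoop.mapreduce API. It is only an illustration (the class name and the input/output paths are whatever you choose), not something taken from the exam material:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every line of input, emit (word, 1) for each token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package it into a jar and run it with hadoop jar, passing an input and an output directory.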



Wednesday, 2 April 2014

Hadoop Distributed File System Architecture





What is HDFS?

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster has a single namenode together with a cluster of datanodes that form the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to talk to the namenode.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the target goals for a Hadoop application. The trade-off of not having a fully POSIX-compliant file-system is increased performance for data throughput and support for non-POSIX operations such as Append.
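
To make this concrete, here is a small sketch (my own illustration, not part of the description above) of writing a file into HDFS and reading it back with the Java FileSystem API; the namenode URI and the path are made-up examples, so substitute your own:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Made-up namenode address; use the fs.default.name of your own cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/sample.txt");   // illustrative path

        // Write a small file; HDFS splits it into blocks and replicates them
        // (three copies by default) across datanodes.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello hdfs\n");
        out.close();

        // The replication factor can also be changed per file after the fact.
        fs.setReplication(file, (short) 2);

        // Read the file back.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}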

The HDFS file system includes a so-called secondary namenode, which misleads some people into thinking that when the primary namenode goes offline, the secondary namenode takes over. In fact, the secondary namenode regularly connects to the primary namenode and builds snapshots of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files. HDFS Federation, a newer addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes.

An advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location.





Limitations:
HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write-operations.
Another limitation of HDFS is that it cannot be mounted directly by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient.

Friday, 28 February 2014

Tuesday, 21 January 2014

Shuffle and Sort in MapReduce

Rack Awareness in Hadoop


Rack Awareness

·        For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.
·        For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.
·        HDFS can be made rack-aware by the use of a script which allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used, the default implementation allows you to provide an executable script which returns the "rack address" of each of a list of IP addresses.
·        The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.
·        To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to the script as several separate command-line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key (a small sketch of such a rack-mapping program follows this list).
·        Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.
·        Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).
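
To make the script idea concrete, here is a hedged sketch of a tiny rack-mapping program in Java; the mechanism above only requires an executable script or program that prints one rack id per input address, and the IP-to-rack table here is invented purely for illustration. In practice you would generate the mapping from your own cluster layout and point topology.script.file.name at a small wrapper that invokes this class:

import java.util.HashMap;
import java.util.Map;

// Receives one or more IP addresses as command-line arguments and prints one
// rack id per address, in the same order, falling back to /default-rack for
// hosts it does not know about.
public class RackMapper {
    public static void main(String[] args) {
        Map<String, String> ipToRack = new HashMap<String, String>();
        // Invented mapping for illustration; a real one would come from your
        // cluster's network layout (for example, read from a file).
        ipToRack.put("10.1.1.11", "/dc1/rack1");
        ipToRack.put("10.1.1.12", "/dc1/rack1");
        ipToRack.put("10.1.2.21", "/dc1/rack2");

        for (String ip : args) {
            String rack = ipToRack.get(ip);
            System.out.println(rack != null ? rack : "/default-rack");
        }
    }
}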


Fair Scheduler in Hadoop


Fair Scheduler
·       The core idea behind the fair share scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources. The result is that jobs that require less time are able to access the CPU and finish intermixed with the execution of jobs that require more time to execute. This behaviour allows for some interactivity among Hadoop jobs and permits greater responsiveness of the Hadoop cluster to the variety of job types submitted. The fair scheduler was developed by Facebook.

·       The Hadoop implementation creates a set of pools into which jobs are placed for selection by the scheduler.
Each pool can be assigned a set of shares to balance resources across jobs in pools (more shares equals greater resources from which jobs are executed). By default, all pools have equal shares, but configuration is possible to provide more or fewer shares depending upon the job type.
The number of jobs active at one time can also be constrained, if desired, to minimize congestion and allow work to finish in a timely manner.

·       To ensure fairness, each user is assigned to a pool. In this way, if one user submits many jobs, he or she can receive the same share of cluster resources as all other users (independent of the work they have submitted). Regardless of the shares assigned to pools, if the system is not loaded, jobs receive the shares that would otherwise go unused (split among the available jobs).





·       The scheduler implementation keeps track of the compute time for each job in the system. Periodically, the scheduler inspects jobs to compute the difference between the compute time the job received and the time it should have received in an ideal scheduler. The result is that job's deficit. The scheduler then ensures that the job with the highest deficit is scheduled next.
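
As a toy illustration of this deficit idea (a simplified sketch with made-up job names and numbers, not the real Hadoop FairScheduler code), the snippet below compares the compute time each job has received against an equal share of the elapsed time and picks the job with the largest deficit:

import java.util.ArrayList;
import java.util.List;

public class FairShareToy {

    static class JobInfo {
        final String name;
        long computeTimeReceived; // ms of compute time this job has been given so far

        JobInfo(String name) {
            this.name = name;
        }
    }

    // With equal shares, each job "deserves" elapsedTime / number of jobs.
    // The deficit is what a job deserved minus what it actually received;
    // the job with the largest deficit goes next.
    static JobInfo pickNext(List<JobInfo> jobs, long elapsedTime) {
        long fairShare = elapsedTime / jobs.size();
        JobInfo next = null;
        long worstDeficit = Long.MIN_VALUE;
        for (JobInfo job : jobs) {
            long deficit = fairShare - job.computeTimeReceived;
            if (deficit > worstDeficit) {
                worstDeficit = deficit;
                next = job;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        List<JobInfo> jobs = new ArrayList<JobInfo>();
        JobInfo longJob = new JobInfo("bigReport");
        JobInfo shortJob = new JobInfo("adhocQuery");
        longJob.computeTimeReceived = 9000;  // has already had 9 s of compute
        shortJob.computeTimeReceived = 1000; // has only had 1 s
        jobs.add(longJob);
        jobs.add(shortJob);

        // After 10 s of elapsed time the fair share is 5 s each, so the ad-hoc
        // query (deficit of 4 s) is picked next.
        System.out.println("Next job: " + pickNext(jobs, 10000).name);
    }
}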



Capacity Scheduling in Hadoop


Capacity Scheduler
·        The capacity scheduler shares some of the principles of the fair scheduler but has distinct differences, too.

·        First, capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications. For this reason, capacity scheduling provides greater control as well as the ability to provide a minimum capacity guarantee and share excess capacity among users. The capacity scheduler was developed by Yahoo!.

·        In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity).


·        Queues are monitored; if a queue is not consuming its allocated capacity, this excess capacity can be temporarily allocated to other queues. Given that queues can represent a person or larger organization, any available capacity is redistributed for use by other users.

·        Another difference from fair scheduling is the ability to prioritize jobs within a queue. Generally, jobs with a higher priority have access to resources sooner than lower-priority jobs. The Hadoop road map includes a desire to support pre-emption (where a low-priority job could be temporarily swapped out to allow a higher-priority job to execute), but this functionality has not yet been implemented.


·        Another difference is the presence of strict access controls on queues (given that queues are tied to a person or organization). These access controls are defined on a per-queue basis. They restrict the ability to submit jobs to queues and the ability to view and modify jobs in queues.
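
To picture how one queue can temporarily borrow capacity that another queue is not using, here is a toy sketch; it is a simplified mental model with made-up queue names and numbers, not the actual CapacityScheduler implementation:

import java.util.LinkedHashMap;
import java.util.Map;

public class CapacityToy {

    // guaranteed: slots promised to each queue (they sum to the cluster's slots)
    // demand:     slots each queue currently wants
    static Map<String, Integer> assignSlots(Map<String, Integer> guaranteed,
                                            Map<String, Integer> demand) {
        Map<String, Integer> granted = new LinkedHashMap<String, Integer>();
        int spare = 0;

        // First pass: each queue gets at most its guarantee; unused capacity is pooled.
        for (Map.Entry<String, Integer> e : guaranteed.entrySet()) {
            int used = Math.min(e.getValue(), demand.get(e.getKey()));
            granted.put(e.getKey(), used);
            spare += e.getValue() - used;
        }

        // Second pass: lend the pooled spare slots to queues that still want more.
        for (Map.Entry<String, Integer> e : granted.entrySet()) {
            int extraWanted = demand.get(e.getKey()) - e.getValue();
            int extra = Math.min(extraWanted, spare);
            if (extra > 0) {
                e.setValue(e.getValue() + extra);
                spare -= extra;
            }
        }
        return granted;
    }

    public static void main(String[] args) {
        Map<String, Integer> guaranteed = new LinkedHashMap<String, Integer>();
        guaranteed.put("analytics", 60);
        guaranteed.put("adhoc", 40);

        Map<String, Integer> demand = new LinkedHashMap<String, Integer>();
        demand.put("analytics", 90); // wants more than its guarantee
        demand.put("adhoc", 10);     // leaves 30 of its guaranteed slots idle

        // analytics temporarily borrows the 30 idle slots: {analytics=90, adhoc=10}
        System.out.println(assignSlots(guaranteed, demand));
    }
}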




How does a Single file get stored in HDFS

Here is a neat diagram showing how a file is split and stored in HDFS.


MapReduce Architecture !

Hello,

Rather than reading pages of text, I feel images speak louder!
Just have a good look at the diagram and then study from the Hadoop Definitive Guide.


Hadoop Basic Commands

Hadoop Commands

1. cat: hadoop dfs -cat <path>
   Prints the file contents to stdout.
2. chgrp: hadoop dfs -chgrp [-R] GROUP URI [URI …]
   Changes the group association of files. With -R, makes the change recursively through the directory structure. The user must be the owner of the files, or else a super-user.
3. chmod: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI
   Changes the permissions of files.
4. chown: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]
   Changes the owner of files.
5. copyFromLocal: hadoop dfs -copyFromLocal <localsrc> URI
   Similar to put, but the source is restricted to the local file system.
6. copyToLocal: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
   Similar to get, but the destination is restricted to the local file system.
7. count: hadoop dfs -count [-q] <paths>
   Counts the directories, files and bytes under the given paths.
8. cp: hadoop dfs -cp URI [URI …] <dest>
   Copies files from source to destination.
9. du: hadoop dfs -du URI [URI …]
   Displays the size of each file in the directory.
10. dus: hadoop dfs -dus URI [URI …]
    Displays the total (summary) size of the files.
11. expunge: hadoop dfs -expunge
    Empties the trash.
12. get: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
    Copies files to the local file system.
13. getmerge: hadoop dfs -getmerge <src> <localdst> [addnl]
    Takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally, addnl can be set to add a newline character at the end of each file.
14. ls: hadoop dfs -ls <path>
    Lists files and directories.
15. lsr: hadoop dfs -lsr <args>
    Recursive version of ls, similar to Unix ls -R.
16. mkdir: hadoop dfs -mkdir <paths>
    Creates directories.
17. moveFromLocal: hadoop dfs -moveFromLocal <localsrc> <dst>
    Similar to put, except the local source is deleted after it is copied.
18. moveToLocal: hadoop dfs -moveToLocal [-crc] <src> <localdst>
19. mv: hadoop dfs -mv URI [URI …] <dest>
    Moves files from source to destination.
20. put: hadoop dfs -put <localsrc> ... <dst>
    Copies files from the local file system into HDFS.
21. rm: hadoop dfs -rm [-skipTrash] URI [URI …]
    Deletes files.
22. rmr: hadoop dfs -rmr [-skipTrash] URI [URI …]
    Recursive version of rm.
23. setrep: hadoop dfs -setrep [-R] <path>
    Changes the replication factor of a file. The -R option recursively changes the replication factor of files within a directory.
24. stat: hadoop dfs -stat URI [URI …]
    Returns stat information on the path.
25. tail: hadoop dfs -tail [-f] URI
    Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
26. test: hadoop dfs -test -[ezd] URI
    -e checks whether the file exists, -z checks whether the file is zero length, -d checks whether the path is a directory; each returns 0 if true.
27. text: hadoop dfs -text <src>
    Takes a source file and outputs the file in text format.
28. touchz: hadoop dfs -touchz URI [URI …]
    Creates a file of zero length.
29. jar: hadoop jar
    Runs a JAR file.
30. fsck: hadoop fsck
    HDFS supports the fsck command to check for various inconsistencies; it is designed for reporting problems with files, for example missing blocks or under-replicated blocks.
31. job: hadoop job -list
    Lists the currently running MapReduce jobs.


Basic Linux Commands

Commands in Linux

ls: lists the contents of a directory.
ll: lists the items in long format, one per line (an alias for ls -l).
cat: prints the contents of the named file.
cd ..: moves one directory back (up to the parent directory).
diff: prints the line-by-line differences between files, not the entire contents.
grep: grep 'abc' /chitrank/txt searches for the text abc in the given files.
grep -v: grep -v 'abc' /chitrank/txt inverts the match (lines not containing abc).
grep -i: ignores case.
grep -w: matches the given whole word only.
cat /chitrank/input/* | grep -i 'a': grep combined with a pipe (case-insensitive search across files).
grep -l 'a' /chitrank/input/*: lists the files that contain the given pattern.
grep --color 'a': shows the matches in color.
mkdir: makes a directory.
cd /folder/file: moves to that directory.
awk: pattern scanning and text processing (find or transform text).
chmod: changes the access permissions of files or directories. Full access: chmod ug+rwx file.txt; revoke group access: chmod g-rwx file.txt; recursive: chmod -R ug+rwx file.txt.
chown: changes user and group ownership.
adduser: adds a new user.
echo: displays a message on the console.
cmp: compares two files.
mv: moves or renames a file.
passwd: modifies a user's password.
pwd: prints the working directory.
tar: archives files into, or extracts them from, a tarball.
useradd: adds a new user account.
nano: opens the nano editor to create or edit files.
sort: sorts the lines of a file.
unzip: unzips a file.
top: shows the running processes and their resource usage.
kill: kills a process.
man: displays the manual page for a command.
tail: prints the last lines of a file.
less: views a file one screen at a time.
su: switches to another user; root can switch without a password.
yum: installs packages (for example Apache) using yum.
rpm: installs packages (for example Apache) using rpm.
date: prints the current date and time.
get: gets a file from another location (within ftp/sftp).
put: puts a file to another location (within ftp/sftp).
find: finds files by name.
ssh: remote access (secure shell).
cut: cuts selected fields or columns from a file.
tr: translates or squeezes characters and changes the output format, e.g. tr -s '|' ',' < Ret_MediaTP | cut -f1-28 -d, > MediaTp__Ret.txt
To list only file names: find . -type f -printf "%f\n" > /home/hduser1/chitrank/freshseq