Hadoop :- Inside MapReduce (Job Run) :- Part – I

Hi

This is my 6th blog, and with this post we are starting a series on concepts related to MapReduce programming.

In this series, today we are going to discuss the job submission-to-completion process in MapReduce.

I will try not to go into a long, boring explanation, but keep things short and to the point.

Note :- This article covers classic MapReduce (MapReduce 1).

Let's start.

In classic MapReduce, a job run can be divided into four major parts at the highest level, which are :-

  • Job submission by the client
  • A jobtracker, which tracks the status and metadata of the job submitted by the client
  • A tasktracker, which actually executes the job in terms of tasks (explained later)
  • A distributed filesystem that we are all aware of (HDFS :- Hadoop Distributed File System), which is used to share job files during the run-to-completion process

Now let's go through each part of the job submission-to-completion cycle.

[Diagram :- the job submission and run cycle in classic MapReduce]

Let's explain the process further :- (the diagram is for reference only; I will not explain it step by step, but you can relate it to the explanations below)

1. Job Submission :- 

  • When the submit() method is called, an internal JobSubmitter instance is created and submitJobInternal() is called on it (a minimal driver sketch follows this list)
  • The job-submission process implemented by JobSubmitter does the following things :-
      a) Asks the jobtracker for a new job ID
      b) Checks whether the input paths given for the job exist; if not, it throws an error, otherwise it computes the input splits for the job
      c) Checks whether the output path already exists; if it does, it throws an error
      d) If all the above steps go OK, it copies the resources needed to run the job (the job JAR file, the configuration file, and the computed input splits) to the jobtracker's filesystem (HDFS)
      e) Finally, after completing the above steps, it tells the jobtracker that the job has been submitted successfully and is ready for execution
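
To make the submission step concrete, here is a minimal driver sketch using the standard Job API. The class name, job name, and command-line paths are made up for illustration, and with no mapper or reducer set Hadoop simply runs the default (identity) Mapper and Reducer; the point is only to show the path into submit().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal driver sketch (names and paths are hypothetical).
public class SubmissionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "submission-sketch");    // Job.getInstance(conf) in later releases
        job.setJarByClass(SubmissionSketch.class);       // this JAR is copied to HDFS at submission
        job.setOutputKeyClass(LongWritable.class);       // output types of the default (identity) mapper
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // must exist; used to compute input splits
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must NOT already exist, or submission fails
        // waitForCompletion() calls submit() internally, which creates the JobSubmitter and
        // calls submitJobInternal(); it then polls the jobtracker for progress until the job ends.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it (for example with hadoop jar sketch.jar SubmissionSketch <input> <output>) exercises exactly the steps listed above: a new job ID, split computation, the output-path check, and the copy of the JAR, configuration, and splits before the jobtracker is told the job is ready.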

2. Job Initialization :-

Now that we have successfully submitted the job, it is time for the parameters and components required for job execution to be initialized (this is the same idea we find in any programming activity, like variable initialization).

  • When the jobtracker receives the submitJob() call, it puts the job into an internal queue from where the job scheduler picks it up and initializes it
  • Initialization means creating an object to represent the job being run; it encapsulates the job's tasks and the bookkeeping information needed to keep track of each task's status
  • The job scheduler creates one map task for each input split (which it retrieves from HDFS, as explained earlier)
  • The number of reduce tasks is determined by the setNumReduceTasks() method in the MapReduce program (see the sketch after this list)
  • In addition to the map and reduce tasks, two more tasks are also created :- 1) a job setup task and 2) a job cleanup task
  • These tasks are run by tasktrackers: they run code to set up the job before any map task runs and to clean up after all reduce tasks have completed
  • Job setup task :- creates the temporary working space for task output and the final output directory for the job
  • Job cleanup task :- deletes the temporary working space/output space of the tasks
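
As a rough illustration of where these task counts come from: the reduce count is simply whatever setNumReduceTasks() asks for, while the map count follows from the input splits. The sketch below mirrors FileInputFormat's split-size rule, splitSize = max(minSize, min(maxSize, blockSize)); the 64 MB block and 200 MB file are made-up numbers.

// A small sketch mirroring the split-size rule; sizes are illustrative only.
public class SplitCountSketch {
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;            // 64 MB HDFS block (MR1-era default)
        long fileSize  = 200L * 1024 * 1024;           // a hypothetical 200 MB input file
        long size      = splitSize(1, Long.MAX_VALUE, blockSize);
        long mapTasks  = (fileSize + size - 1) / size; // one map task per split => 4 here
        System.out.println("map tasks created = " + mapTasks);
        // The reduce-task count, by contrast, comes straight from the job:
        //     job.setNumReduceTasks(2);   // the scheduler will create exactly 2 reduce tasks
    }
}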

At this point the job scheduler has created the map and reduce tasks, plus the two extra tasks, which means our job initialization has completed successfully and it is now time for task assignment.

3. Task Assignment :-

  • Tasktrackers send a signal to the jobtracker at regular intervals to tell it that they are alive (these signals also carry more useful information, like task status and whether the tasktracker is ready for a new task); these signals are known as heartbeats
  • In return for these heartbeat signals, the jobtracker assigns a task to the tasktracker as part of the heartbeat return value
  • Note :- these heartbeat signals run in a loop (periodically)
  • With the help of a scheduling algorithm (by default, one that maintains a priority list of jobs), the jobtracker chooses a job and then chooses a task for that job
  • Each tasktracker has a fixed number of slots that are used to execute tasks (map/reduce); see the configuration sketch after this list
  • Note :- by default, the scheduler fills empty map task slots before reduce slots
  • Important :- for a reduce task there is no consideration of data locality
  • But for a map task, the jobtracker takes account of the tasktracker's location in the network and chooses a task whose input split is as close as possible to that tasktracker. In the optimal case the task is data-local, i.e. it runs on the same node where its input split resides
  • Alternatively, the task may be rack-local :- i.e. on the same rack as the split, but not on the same node
  • Some tasks are neither data-local nor rack-local
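
For reference, the number of slots a tasktracker advertises in its heartbeat comes from its own configuration (mapred-site.xml). The sketch below simply reads the two MR1 properties involved, falling back to their usual default of two slots of each kind:

import org.apache.hadoop.conf.Configuration;

// Sketch: reading the MR1 slot properties; on a real cluster these are set per tasktracker node.
public class SlotConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int mapSlots    = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("map slots = " + mapSlots + ", reduce slots = " + reduceSlots);
    }
}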

Now that assignment is done, it's time to execute the tasks :-

4. Task Execution :-

  • Once the tasktracker has been assigned a task, it is time to run it
  • First, it localizes the job JAR by copying it from HDFS to the tasktracker's filesystem; it also copies any files needed from the distributed cache
  • After the above step, the tasktracker creates a local directory (known as the working directory) for the task and un-jars the JAR file into it
  • Now it creates an instance of TaskRunner to execute the task
  • The TaskRunner launches a new JVM for each task, so that a failing task does not affect the tasktracker
  • Each task performs the setup and cleanup actions (explained under job initialization)
  • The OutputCommitter comes into the picture when the setup and cleanup actions take place; once the cleanup action completes, the output has been written to the final location for that task
  • The commit protocol ensures that when speculative execution is enabled, only one of the duplicate task attempts is committed and the other is aborted (see the sketch after this list)
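
Here is a small sketch of the classic MapReduce configuration switches behind that last point; the property names are the MR1 ones, and the job name is made up. The commit protocol then decides which attempt's output survives.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: toggling speculative execution on a job (MR1 property names).
public class SpeculationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);     // default is true
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false); // e.g. disable for reducers
        Job job = new Job(conf, "speculation-sketch");
        // With speculation on, two attempts of the same slow task may run in parallel;
        // the OutputCommitter promotes the output of only one attempt to the final
        // output directory and aborts the other.
    }
}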

MapReduce jobs are long-running jobs, and hence we need to check their progress and status in a timely manner. Next we are going to explain progress and status updates.

5. Progress and Status Updates :-

  • When a task is running, it keeps track of its progress, that is, the proportion of the task completed
  • For a map task, this is the proportion of the input that has been processed
  • For a reduce task it is more complex, but the system estimates it by dividing the total progress into three parts, corresponding to the three phases of the shuffle
  • Tasks also have a set of counters that count various events as the task runs (see the mapper sketch after this list)
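
As an example of such counters, a task can increment a user-defined counter through its Context alongside the built-in ones. A minimal mapper sketch (the enum and the "empty record" check are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper that bumps a custom counter as it runs.
public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    enum MyCounters { EMPTY_RECORDS }

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            // Counter updates travel with the task's regular status reports to the
            // tasktracker, and on to the jobtracker as part of the heartbeat.
            context.getCounter(MyCounters.EMPTY_RECORDS).increment(1);
            return;
        }
        context.write(new Text(value.toString()), ONE);
    }
}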

6. Job Completion :-

  • When the jobtracker receives a notification that the last task for a job is complete, it changes the status of the job to "successful"
  • The job then prints a message to tell the user and returns from waitForCompletion()
  • Job statistics and counters are printed to the console
  • Finally, the jobtracker cleans up its working state for the job and instructs the tasktrackers to do the same.

And with that, we have completed our job cycle in MapReduce.

Hope you get some useful knowledge/information from this article. I will come up with the MapReduce phases in the next article.

Till then, thanks for your time.

cheers 
