Hadoop :- Tuning of MapReduce programs

Hi, please read this as a continuation of :- https://haritbigdata.wordpress.com/2015/07/21/hadoop-inside-mapreduce-process-of-shuffling-sorting-part-ii/ In this blog I am going to explain some features of MapReduce which can help us in tuning and optimizing MapReduce programs. Here we go :-

  • The most important process in a MapReduce program is the shuffling of the outputs produced by the map function.
  • So we mainly need to concentrate on the map phase for better optimization of MapReduce programs.

Hence, we should do the following things for optimization :-

  1. We should give as much memory as possible to the shuffle process, while keeping in mind that the map and reduce functions also need sufficient memory.
  2. The amount of memory given to the JVM for map and reduce tasks is set by mapred.child.java.opts; we should make this value as large as the node's memory allows.
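As a sketch, the task JVM heap can be raised in mapred-site.xml like this (the -Xmx value below is only an example; size it to your nodes' actual memory and task slots):

```xml
<!-- mapred-site.xml : illustrative value only -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>  <!-- heap per map/reduce task JVM -->
</property>
```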
  3. Map side optimization :-

a) Optimization can be done by minimizing multiple spills, which can be controlled by the io.sort.* properties.
b) We can use counters to check the count of spilled records.
c) io.sort.mb :- the amount of memory buffer used while sorting the map output; we should increase it to reduce spills.
d) io.sort.spill.percent :- the threshold of memory-buffer usage after which records start to spill to disk. Default value :- 0.80
e) io.sort.factor :- the property which controls how many output streams the map side merges at once. We can increase this up to 100 for optimization.
f) mapred.compress.map.output :- we should compress the output of the map phase; it saves space as well as speeds up the transfer of data between tasks.

The above are the properties which can be used to optimize the map-side tasks. The following are the main reduce-phase properties which can help us optimize the reduce phase as well :-
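Putting the map-side settings above together, a minimal mapred-site.xml sketch could look like this (the numbers are illustrative, not recommendations; tune them per workload):

```xml
<!-- Map-side shuffle tuning : example values only -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>            <!-- sort buffer in MB (default 100) -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value>           <!-- spill when buffer is 80% full -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>            <!-- streams merged at once (default 10) -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>           <!-- compress map output -->
</property>
```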

  • mapred.reduce.parallel.copies :- default value 5 :- the number of threads used to copy map outputs to the reduce task node.
  • mapred.reduce.copy.backoff :- default value 300 :- the maximum time in seconds to retrieve a map output for a reducer before declaring the fetch failed.
  • io.sort.factor :- as on the map side, this property controls how many streams are merged at once. We can increase this up to 100 for optimization.
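The reduce-side properties above can likewise be sketched in mapred-site.xml (again, example values only):

```xml
<!-- Reduce-side shuffle tuning : example values only -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>             <!-- copier threads (default 5) -->
</property>
<property>
  <name>mapred.reduce.copy.backoff</name>
  <value>300</value>            <!-- seconds before a fetch is declared failed -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>            <!-- merge width for reduce-side merges -->
</property>
```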

The above are the properties which can be used to optimize the reduce-side tasks. I hope these properties help you optimize your MapReduce tasks/programs. Thanks, Cheers. 🙂
