Hi, please read this in continuation of https://haritbigdata.wordpress.com/2015/07/21/hadoop-inside-mapreduce-process-of-shuffling-sorting-part-ii/ . In this blog I am going to explain some features of MapReduce that can help us tune and optimize MapReduce programs. Here we go :-
- The most expensive part of a MapReduce program is the shuffling of the outputs produced by the map function.
- So we need to concentrate mainly on the map phase (and the shuffle around it) for better optimization of MapReduce programs.
Hence, we should do the following things for optimization :-
- We should give as much memory as possible to the shuffle process, but we also need to keep in mind that the map and reduce functions themselves get sufficient memory to run.
- The amount of memory given to the JVM that runs map and reduce tasks is set by mapred.child.java.opts (default -Xmx200m); we should raise this value as far as the node's physical memory and task-slot count allow.
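As a sketch, on Hadoop 1.x this can be set cluster-wide in mapred-site.xml; the 1 GB heap below is only an illustrative value, not a recommendation for every cluster:

```xml
<!-- mapred-site.xml: JVM options for child map/reduce tasks.
     -Xmx1024m is an example value; size it to your node's RAM
     and the number of map/reduce slots per node. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

The same property can also be overridden per job on the command line (e.g. with -D mapred.child.java.opts=-Xmx1024m), which is often safer than changing the cluster default.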
- Map-side optimization :-
a) Optimization can be done by minimizing multiple spills to disk, which are controlled by the io.sort.* properties.
b) We can use the built-in spill counters (e.g. SPILLED_RECORDS) to check the count of spilled records.
c) We should increase io.sort.mb :- the amount of memory buffer (in MB, default 100) used while sorting the map output.
d) io.sort.spill.percent :- the threshold of buffer usage at which records start to spill to disk. Default value :- 0.80
e) io.sort.factor :- the maximum number of streams merged at once when sorting map output files. We can increase this up to around 100 for optimization.
f) mapred.compress.map.output :- we should compress the output of the map phase; it saves disk space as well as speeding up the transfer of data between tasks.
Above are the properties which can be used to optimize the map-side tasks. The following are the main reduce-phase properties which can help us optimize the reduce phase as well :-
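The map-side settings above can be sketched together in mapred-site.xml; the property names are the Hadoop 1.x ones used in this post, and the values are illustrative examples, not universal recommendations:

```xml
<!-- mapred-site.xml: map-side shuffle tuning sketch (Hadoop 1.x
     property names; values are examples, tune per workload). -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>   <!-- sort buffer in MB (default 100) -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value>  <!-- buffer fill threshold that triggers a spill -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>   <!-- max streams merged at once (default 10) -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>  <!-- compress intermediate map output -->
</property>
```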
- mapred.reduce.parallel.copies :- default value 5 :- the number of threads used to copy map outputs to the reduce task node.
- mapred.reduce.copy.backoff :- default value 300 :- the maximum time in seconds a reducer spends trying to fetch a map output before declaring the fetch failed.
- io.sort.factor :- this same property also controls how many streams are merged at once during the reduce-side merge; we can increase it up to around 100 here too.
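The reduce-side properties above can likewise be sketched in mapred-site.xml; again these are Hadoop 1.x property names and the values are examples only:

```xml
<!-- mapred-site.xml: reduce-side tuning sketch (Hadoop 1.x names;
     values are illustrative, not tuned recommendations). -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>   <!-- fetch threads per reducer (default 5) -->
</property>
<property>
  <name>mapred.reduce.copy.backoff</name>
  <value>300</value>  <!-- max seconds to retry a map-output fetch -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>  <!-- streams merged at once in the reduce-side merge -->
</property>
```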
Above are the properties which can be used to optimize the reduce-side tasks. Hope these properties help you optimize your MapReduce tasks/programs. Thanks, Cheers. 🙂