Hadoop: Program for Data/Output Compression

Hi,

Important points about output compression in MapReduce:-

  • It reduces the space needed to store data
  • It speeds up data transfer across the network
  • The main optimization options are compression levels 1 (optimized for speed) through 9 (optimized for space)
  • eg :- gzip -1 filename
  • Codecs :- in Hadoop, a codec is an implementation of a compression/decompression algorithm (e.g. GzipCodec, BZip2Codec)

Below is a sample program for compressing job output in Hadoop MapReduce.

Input file :-

user@ubuntuvm:~/Desktop/hadoop$ hadoop fs -cat /testemp
15/08/21 09:58:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
003 Amit Delhi India 12000
004 Anil Delhi India 15000
005 Deepak Delhi India 34000
006 Fahed Agra India 45000
007 Ravi Patna India 98777
008 Avinash Punjab India 120000
009 Saajan Punjab India 54000
001 Harit Delhi India 20000
002 Hardy Agra India 20000
110 Kaushal Agra India 90000
113 Abhi Punjab India 12999
141 Ajay Patna India 120000
003 Amit Delhi India 12000
004 Anil Delhi India 15000
005 Deepak Delhi India 34000
006 Fahed Agra India 45000
007 Ravi Patna India 98777
008 Avinash Punjab India 120000
009 Saajan Punjab India 54000
001 Harit Delhi India 20000
002 Hardy Agra India 20000
110 Kaushal Agra India 90000
113 Abhi Punjab India 12999
141 Ajay Patna India 120000

Aim :- remove duplicate records and compress the output data

Mapper :- mapcompression.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class mapcompression extends Mapper<LongWritable, Text, IntWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
        // Split the line on whitespace: [id, name, city, country, salary]
        String[] str = value.toString().split("\\s+");
        // Emit (id, name); duplicate ids will be grouped at the reducer
        ctx.write(new IntWritable(Integer.parseInt(str[0])), new Text(str[1]));
    }
}

Reducer :- reducecompression.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class reducecompression extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    public void reduce(IntWritable key, Iterable<Text> itr, Context ctx) throws IOException, InterruptedException {
        // All values for a key arrive together; keeping only the last one
        // collapses duplicate records into a single output line per id
        String s = "";
        for (Text val : itr) {
            s = val.toString();
        }
        ctx.write(key, new Text(s));
    }
}
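The deduplication effect of this reducer can be sketched outside Hadoop: the shuffle groups all values under one key, and the loop keeps a single value per key. A minimal stand-alone sketch (the class name and sample records are made up for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

public class DedupSketch {
    public static void main(String[] args) {
        // Sample records with a duplicate id, mimicking the input file above
        String[] records = {"003 Amit", "004 Anil", "003 Amit", "005 Deepak"};

        // TreeMap plays the role of the shuffle: one slot per key, sorted order
        Map<Integer, String> grouped = new TreeMap<>();
        for (String r : records) {
            String[] parts = r.split("\\s+");
            grouped.put(Integer.parseInt(parts[0]), parts[1]); // last value wins
        }

        // One line per unique id, like the reducer's output
        grouped.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```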

Main Class :-

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class maincompression {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "compress");
        job.setJarByClass(maincompression.class);
        job.setMapperClass(mapcompression.class);
        job.setReducerClass(reducecompression.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        // Compress the job output with gzip; part files get a .gz extension
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
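Output compression can also be switched on through configuration properties instead of the `FileOutputFormat` helpers. A sketch, assuming Hadoop 2.x property names (set these on the `Configuration` before creating the `Job`; `CompressionCodec` is `org.apache.hadoop.io.compress.CompressionCodec`):

```java
// Equivalent property-based way to compress the final output
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
        GzipCodec.class, CompressionCodec.class);

// Compressing intermediate map output is often worthwhile too,
// since it shrinks the data shuffled across the network
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        GzipCodec.class, CompressionCodec.class);
```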

RUN :- 

Step 1 :- 

hadoop jar compressionhadoop.jar maincompression /testemp /mainoutput

Step 2 :- 

hadoop fs -copyToLocal /mainoutput/part-r-00000.gz /home/user/Desktop/hadoop/main/

Step 3 :- 

user@ubuntuvm:~/Desktop/hadoop$ gunzip -c part-r-00000.gz
1 Harit
2 Hardy
3 Amit
4 Anil
5 Deepak
6 Fahed
7 Ravi
8 Avinash
9 Saajan
110 Kaushal
113 Abhi
141 Ajay
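As an aside, the copy-to-local step can be skipped: `hadoop fs -text` recognizes common codecs by extension and decompresses while printing, e.g.:

```shell
hadoop fs -text /mainoutput/part-r-00000.gz
```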
