{"id":568,"date":"2013-11-21T17:37:17","date_gmt":"2013-11-21T21:37:17","guid":{"rendered":"http:\/\/www.lichun.cc\/blog\/?p=568"},"modified":"2013-11-21T17:37:17","modified_gmt":"2013-11-21T21:37:17","slug":"how-to-use-hadoop-multipleoutputs","status":"publish","type":"post","link":"https:\/\/www.lichun.cc\/blog\/2013\/11\/how-to-use-hadoop-multipleoutputs\/","title":{"rendered":"How to use Hadoop MultipleOutputs"},"content":{"rendered":"<p>Just like MultipleInputs, Hadoop also supports MultipleOutputs, so a single MapReduce job can write different data or formats to separate outputs.<\/p>\n<p>This feature is easy to use. As before, I will mainly use Java code to demonstrate the usage, and I hope the code explains itself \ud83d\ude42<\/p>\n<p><b>Note: I wrote and ran the following code using Hadoop 1.0.3, but it should work in 0.20.205 as well.<\/b><\/p>\n<p><!--more--><\/p>\n<h1>1. MultipleOutputs class<\/h1>\n<p>First of all, import MultipleOutputs:<br \/>\n<code>import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;<\/code><\/p>\n<h1>2. Introducing <code>MultipleOutputs.addNamedOutput<\/code><\/h1>\n<p>This method takes 5 parameters:<\/p>\n<pre><b>Job job<\/b>\r\n         the Hadoop Job you created\r\n<b>String namedOutput<\/b>\r\n         a unique name for this output; its files will be\r\n         named namedOutput-r-XXXXX\r\n<b>Class&lt;? 
extends OutputFormat&gt; outputFormatClass<\/b>\r\n         the output format class; pass your custom output\r\n         format if you have one, or use Hadoop's\r\n         TextOutputFormat.class for plain text output\r\n<b>Class&lt;?&gt; keyClass<\/b>\r\n         the class of the key; if you don't output a key,\r\n         use <b>NullWritable.class<\/b>\r\n<b>Class&lt;?&gt; valueClass<\/b>\r\n         the class of the value; use your custom value\r\n         class if you have one, or Text.class if the value\r\n         is text<\/pre>\n<h1>3. Code<\/h1>\n<p>The goal here is to split the columns of a given input so that different columns go to different outputs.<\/p>\n<p>Sample data (tab-separated columns):<\/p>\n<p>1 APPLE RED<br \/>\n2 ORANGE BLACK<br \/>\n3 BANANA GREEN<\/p>\n<p>Here I want to separate the <b>fruit<\/b> column and the <b>color<\/b> column.<\/p>\n<h2>3.1 Set up the driver for this MapReduce job:<\/h2>\n<pre>public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {\r\n   Path inputDir = new Path(args[0]);\r\n   Path outputDir = new Path(args[1]);\r\n\r\n   Configuration conf = new Configuration();\r\n\r\n   Job job = new Job(conf);\r\n   job.setJarByClass(MultipleOutputsTest.class);\r\n   job.setJobName(\"MultipleOutputs Test\");\r\n\r\n   job.setMapOutputKeyClass(Text.class);\r\n   job.setMapOutputValueClass(Text.class);\r\n\r\n   job.setMapperClass(myMapper.class);\r\n   job.setReducerClass(myReducer.class);\r\n\r\n   FileInputFormat.setInputPaths(job, inputDir);\r\n   FileOutputFormat.setOutputPath(job, outputDir);\r\n\r\n   <b style=\"color: red;\">MultipleOutputs.addNamedOutput(job, fruitOutputName, TextOutputFormat.class, NullWritable.class, Text.class);<\/b>\r\n   <b style=\"color: red;\">MultipleOutputs.addNamedOutput(job, colorOutputName, TextOutputFormat.class, NullWritable.class, Text.class);<\/b>\r\n\r\n   job.waitForCompletion(true);\r\n}<\/pre>\n<p>The <b style=\"color: 
red;\">fruitOutputName<\/b> and <b style=\"color: red;\">colorOutputName<\/b> are strings I defined, &#8220;fruit&#8221; and &#8220;color&#8221; respectively, so the fruit output files will be named <b>fruit-r-000XX<\/b>.<\/p>\n<h2>3.2 Reducer<\/h2>\n<p>The next important part is the reducer. For a single output we would call <code>context.write(KEY, VALUE)<\/code>, but here we write through the MultipleOutputs object instead.<\/p>\n<pre>\r\npublic static class myReducer extends Reducer&lt;Text, Text, NullWritable, Text&gt; {\r\n    \/\/ Type parameters match the key\/value classes registered with addNamedOutput\r\n    private MultipleOutputs&lt;NullWritable, Text&gt; mos;\r\n\r\n    @Override\r\n    public void setup(Context context) {\r\n        <b style=\"color: red;\">mos = new MultipleOutputs&lt;NullWritable, Text&gt;(context);<\/b>\r\n    }\r\n\r\n    @Override\r\n    public void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException {\r\n        for (Text value : values) {\r\n            String str = value.toString();\r\n            \/\/ Columns are tab-separated: id, fruit, color\r\n            String[] items = str.split(\"\\t\");\r\n\r\n            <b style=\"color: red;\">mos.write(fruitOutputName, NullWritable.get(), new Text(items[1]));<\/b>\r\n            <b style=\"color: red;\">mos.write(colorOutputName, NullWritable.get(), new Text(items[2]));<\/b>\r\n        }\r\n    }\r\n\r\n    @Override\r\n    protected void cleanup(Context context) throws IOException, InterruptedException {\r\n        <b style=\"color: red;\">mos.close();<\/b>\r\n    }\r\n}\r\n<\/pre>\n<p>Pay attention to the <b>setup<\/b> and <b>cleanup<\/b> methods: errors will occur if the MultipleOutputs object is not initialized in setup or not closed in cleanup.<\/p>\n<h2>3.3 Output<\/h2>\n<p>As expected, the output for the sample input is:<\/p>\n<pre>\nfruit-r-00000:\n    APPLE\n    ORANGE\n    BANANA\ncolor-r-00000:\n    RED\n    BLACK\n    GREEN<\/pre>\n<p>Here is the example code I used. 
<a href=\"http:\/\/www.lichun.cc\/blog\/wp-content\/uploads\/2013\/11\/MultipleOutputsTest.zip\">Download<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Just like MultipleInputs, Hadoop also supports MultipleOutputs, so a single MapReduce job can write different data or formats to separate outputs. This feature is easy to use. As before, I will mainly use Java code to demonstrate the usage, and I hope the code explains itself \ud83d\ude42 Note: I wrote and ran the following [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[19],"tags":[16,83,76],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p2s9sh-9a","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/568"}],"collection":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/comments?post=568"}],"version-history":[{"count":20,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/568\/revisions"}],"predecessor-version":[{"id":589,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/568\/revisions\/589"}],"wp:attachment":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/media?parent=568"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lichun.c
c\/blog\/wp-json\/wp\/v2\/categories?post=568"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/tags?post=568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}