{"id":197,"date":"2012-06-20T17:53:44","date_gmt":"2012-06-20T21:53:44","guid":{"rendered":"http:\/\/lichun.cc\/blog\/?p=197"},"modified":"2012-06-20T17:53:44","modified_gmt":"2012-06-20T21:53:44","slug":"use-hadoop-distributedcache-to-cache-files-in-mapreduce","status":"publish","type":"post","link":"https:\/\/www.lichun.cc\/blog\/2012\/06\/use-hadoop-distributedcache-to-cache-files-in-mapreduce\/","title":{"rendered":"Use Hadoop DistributedCache to cache files in MapReduce"},"content":{"rendered":"<p><b>DistributedCache<\/b> is a very useful Hadoop feature that lets you pass resource files to each mapper or reducer.<\/p>\n<p>For example, suppose you have a file <b>stopWordList.txt<\/b> that contains all the stop words you want to exclude from a word count. In your reducer, you want to check each value passed by the mapper: if the value appears in the <b>stop word list<\/b>, you skip it and move on to the next value.<\/p>\n<p>To use <b>DistributedCache<\/b>, first register the file with the job configuration in the <b>driver<\/b>:<\/p>\n<p><!--more--><\/p>\n<pre>\n\/\/ Location of the stop word file (here, an S3 bucket)\nPath stopWordListPath = new Path(\"s3:\/\/my-bucket\/stopWordList.txt\");\n\/\/ Register the file with the job so it is cached for the tasks\nDistributedCache.addCacheFile(stopWordListPath.toUri(), job.getConfiguration());\n<\/pre>\n<p>Once you have done that in the driver, Hadoop automatically sends the file to each node.<\/p>\n<p>Then, in your reducer, read the file and put the words in a Set. I personally prefer to do this in the <b>setup()<\/b> method, which is called before any call to <b>reduce()<\/b>.<\/p>\n<pre>\n@Override\nprotected void setup(Context context) throws IOException {\n     \/\/ Local paths of the cached files on this node\n     Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());\n     \/\/ Use the returned Path objects to open the file and read it.\n}\n<\/pre>\n<p>Very easy, right?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>DistributedCache is a very useful Hadoop feature that enables you to pass resource files to each mapper or reducer. 
For example, you have a file stopWordList.txt that contains all the stop words you want to exclude when you do a word count. In your reducer, you want to check each value passed by the mapper, if [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[19],"tags":[39,16,83],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p2s9sh-3b","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/197"}],"collection":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/comments?post=197"}],"version-history":[{"count":0,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/197\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/media?parent=197"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/categories?post=197"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/tags?post=197"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}