Use Hadoop DistributedCache to cache files in MapReduce

DistributedCache is a very useful Hadoop feature that enables you to pass resource files to each mapper or reducer.

For example, say you have a file stopWordList.txt that contains all the stop words you want to exclude when you do a word count. In your reducer, you want to check each word passed in by the mappers; if it appears in the stop word list, you skip it and move on to the next one.

To use DistributedCache, you first need to register the file in the job configuration, in your driver:

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

Path stopWordListPath = new Path("s3://my-bucket/stopWordList.txt");
DistributedCache.addCacheFile(stopWordListPath.toUri(), job.getConfiguration());
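(Side note: if you are on Hadoop 2.x or later, the DistributedCache class is deprecated there and the same call lives on the Job object itself. A minimal sketch of the newer form:)

job.addCacheFile(new URI("s3://my-bucket/stopWordList.txt"));
// In the task, the cached file URIs come back from context.getCacheFiles() instead.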

Once you do that in the driver, Hadoop will automatically ship the file to the local disk of every node that runs your tasks.
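If you would rather open the cached file by a fixed name instead of scanning the returned paths, DistributedCache can also create a symlink in each task's working directory. A quick sketch, where the #stopwords fragment is just a name I picked for illustration:

DistributedCache.addCacheFile(new URI("s3://my-bucket/stopWordList.txt#stopwords"), job.getConfiguration());
DistributedCache.createSymlink(job.getConfiguration());
// Each task can now open the file simply as new File("stopwords").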

Then, in your reducer, you read the file and put the words in a Set. I personally prefer to do this in the setup() method, which is called once before any reduce() calls:

private final Set<String> stopWords = new HashSet<String>();

@Override
protected void setup(Context context) throws IOException {
    // The cached copy lives on the task node's local disk, so plain java.io works here.
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
    String word;
    while ((word = reader.readLine()) != null) stopWords.add(word.trim());
    reader.close();
}
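To close the loop, here is a minimal sketch of how the Set gets used on the reduce side, assuming a standard word count where stopWords is the field populated in setup() above:

@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Skip any word that appears in the stop word list.
    if (stopWords.contains(key.toString())) return;
    int sum = 0;
    for (IntWritable value : values) sum += value.get();
    context.write(key, new IntWritable(sum));
}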

Very easy, right?
