Tag Archives: DistributedCache

Use Hadoop DistributedCache to cache files in MapReduce

DistributedCache is a very useful Hadoop feature that enables you to pass resource files to each mapper or reducer.

For example, you have a file stopWordList.txt that contains all the stop words you want to exclude when you do word count. And In your reducer, you want to check each value passed by mapper, if the value appears in the stop word list, we pass it and goes to the next value.

In order to use DistributedCache, first you need to set the file in the job configuration driver:

Continue reading