Monthly Archives: November 2013

How to use Hadoop MultipleOutputs

Just as Hadoop supports MultipleInputs, it also supports MultipleOutputs. Thanks to this symmetry, we can write different data and formats from the same MapReduce job.

This feature is very easy to use. As before, I will mainly use Java code to demonstrate the usage, and I hope the code can explain itself 🙂

Note: I wrote and ran the following code using Hadoop 1.0.3, but it should work in 0.20.205 as well.

Continue reading

Quickly batch-edit Douban movie ratings using JavaScript

Still works as of 2016.05.06




Code: (the code below assumes jQuery, which Douban supports)

Continue reading

Use Secondary Sort to keep different inputs in order in Hadoop

SecondarySort is a technique that lets you control the order in which inputs arrive at the Reducers.

For example, suppose we want to join two different datasets, where one dataset contains an attribute that the other dataset needs. For simplicity, we call the first dataset the ATTRIBUTE set and the other the DATA set. Since the ATTRIBUTE set is also very large, it’s not practical to put it in the Distributed Cache.

Now we want to join these two tables: for each record in the DATA set, we get its ATTRIBUTE. If we don’t use SecondarySort, the DATA and ATTRIBUTE records arrive at the Reducer in arbitrary order after the map step. So to append the ATTRIBUTE to each DATA record, we would need to store all the DATA records in memory and assign the ATTRIBUTE to them only later, when we finally meet it.
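The ordering idea can be sketched in plain Java, without any Hadoop classes. This is only a minimal simulation under assumptions of my own: each map output carries a composite key of (join key, tag), where tag 0 marks an ATTRIBUTE record and tag 1 marks a DATA record. The names `SecondarySortSketch`, `Rec` and `join` are hypothetical. In a real job, the sorting would be done by the framework with a custom sort comparator, and partitioning and grouping would use the join key only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    // Hypothetical record: joinKey is the natural key; tag 0 = ATTRIBUTE, tag 1 = DATA.
    public static final class Rec {
        final String joinKey;
        final int tag;
        final String value;

        public Rec(String joinKey, int tag, String value) {
            this.joinKey = joinKey;
            this.tag = tag;
            this.value = value;
        }
    }

    // Sort by join key first, then by tag, so a key's ATTRIBUTE precedes its DATA.
    static final Comparator<Rec> SORT_ORDER =
            Comparator.comparing((Rec r) -> r.joinKey).thenComparingInt(r -> r.tag);

    // Simulates the reduce side: a single pass that remembers only the current ATTRIBUTE.
    public static List<String> join(List<Rec> mapOutput) {
        List<Rec> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER); // stands in for the shuffle/sort phase
        List<String> joined = new ArrayList<>();
        String currentKey = null;
        String currentAttribute = null;
        for (Rec r : sorted) {
            if (!r.joinKey.equals(currentKey)) {
                currentKey = r.joinKey;   // a new group starts here
                currentAttribute = null;  // stays null if the key has no ATTRIBUTE
            }
            if (r.tag == 0) {
                currentAttribute = r.value;
            } else {
                // DATA records are emitted immediately; none are buffered.
                joined.add(r.joinKey + "\t" + r.value + "\t" + currentAttribute);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Rec> mapOutput = new ArrayList<>();
        mapOutput.add(new Rec("k2", 1, "d3"));
        mapOutput.add(new Rec("k1", 1, "d1"));
        mapOutput.add(new Rec("k1", 0, "a1"));
        mapOutput.add(new Rec("k2", 0, "a2"));
        mapOutput.add(new Rec("k1", 1, "d2"));
        for (String line : join(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

Because the ATTRIBUTE record sorts ahead of the DATA records for the same key, the loop only ever holds one ATTRIBUTE value in memory, no matter how many DATA records share that key.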

Continue reading