Monthly Archives: November 2013

How to use Hadoop MultipleOutputs

Just like MultipleInputs, Hadoop also supports MultipleOutputs. Thanks to this symmetry, a single MapReduce job can write different data, in different formats, to separate outputs.

This feature is very easy to use. As before, I will mainly use Java code to demonstrate it, and I hope the code explains itself 🙂

Note: I wrote and ran the following code on Hadoop 1.0.3, but it should work on 0.20.205 as well.

Continue reading →

Quickly batch-update Douban movie ratings using JavaScript

Still works as of 2016.05.06

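I recently ran into a problem: I wanted to rate the movies I had marked as watched on Douban long ago, but changing them one by one is painfully slow. So I wrote a small plugin-style script; you only need to run it in the Firebug console (it should work in Chrome too).

Prerequisite: log in to your Douban account and open your own movie list (it must be in list mode).

Code: (the code below assumes jQuery, which Douban supports)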

Continue reading →

Use Secondary Sort to keep different inputs in order in Hadoop

SecondarySort is a technique that lets you control the order in which inputs arrive at the Reducers.

For example, suppose we want to join two different datasets, where one dataset contains an attribute that the other dataset needs. For simplicity, we call the first dataset the ATTRIBUTE set and the other the DATA set. Since the ATTRIBUTE set is also very large, it’s not practical to put it in the Distributed Cache.

Now we want to join these two tables so that each record in the DATA set gets its ATTRIBUTE. If we don’t use SecondarySort, the DATA and ATTRIBUTE records for a key arrive at the Reducer in arbitrary order after the map step. To append the ATTRIBUTE to each DATA record, we would then have to hold all the DATA records in memory and assign the ATTRIBUTE to them only later, when we finally meet it.
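To make the idea concrete, here is a minimal sketch in plain Java that only simulates the ordering; it uses no Hadoop classes and is not the code from the full post. The names `SecondarySortSketch`, `Record`, `ATTRIBUTE`, `DATA` and `join` are all made up for illustration. Each record carries a composite key (join key plus a tag), and sorting by join key and then by tag puts the ATTRIBUTE record ahead of the DATA records for the same key, so the "reducer" loop only has to remember one attribute at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of the secondary-sort idea; no Hadoop classes are used.
class SecondarySortSketch {

    // Hypothetical tags: ATTRIBUTE must sort before DATA within one join key.
    static final int ATTRIBUTE = 0;
    static final int DATA = 1;

    // One map-output record: composite key (joinKey, tag) plus its value.
    static final class Record {
        final String joinKey;
        final int tag;
        final String value;

        Record(String joinKey, int tag, String value) {
            this.joinKey = joinKey;
            this.tag = tag;
            this.value = value;
        }
    }

    // Order by join key first, then by tag, like a composite-key sort comparator.
    static final Comparator<Record> SORT_ORDER =
            Comparator.comparing((Record r) -> r.joinKey)
                      .thenComparingInt(r -> r.tag);

    // Streams over the sorted records and keeps only the current key's attribute.
    static List<String> join(List<Record> mapOutput) {
        List<Record> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER); // stable sort: DATA records keep their relative order

        List<String> joined = new ArrayList<>();
        String currentKey = null;
        String currentAttribute = null; // stays null if a key has no ATTRIBUTE record
        for (Record r : sorted) {
            if (!r.joinKey.equals(currentKey)) {
                // A new join key starts: forget the previous key's attribute.
                currentKey = r.joinKey;
                currentAttribute = null;
            }
            if (r.tag == ATTRIBUTE) {
                currentAttribute = r.value;
            } else {
                // The attribute was already seen, so no DATA buffering is needed.
                joined.add(r.joinKey + "\t" + r.value + "\t" + currentAttribute);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Records in the arbitrary order they might leave the map step.
        List<Record> mapOutput = Arrays.asList(
                new Record("k1", DATA, "d1"),
                new Record("k2", DATA, "d3"),
                new Record("k1", ATTRIBUTE, "a1"),
                new Record("k1", DATA, "d2"),
                new Record("k2", ATTRIBUTE, "a2"));
        for (String line : join(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

In a real Hadoop job the same effect would presumably come from a composite key class together with a partitioner and a grouping comparator that look only at the join key, so that all records for one key reach the same reduce call in this order.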

Continue reading →