Memory Fragments @ Film

The main reason I keep coming back to Douban is to log the movies I have watched. A few dozen more and the count will hit one thousand. My bond with film, naturally, begins where that number first started climbing.

Sometimes I wonder what the earliest movie I ever watched was, but I honestly can't remember. After I transferred to Nanyang School in elementary school, the school's TV station would screen rented VCDs on weekends; whenever I stayed in the dorm over a weekend instead of going home, I watched those films in the classroom with my classmates. It was around fifth grade, I think, when my parents' work unit organized an outing to see Titanic; they came back saying it was wonderful, and later the school TV station screened it too — its grand scenes and sweeping music left a deep impression on me. One of my favorite movies, A Chinese Odyssey (Dahua Xiyou), I first saw in the upper grades of elementary school, on a videotape my father brought home. I have since read many articles saying A Chinese Odyssey was poorly received when it first came out, but I truly loved it the very first time I watched it, back in elementary school.

Continue reading

Use Hive Partitions to Read/Write with Subfolders

We all know that Hive reads and writes data at the folder level; the limitation is that, by default, it only reads/writes files from/to the folder you specify. But sometimes the input data is organized in subfolders, and Hive cannot read it if you only specify the root folder; or you may want to write the output to separate folders instead of putting it all in one folder.

For example, suppose we have sales data dumped to HDFS (or S3) with a path structure like sales/city=BEIJING/day=20140401/data.tsv. As you can see, the data is partitioned by city and day. We could copy every data.tsv into the same folder, but we would have to do the copying and rename the files to avoid conflicts, which becomes a pain when the files are numerous and huge. And even if we did copy all the data.tsv files into one folder, on the output side we would still want the results separated into different folders by city and day. How do we do that?

Can Hive be smart enough to read all the subfolders' data and write the output to separate folders? The answer is yes.
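As a preview, the usual approach is to enable recursive input on the read side and dynamic partitioning on the write side. A minimal sketch, assuming a hypothetical partitioned target table sales_out with placeholder column names (these setting names exist in mainstream Hive, but exact names and defaults vary by version):

```sql
-- read: let Hive descend into subdirectories under the input path
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

-- write: let Hive create one output folder per (city, day) partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE sales_out PARTITION (city, day)
SELECT key_id, amount, city, day FROM sales;
```

With dynamic partitioning, the partition columns come last in the SELECT list, and Hive lays out the output as sales_out/city=.../day=.../ automatically.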

Continue reading

Memory Fragments @ Books

My earliest reading memory must be from before I started school: the little picture books my uncle had read, dug out of a chest at my grandfather's house. Those palm-sized pictures began my story of reading.

The picture books I still remember include The Eight Immortals Cross the Sea, a screenshot edition of the Hong Kong TV series; The Mighty King, about how, as the foreign powers invaded modern China, men of spirit won back honor in the boxing ring; and of course The Adventures of Tintin, the story of the innocent-faced, clever and brave onion-headed hero leading little Snowy through his adventures.

Continue reading

How to use Hadoop MultipleOutputs

Just like MultipleInputs, Hadoop also supports MultipleOutputs; thanks to this symmetry, we can output different data/formats in the same MapReduce job.

It’s very easy to use this handy feature. As before, I will mainly use Java code to demonstrate the usage; I hope the code can explain itself 🙂

Note: I wrote and ran the following code on Hadoop 1.0.3, but it should work on 0.20.205 as well.
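Before the Java code, the core idea can be sketched outside Hadoop in plain Python (the names `outputs` and `write` below are made up for illustration and are not the Hadoop API): each record is routed to a named output stream instead of one single output.

```python
import io

# In-memory streams stand in for the per-name output files that
# Hadoop's MultipleOutputs would create on HDFS.
outputs = {}

def write(name, key, value):
    """Route one record to the named output (hypothetical helper)."""
    stream = outputs.setdefault(name, io.StringIO())
    stream.write(f"{key}\t{value}\n")

# A "reducer" emitting two differently shaped outputs in the same job:
write("text", "apple", 3)
write("seq", "apple", "0x03")
write("text", "pear", 5)
```

Each named stream ends up holding only its own records, which is exactly the separation MultipleOutputs gives you at the file level.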

Continue reading

Quickly Batch-Edit Douban Movie Ratings using Javascript

Still works as of 2016-05-06

Recently I ran into a problem: I wanted to rate the movies I had marked as watched on Douban early on, but editing them one by one is painfully slow, so I wrote a small plugin script. Just run it in Firebug's console (it should work in Chrome too).

Prerequisite: log in to your Douban account and open your movie list (it needs to be in list mode).

[screenshot: the movie list before the change]

Code: (the code below uses jQuery, which Douban already loads)

Continue reading

Use Secondary Sort to Keep Different Inputs in Order in Hadoop

Secondary sort is a technique that lets you control the order of the inputs that arrive at the reducers.

For example, suppose we want to join two different datasets, where one dataset contains an attribute that the other dataset needs. For simplicity, call the first dataset the ATTRIBUTE set and the other the DATA set. Since the ATTRIBUTE set is also very large, it's not practical to put it in the distributed cache.

Now we want to join these two tables: for each record in the DATA set, we fetch its ATTRIBUTE. Without secondary sort, after the map step the DATA and ATTRIBUTE records arrive at the reducer in arbitrary order, so to append the ATTRIBUTE to each DATA record we would have to hold all the DATA in memory and assign the ATTRIBUTE only once we finally encounter it.
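The trick can be simulated in a few lines of plain Python (a concept sketch with made-up records, not Hadoop code): tag each ATTRIBUTE record with 0 and each DATA record with 1, sort on the composite key (join key, tag), and each key's attribute then always arrives before its data, so no buffering is needed.

```python
# ATTRIBUTE set: join_key -> attribute value; DATA set: join_key -> payload
attributes = [("k1", "red"), ("k2", "blue")]
data = [("k2", "row-a"), ("k1", "row-b"), ("k1", "row-c")]

# Composite key (join_key, tag): tag 0 sorts ATTRIBUTE before tag 1 DATA.
tagged = [(k, 0, v) for k, v in attributes] + [(k, 1, v) for k, v in data]
tagged.sort(key=lambda t: (t[0], t[1]))

joined, current_attr = [], None
for key, tag, value in tagged:
    if tag == 0:
        current_attr = value                  # remember this key's attribute
    else:
        joined.append((key, value, current_attr))  # attach it to each DATA row
```

In real Hadoop the same effect comes from a composite key class plus custom partitioner and grouping comparator, so the framework does the sorting for you.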

Continue reading

Remembering Xianjian: a Poem

Written at Peking University on 2007.05.27; originally posted on Xiaonei, copied here as a keepsake

A sword of the word "Xian," wielded at the fore

Ten thousand little demons come seeking trouble

A furious shout, charge them into chaos, shadows torn apart

Ling'er lingers: within the tenderness, that smiling face

Moonlight spills: at midnight, a robe draped over her shoulders

And Anu too, a sprite dancing light

Love brims over; we envy not the immortals

She is gone; to whom do the tears go? Stranded

So be it

All of it was fate from the start

The word "Xian" is hard to fulfill; the sword is sheathed

Yet this feeling stays long in the heart

————————-

All afternoon, scenes from Xianjian kept surfacing in my mind: Yueru's cold body on the Holy Aunt's bed, Ling'er perishing together with the Water Demon Beast, Yueru appearing in the snow holding Yiru... "The three of us will never part," "Eat together till we're old, play together till we're old." Yueru's happiness is so heartbreaking. I always tell people I am on Ling'er's side, but how can one really draw that line? The emotion they brought me is beyond measure.

Understand Bayes Theorem (prior/likelihood/posterior/evidence)

Bayes' theorem is a very common and fundamental theorem in data mining and machine learning. Its formula is pretty simple:

P(X|Y) = ( P(Y|X) * P(X) ) / P(Y), which is Posterior = ( Likelihood * Prior ) / Evidence

So I wondered why each term is named the way it is.

Let’s use an example to find out their meanings.
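As a quick numeric warm-up first, here is the formula applied in Python, with made-up disease-test numbers (X = "has the disease", Y = "test is positive"):

```python
p_x = 0.01            # prior P(X): 1% of people have the disease
p_y_given_x = 0.90    # likelihood P(Y|X): test sensitivity
p_y_given_not_x = 0.10  # false-positive rate P(Y|not X)

# evidence P(Y) via the law of total probability
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# posterior P(X|Y) = likelihood * prior / evidence
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 4))  # 0.0833
```

Even with a positive test, the posterior is only about 8.3%, because the small prior dominates — which is exactly the kind of intuition the term names encode.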

Continue reading

Use a Lookup HashMap in a Hive Script with a UDF

I have been using a custom jar for my MapReduce jobs over the past few years, and because it's pure Java programming, I had a lot of flexibility. But writing Java leaves a lot of code to maintain, and most of my MapReduce jobs are just joins with a little spice in them, so moving to Hive may be a better path.

Problem

The MapReduce job I face here is a left outer join of two different datasets on the same keys. Because it's an outer join, there will be null values, and for those nulls I want to look up default values to assign from a map.

For example, I have two datasets:
dataset 1: KEY_ID CITY SIZE_TYPE
dataset 2: KEY_ID POPULATION
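The intent can be sketched in plain Python (dataset contents and default values below are made up for illustration; the real job would do this lookup inside a Hive UDF holding a HashMap):

```python
# dataset 1: KEY_ID -> (CITY, SIZE_TYPE); dataset 2: KEY_ID -> POPULATION
dataset1 = {"k1": ("BEIJING", "L"), "k2": ("SHANGHAI", "M")}
dataset2 = {"k1": 21_000_000}

# the lookup map supplying defaults for keys the outer join misses
default_population = {"BEIJING": 20_000_000, "SHANGHAI": 24_000_000}

joined = {}
for key, (city, size_type) in dataset1.items():
    population = dataset2.get(key)             # None on a join miss
    if population is None:
        population = default_population[city]  # fall back to the lookup map
    joined[key] = (city, size_type, population)
```

k1 matches and keeps its real population; k2 misses the join and gets the default for SHANGHAI from the map, which is the behavior the UDF needs to replicate.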

Continue reading

Bug with the remote IP in xl2tpd 1.2.6 on Ubuntu 10.10

I tried to set up an L2TP VPN on Ubuntu 10.10 using xl2tpd. I first installed xl2tpd from the repository with apt-get install xl2tpd, which gave me version 1.2.6.

I set the IP range, but when I tried to connect to the VPN server, the remote IP was always 0.0.0.0 (I checked /var/log/syslog). After searching for a while, I found it's actually a bug in xl2tpd, fixed in a later version.
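For reference, the range in question is the one set in /etc/xl2tpd/xl2tpd.conf with lines like these (the addresses are placeholders; adjust them to your network):

```
; fragment of /etc/xl2tpd/xl2tpd.conf
[lns default]
ip range = 192.168.1.128-192.168.1.254   ; addresses handed to clients
local ip = 192.168.1.99                  ; server's own tunnel address
```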

Continue reading