Author Archives: purplechun

Remembering Chinese Paladin (仙剑): A Poem

Written 2007.05.27 at Peking University; first posted on Xiaonei, copied here as a keepsake

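One word, "immortal"; the sword flies ahead

Thousands of little demons come looking for trouble

With an angry shout, charge in and scatter them, until no shadow is left whole

Ling lingers near; within the tenderness is that smiling face

Moonlight spills; at midnight, a robe draped over the shoulders

And there is Anu too, a sprite dancing lightly

A heart full of love, with no envy for the immortals

Once gone, to whom are the tears given? Stranded

So be it

Everything was fate from the start

The word "immortal" is hard to make whole; the sword is sheathed

Yet this feeling stays long in the heart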

————————-

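All afternoon, scenes from Chinese Paladin kept coming to mind: Yueru's ice-cold body on the Holy Lady's bed, Ling'er perishing together with the Water Demon Beast, Yueru appearing in the snow with Yiru in her arms... "The three of us will never be apart." "Eat until we're old, play until we're old." Yueru's happiness is heartbreaking. I always tell people I am on Ling'er's side, but how can the line be drawn that clearly? The emotion they gave me can no longer be measured.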

Understand Bayes' Theorem (prior/likelihood/posterior/evidence)

Bayes' Theorem is a common and fundamental theorem in data mining and machine learning. Its formula is pretty simple:

P(X|Y) = ( P(Y|X) * P(X) ) / P(Y), that is, Posterior = ( Likelihood * Prior ) / Evidence

So I wondered why the four terms are given these names.

Let’s use an example to find out their meanings.
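As one possible illustration (a sketch with made-up numbers, not necessarily the example the full post uses), take a hypothetical diagnostic test where X is "has the condition" and Y is "tests positive". The prior, likelihood, and false-positive rate below are assumptions chosen only to make the arithmetic concrete, and `bayes_posterior` is a hypothetical helper name. The evidence P(Y) is obtained by summing over both cases of X.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return (posterior, evidence) for a binary hypothesis X given observation Y."""
    # Evidence P(Y): total probability of Y over the cases X and not-X.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    # Posterior P(X|Y) = Likelihood * Prior / Evidence.
    posterior = likelihood * prior / evidence
    return posterior, evidence


# Hypothetical numbers, chosen for illustration only.
prior = 0.01                # P(X): belief before seeing the test result
likelihood = 0.90           # P(Y|X): chance of a positive test if X is true
false_positive_rate = 0.05  # P(Y|not X): chance of a positive test if X is false

posterior, evidence = bayes_posterior(prior, likelihood, false_positive_rate)
print(f"evidence  P(Y)   = {evidence:.4f}")
print(f"posterior P(X|Y) = {posterior:.4f}")
```

Read this way, the prior is the belief held before the observation, the likelihood says how well the hypothesis explains the observation, the evidence is how probable the observation is overall, and the posterior is the belief after the update.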

Continue reading

Use a lookup HashMap in a Hive script with a UDF

For the past few years I have been writing custom JARs for my MapReduce jobs, and because that is pure Java programming, I have a lot of flexibility. But writing Java results in a lot of code to maintain, and most of my MapReduce jobs are just joins with a little spice added, so moving to Hive may be a better path.

Problem

The MapReduce job I face here is a left outer join of two different datasets on the same key. Because it is an outer join, rows with no match produce null values, and for those nulls I want to look up default values from a map.

For example, I have two datasets:
dataset 1: KEY_ID CITY SIZE_TYPE
dataset 2: KEY_ID POPULATION
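
The intended result can be sketched in plain Python (not Hive or Java) as follows. The sample rows, the `DEFAULT_POPULATION` map, and the choice to key the defaults by SIZE_TYPE are all assumptions made for illustration; the excerpt does not say which column the lookup uses.

```python
# Hypothetical sample data; the excerpt gives only the column names.
dataset1 = [  # KEY_ID, CITY, SIZE_TYPE
    ("1", "Springfield", "small"),
    ("2", "Metropolis", "large"),
    ("3", "Smallville", "small"),
]
dataset2 = {"1": 30000, "2": 8000000}  # KEY_ID -> POPULATION

# Assumed lookup map: default POPULATION per SIZE_TYPE.
DEFAULT_POPULATION = {"small": 10000, "large": 1000000}


def left_outer_join_with_defaults(left_rows, right_by_key, defaults):
    """Left outer join on KEY_ID, filling missing right-side values from a lookup map."""
    joined = []
    for key_id, city, size_type in left_rows:
        population = right_by_key.get(key_id)  # None plays the role of SQL NULL
        if population is None:
            # No match on the right side: fall back to the default for this SIZE_TYPE.
            population = defaults.get(size_type)
        joined.append((key_id, city, size_type, population))
    return joined


result = left_outer_join_with_defaults(dataset1, dataset2, DEFAULT_POPULATION)
for row in result:
    print(row)
```

In Hive, the `defaults.get(...)` step is the part a UDF backed by a HashMap would take over; the join itself stays in the query.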

Continue reading

Remote IP bug in xl2tpd 1.2.6 on Ubuntu 10.10

I tried to set up an L2TP VPN on Ubuntu 10.10 using xl2tpd. I first installed xl2tpd from the repository with apt-get install xl2tpd, which gave me version 1.2.6.

I set the ip range, but when I tried to connect to the VPN server, the remote IP was always 0.0.0.0 (I checked /var/log/syslog). After searching for a while, I found that this is actually a bug in xl2tpd, and that it is fixed in later versions.

Continue reading

To the Moon: a game that moved me in a way games have not for a long time

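To the Moon is the kind of game you would scoff at on first sight; its simple, somewhat rough graphics make you feel you have gone back at least ten years. Yet this 70MB game with 16-bit graphics was named best-story indie game of 2011 by GameSpot. I figured that a game so plain on the outside must have something exceptional to earn such an honor, and the reviews also said its soundtrack is especially good, so on impulse I went to the official site and paid 12 dollars for a legitimate copy to support original indie games.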

Continue reading

Use Hadoop DistributedCache to cache files in MapReduce

DistributedCache is a very useful Hadoop feature that enables you to pass resource files to each mapper or reducer.

For example, suppose you have a file stopWordList.txt that contains all the stop words you want to exclude from a word count. In your reducer, you want to check each value passed by the mapper: if the value appears in the stop word list, you skip it and go on to the next value.
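
The filtering logic itself can be sketched in plain Python; a Hadoop job would do the equivalent in Java after reading the file that DistributedCache makes available on each node. The function names and sample words are hypothetical, and the in-memory list stands in for the contents of stopWordList.txt.

```python
from collections import Counter


def load_stop_words(lines):
    """Build a set of stop words from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def count_words(words, stop_words):
    """Count words case-insensitively, skipping any that appear in the stop word set."""
    counts = Counter()
    for word in words:
        key = word.lower()
        if key in stop_words:
            continue  # stop word: skip it and move on to the next value
        counts[key] += 1
    return counts


# Stand-in for the contents of stopWordList.txt (blank lines are ignored).
stop_words = load_stop_words(["the", "a", "of", ""])
counts = count_words("The cat sat on the mat of a cat".split(), stop_words)
print(counts)
```

Keeping the stop words in a set makes each membership check a constant-time lookup, which matters when the check runs once per value.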

In order to use DistributedCache, first you need to set the file in the job configuration driver:

Continue reading

WordCount MapReduce example using Hive locally and on EMR

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.

In short, Hive lets you run a Hadoop MapReduce job using SQL-like statements.

Here is a WordCount example I did using Hive. It first shows how to run it on your local machine, then how to run it on Amazon EMR.

Local

1. Install Hive.

First you need to install Hadoop on your local machine; here is a post on how to do it. After you have installed Hadoop, you can use this official tutorial.

Continue reading