Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
In short, you can run a Hadoop MapReduce using SQL-like statements with Hive.
Here is a WordCount example I did using Hive. The example first shows how to do it on your local machine; then I will show how to do it using Amazon EMR.
Local
1. Install Hive.
First you need to install Hadoop on your local machine; here is a post on how to do it. After you have installed Hadoop, you can follow this official tutorial.
2. (This step may not be needed.) If you hit an error saying the IP address cannot be accessed, go to your Hadoop folder, edit conf/core-site.xml, and change fs.default.name from the IP address to your hostname (for me it’s http://localhost.localdomain:9000).
3. Write the mapper & reducer for our WordCount example. Here I use Python; you can use any scripting language you like.
Mapper: (word_count_mapper.py)
#!/usr/bin/python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    # write the tuples to stdout
    for word in words:
        print '%s\t%s' % (word, "1")
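Before wiring it into Hive, you can quickly check the mapper on its own by piping a sample line through it. Below is a small test sketch of mine (not part of the job itself); it assumes word_count_mapper.py sits in the current directory.
#!/usr/bin/python
# Quick local check of the mapper (a test sketch, not part of the job):
# feed it one sample line on stdin and print the (word, 1) pairs it emits.
import subprocess

mapper = subprocess.Popen(["python", "word_count_mapper.py"],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = mapper.communicate("hello world hello hive\n")
print out
# Expected: each word on its own line, followed by a tab and "1"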
Reducer: (word_count_reducer.py)
#!/usr/bin/python
import sys

# maps words to their counts
word2count = {}

for line in sys.stdin:
    line = line.strip()
    # parse the input we got from the mapper
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word] + count
    except KeyError:
        word2count[word] = count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])
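The reducer can be checked the same way. In a real job, Hadoop sorts the mapper output by key before the reduce phase (that is what CLUSTER BY word in the Hive script below ensures), so the reducer sees "word<TAB>1" lines grouped by word; this particular reducer accumulates counts in a dictionary, so unsorted input would also work. Here is a small test sketch of mine; it assumes word_count_reducer.py is in the current directory.
#!/usr/bin/python
# Quick local check of the reducer (a test sketch, not part of the job):
# feed it "word<TAB>1" lines as they would arrive after the shuffle.
import subprocess

shuffled = "hello\t1\nhello\t1\nhive\t1\nworld\t1\n"
reducer = subprocess.Popen(["python", "word_count_reducer.py"],
                           stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = reducer.communicate(shuffled)
print out
# Expected counts: hello 2, hive 1, world 1 (order not guaranteed)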
4. Write the Hive script word_count.hql. Note: you can also run the following statements line by line in the Hive console.
drop table if exists raw_lines;
-- create table raw_lines and read all the lines in '/user/inputs', which is a path on your local HDFS
create external table if not exists raw_lines(line string)
ROW FORMAT DELIMITED
stored as textfile
location '/user/inputs';
drop table if exists word_count;
-- create table word_count, the output table, which will be written as a text file to '/user/outputs' on your local HDFS
create external table if not exists word_count(word string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
lines terminated by '\n' STORED AS TEXTFILE LOCATION '/user/outputs/';
-- add the mapper & reducer scripts as resources; please change your/local/path accordingly
add file your/local/path/word_count_mapper.py;
add file your/local/path/word_count_reducer.py;
from (
from raw_lines
map raw_lines.line
--call the mapper here
using 'word_count_mapper.py'
as word, count
cluster by word) map_output
insert overwrite table word_count
reduce map_output.word, map_output.count
--call the reducer here
using 'word_count_reducer.py'
as word,count;
5. Put some text files into '/user/inputs/' on HDFS using the Hadoop command line (hadoop dfs -copyFromLocal source destination).
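For example, a small sketch of mine that creates a sample file and copies it up (sample.txt is a made-up name, and it assumes the hadoop command is on your PATH):
#!/usr/bin/python
# Create a tiny sample input file and copy it to /user/inputs/ on HDFS.
# (A sketch; sample.txt is a made-up name, hadoop must be on PATH.)
import subprocess

with open("sample.txt", "w") as f:
    f.write("hello world\nhello hive\n")

# create the input directory; ignore the error if it already exists
subprocess.call(["hadoop", "dfs", "-mkdir", "/user/inputs"])
subprocess.check_call(["hadoop", "dfs", "-copyFromLocal", "sample.txt", "/user/inputs/"])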
6. Run your script!
hive -f word_count.hql
The script creates the two tables, reads the input data into the raw_lines table, and adds the mapper & reducer scripts as resources; it then runs the MapReduce job and stores the results in the word_count table, whose text file you can find in ‘/user/outputs’.
In case you hit the safe mode error, you can leave safe mode manually:
hadoop dfsadmin -safemode leave
IMPORTANT:
In your script files, PLEASE do not forget to add “#!/usr/bin/python” as the first line. I forgot to add it and got the following error, which cost me half an hour to figure out…
Starting Job = job_201206131927_0006, Tracking URL = http://domU-12-31-39-03-BD-57.compute-1.internal:9100/jobdetails.jsp?jobid=job_201206131927_0006
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.249.190.165:9001 -kill job_201206131927_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-06-13 20:56:15,119 Stage-1 map = 0%, reduce = 0%
2012-06-13 20:57:10,489 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206131927_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201206131927_0006_m_000002 (and more) from job job_201206131927_0006
Exception in thread "Thread-120" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://10.248.42.34:9103/tasklog?taskid=attempt_201206131927_0006_m_000000_2&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
... 3 more
Counters:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Amazon EMR
Running a Hive script on EMR is actually very simple. I will use screenshots to show how I did it.
Here is the code I modified for EMR:
create external table if not exists raw_lines(line string)
ROW FORMAT DELIMITED
stored as TEXTFILE
LOCATION '${INPUT}';
create external table if not exists word_count(word string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
lines terminated by '\n'
STORED AS TEXTFILE
LOCATION '${OUTPUT}';
from (
from raw_lines
map raw_lines.line
using '${SCRIPT}/word_count_mapper.py'
as word, count
cluster by word) map_output
insert overwrite table word_count
reduce map_output.word, map_output.count
using '${SCRIPT}/word_count_reducer.py'
as word,count;
Note that in the script I use the INPUT, OUTPUT, and SCRIPT variables: INPUT and OUTPUT are set by EMR automatically in step (2) below, and SCRIPT is set by me in the Extra args.
All files are stored in S3.
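Just to make the ${INPUT}/${OUTPUT}/${SCRIPT} substitution concrete: Hive's -d (--define) flag is what defines these variables when the script runs, which is essentially what EMR does with the locations you configure (on EMR you only add the SCRIPT definition yourself in Extra args). Below is a rough sketch of mine of an equivalent invocation from Python; the bucket and file names are made up.
#!/usr/bin/python
# Rough sketch of the kind of command that runs the parameterized script:
# each -d defines a variable that ${INPUT}, ${OUTPUT} or ${SCRIPT} expands to.
# Bucket and file names below are made up.
import subprocess

subprocess.check_call([
    "hive", "-f", "word_count_emr.hql",
    "-d", "INPUT=s3://yourbucket/inputs",
    "-d", "OUTPUT=s3://yourbucket/outputs",
    "-d", "SCRIPT=s3://yourbucket/scripts",
])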
2. Set the Hive script path and arguments

5. Run the Job!


