We all know that Hive read/write data in folder level, the limit here is that by default it will only read/write the files from/to the folder specified. But sometimes, our input data are organized by using subfolders, then Hive cannot read them if you only specify the root folder; or you want to output to separate folders instead of putting all the output data in the same folder.
For example, we have sales data dumped to hdfs(or s3), and their path structure is like sales/city=BEIJING/day=20140401/data.tsv , as you can see, the data is partitioned by city and day, although we can copy all the data.tsv to the same folder, we need to do the copy and change the filename to avoid conflict, it will be a pain if the files are a lot and huge. On the other hand, even if we do copy all the data.tsv to the same folder, when output, we want to separate the output to different folders by city and day, how to do that?
Can hive be smart enough to read all the subfolder’s data and output to separate folders? The answer is Yes.
I was using custom jar for my mapreduce job in the past few years, and because it’s pure java programming, I have a lot of flexibility. But writing java results in a lot of code to maintain, and most of the mapreduce jobs are just joining with a little spice in it, so moving to Hive may be a better path.
The mapreduce job I face here is to left outer join two different datasets using the same keys, because it’s a outer join, there will be null values, and for these null values, I want to lookup the default values to assign from a map.
For example, I have two datasets:
dataset 1: KEY_ID CITY SIZE_TYPE
dataset 2: KEY_ID POPULATION
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
In short, you can run a Hadoop MapReduce using SQL-like statements with Hive.
Here is an WordCount example I did using Hive. The example first shows how to do it on your Local machine, then I will show how to do it using Amazon EMR.
1. Install Hive.
First you need to install Hadoop on your local, here is a post for how to do it. After you installed Hadoop, you can use this official tutorial.