{"id":635,"date":"2014-04-17T17:08:04","date_gmt":"2014-04-17T21:08:04","guid":{"rendered":"http:\/\/www.lichun.cc\/blog\/?p=635"},"modified":"2014-04-17T17:09:40","modified_gmt":"2014-04-17T21:09:40","slug":"use-hive-partition-to-read-write-with-subfolders","status":"publish","type":"post","link":"https:\/\/www.lichun.cc\/blog\/2014\/04\/use-hive-partition-to-read-write-with-subfolders\/","title":{"rendered":"Use Hive Partitions to Read\/Write with Subfolders"},"content":{"rendered":"<p>Hive reads and writes data at the folder level; the limitation is that, by default, it only reads\/writes files directly in the folder you specify. Sometimes, however, the input data is organized into subfolders, and Hive cannot read it if you only specify the root folder; or you may want the output split into separate folders instead of having all the output files land in the same one.<\/p>\n<p>For example, suppose we have <b>sales<\/b> data dumped to HDFS (or S3) with a path structure like <b>sales\/city=BEIJING\/day=20140401\/data.tsv<\/b>. As you can see, the data is partitioned by <b>city<\/b> and <b>day<\/b>. We could copy every <b>data.tsv<\/b> into a single folder, but we would have to rename each file along the way to avoid conflicts, which becomes painful when the files are numerous and huge. And even if we did merge everything into one folder, at output time we would still want the results written to different folders by <b>city<\/b> and <b>day<\/b>. How do we do that?<\/p>\n<p>Can Hive be smart enough to read all the subfolders&#8217; data and write its output to separate folders? The answer is <b>Yes<\/b>.<\/p>\n<p><!--more--><\/p>\n<p>Let&#8217;s walk through an example.<\/p>\n<h1>Sample Data:<\/h1>\n<p>In the following example, we answer this question: 
Given the SALES data and the PRODUCT_COSTS data, what is the TOTAL PROFIT of each product in each city on each day?<\/p>\n<h2>SALES data<\/h2>\n<p>Format: ID SALE_CITY PRODUCT_NAME QUANTITY SALE_PRICE SNAPSHOT_DAY<\/p>\n<h2>PRODUCT_COSTS data<\/h2>\n<p>Format: ID PRODUCT_NAME PRODUCT_COST SNAPSHOT_DAY<\/p>\n<h2>PROFIT OUTPUT<\/h2>\n<p><b>The profit of each record is: (SALE_PRICE &#8211; PRODUCT_COST) * QUANTITY<\/b><br \/>\nFormat: PRODUCT_NAME QUANTITY TOTAL_PROFIT<\/p>\n<h1>1. Hive configuration<\/h1>\n<p>First, we need to enable Hive&#8217;s <b>dynamic partition<\/b> functionality.<br \/>\nPut the following two lines at the top of your Hive script:<\/p>\n<pre>set hive.exec.dynamic.partition=true;\r\nset hive.exec.dynamic.partition.mode=nonstrict;\r\n<\/pre>\n<h1>2. Input<\/h1>\n<p>For the input, we specify only the root folder as the <b>LOCATION<\/b>, and when we create the table, we declare the <b>partitions<\/b>.<\/p>\n<p>Here is the code:<\/p>\n<pre>DROP TABLE IF EXISTS SALES;\r\nCREATE EXTERNAL TABLE SALES (\r\n    ID                   BIGINT,\r\n    SALE_CITY            STRING,\r\n    PRODUCT_NAME         STRING,\r\n    QUANTITY             INT,\r\n    SALE_PRICE           DOUBLE,\r\n    SNAPSHOT_DAY         STRING\r\n)\r\n<b>PARTITIONED BY (CITY STRING, DAY STRING)<\/b>\r\nROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'\r\nSTORED AS TEXTFILE LOCATION \"\/input\/sales\/\";\r\n<\/pre>\n<p>There are two ways to let Hive actually discover all the partitions under the root folder.<\/p>\n<h2>(a) Manually add all the partitions<\/h2>\n<p>Unfortunately, Hive is not smart enough to recognize all the partitions automatically; we have to add them manually.<\/p>\n<pre>ALTER TABLE SALES ADD PARTITION (city='BEIJING', day='20140401');\r\nALTER TABLE SALES ADD PARTITION (city='BEIJING', day='20140402');\r\nALTER TABLE SALES ADD PARTITION (city='TAIYUAN', day='20140401');\r\nALTER TABLE SALES ADD PARTITION 
(city='TAIYUAN', day='20140402');\r\n<\/pre>\n<p>This gets tedious when there are many more partitions, but with unmodified Hive there is no better way.<\/p>\n<h2>(b) Use AWS EMR to run your Hive script<\/h2>\n<p>AWS EMR&#8217;s modified Hive has a very convenient command that lets Hive recognize all the partitions automatically. Just add the following line after the table creation:<\/p>\n<pre>ALTER TABLE SALES RECOVER PARTITIONS;\r\n<\/pre>\n<p>Hive will then load all the data under the given root folder that follows the <b>partition naming rule (the xxx=yyy pattern)<\/b>.<\/p>\n<p>Here is the code for the <b>PRODUCT_COST<\/b> table; it has no partitions:<\/p>\n<pre>DROP TABLE IF EXISTS PRODUCT_COST;\r\nCREATE EXTERNAL TABLE PRODUCT_COST (\r\n    ID                   BIGINT,\r\n    PRODUCT_NAME         STRING,\r\n    PRODUCT_COST         DOUBLE,\r\n    SNAPSHOT_DAY         STRING\r\n)\r\nROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'\r\nSTORED AS TEXTFILE LOCATION \"\/input\/product_cost\/\";\r\n<\/pre>\n<h1>3. Output<\/h1>\n<p>For the output table, we also need to declare the <b>partitions<\/b> when we create it.<\/p>\n<pre>DROP TABLE IF EXISTS PROFIT;\r\nCREATE EXTERNAL TABLE PROFIT (\r\n    PRODUCT_NAME         STRING,\r\n    QUANTITY             INT,\r\n    TOTAL_PROFIT         DOUBLE\r\n)\r\n<b>PARTITIONED BY (CITY STRING, DAY STRING)<\/b>\r\nROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'\r\nSTORED AS TEXTFILE LOCATION \"\/output\/\";\r\n<\/pre>\n<p>When inserting data into the output table, <b>(1) we need to tell Hive that we are writing to partitions, and (2) we need to append the partition values to the end of each output row.<\/b><\/p>\n<pre>INSERT OVERWRITE TABLE PROFIT <b>PARTITION (CITY, DAY)<\/b>\r\nSELECT\r\n    s.PRODUCT_NAME,\r\n    SUM(s.QUANTITY) QUANTITY,\r\n    SUM( (s.SALE_PRICE - p.PRODUCT_COST) * s.QUANTITY ) TOTAL_PROFIT,\r\n    <b>s.SALE_CITY,<\/b>\r\n    
<b>s.SNAPSHOT_DAY<\/b>\r\nFROM\r\nSALES s\r\nJOIN\r\nPRODUCT_COST p\r\nON (s.SNAPSHOT_DAY = p.SNAPSHOT_DAY AND s.PRODUCT_NAME = p.PRODUCT_NAME)\r\nGROUP BY s.SNAPSHOT_DAY, s.PRODUCT_NAME, s.SALE_CITY\r\n;\r\n<\/pre>\n<p>For (1), we added <b>PARTITION (CITY, DAY)<\/b> after the <b>INSERT OVERWRITE TABLE xxx<\/b> statement.<br \/>\nFor (2), we appended <b>SALE_CITY<\/b> and <b>SNAPSHOT_DAY<\/b> to the end of each record, in the order the partitions are declared.<\/p>\n<p>With this in place, Hive will split the data by partition and write it to separate folders using the <b>partition naming rule (the xxx=yyy pattern)<\/b>. Here it writes to <b>\/output\/city=BEIJING\/day=20140401 &#8230;<\/b><\/p>\n<p><a href=\"http:\/\/www.lichun.cc\/blog\/wp-content\/uploads\/2014\/04\/hive_partition.tar.gz\">Here<\/a> is the sample code; I tested it both on my local environment and on an AWS EMR cluster, and it produces the correct output.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hive reads and writes data at the folder level; the limitation is that, by default, it only reads\/writes files from\/to the folder specified. 
But sometimes, our input data are organized by using subfolders, then Hive cannot read them if you only specify the root folder; or you want to output to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[19],"tags":[16,54,83],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p2s9sh-af","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/635"}],"collection":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/comments?post=635"}],"version-history":[{"count":19,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/635\/revisions"}],"predecessor-version":[{"id":655,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/posts\/635\/revisions\/655"}],"wp:attachment":[{"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/media?parent=635"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/categories?post=635"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lichun.cc\/blog\/wp-json\/wp\/v2\/tags?post=635"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}