Hive : Append data in a Hive table

 

 

Hi,

 

In this blog I will explain how we can update a Hive table on a daily basis.

Problem:-

 

We have a table in which the validity_starttime changes from time to time (not necessarily every day). We need a solution so that, when updated data arrives, the new records are appended to the table and the existing records whose validity_starttime has changed are replaced with the latest values.

 

Let's start.

First, create the main table:

 

 

create table employee(customer_id string,customer_name string,
customer_account_balance string,
validity_starttime string,
validity_endtime string)
row format delimited
fields terminated by ',';

 

 

Next, create the incremental table for the daily update (the table into which the new daily data arrives):

create table employeeinc(customer_id string,customer_name string,
customer_account_balance string,
validity_starttime string,
validity_endtime string)
row format delimited
fields terminated by ',';
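To confirm that both tables were created with the expected columns, you can optionally describe them (standard Hive commands; output omitted here):

describe employee;
describe employeeinc;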

 

 

Output :- hive> create table employee(customer_id string,customer_name string,

> customer_account_balance string,

> validity_starttime string,

> validity_endtime string)

> row format delimited

> fields terminated by ",";

OK

Time taken: 0.7 seconds

hive> create table employeeinc(customer_id string,customer_name string,

> customer_account_balance string,

> validity_starttime string,

> validity_endtime string)

> row format delimited

> fields terminated by ",";

OK

Time taken: 0.423 seconds

 

 

Insert data :-

 

Let's take the following sample data for the two tables :-

 

employee (cust.txt) :-

c01,harit,1000,2016/01/26,null

c02,fahed,100,2016/01/26,null

c03,raj,2000,2016/01/26,null

c04,vikas,3000,2016/01/26,null

c05,hardy,8000,2016/01/26,null

 

employeeinc (custinc.txt) :-

c01,harit,1000,2016/01/27,null

c02,fahed,100,2016/01/28,null

c06,nitish,5000,2016/01/28,null

c08,sandy,1200.,2016/01/28,null

c09,deeak,3000,,2016/01/28,null

 

 

 

hive> load data local inpath "/home/cloudera/Desktop/cust.txt" overwrite into table employee;

Loading data to table emp.employee

Table emp.employee stats: [numFiles=1, numRows=0, totalSize=152, rawDataSize=0]

OK

Time taken: 2.826 seconds

hive> load data local inpath "/home/cloudera/Desktop/custinc.txt" overwrite into table employeeinc;

Loading data to table emp.employeeinc

Table emp.employeeinc stats: [numFiles=1, numRows=0, totalSize=157, rawDataSize=0]

OK

Time taken: 1.75 seconds

hive> select * from employee;

OK

c01         harit       1000       2016/01/26         null

c02         fahed    100         2016/01/26         null

c03         raj           2000       2016/01/26         null

c04         vikas      3000       2016/01/26         null

c05         hardy    8000       2016/01/26         null

Time taken: 1.576 seconds, Fetched: 5 row(s)

hive> select * from employeeinc;

OK

c01         harit       1000       2016/01/27         null

c02         fahed    100         2016/01/28         null

c06         nitish     5000       2016/01/28         null

c08         sandy    1200.     2016/01/28         null

c09         deeak   3000                       2016/01/28

Time taken: 0.292 seconds, Fetched: 5 row(s)

 

 

Now create a view that consolidates the data of the two tables according to our requirement :-

 

i) Union the two tables above.

ii) Select the maximum validity_starttime per customer, so that we keep only the most recently updated data.

iii) Join (i) and (ii) on the common field customer_id; the resulting view definition is shown below.
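For easy copy-paste, here is the same view definition without the hive> prompts (identical to the statement run in the session below):

create view custnew as
select t1.* from
(select * from employee
union all
select * from employeeinc
) t1
join
(select customer_id, max(validity_starttime) newvalidity_starttime
from
(select * from employee union all select * from employeeinc) t2
group by customer_id) s
on t1.customer_id = s.customer_id and t1.validity_starttime = s.newvalidity_starttime;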

 

 

 

 

hive> create view custnew as

> select t1.* from

> (select * from employee

> union all

> select * from employeeinc

> )t1

> join

> (select customer_id ,max(validity_starttime) newvalidity_starttime

> from

> (select * from employee union all select * from employeeinc )t2

> GROUP BY customer_id)s

> on t1.customer_id =s.customer_id  and t1.validity_starttime=s.newvalidity_starttime

> ;

OK

Time taken: 0.721 seconds

hive> select * from custnew;

Query ID = cloudera_20160126005959_4d0903ad-996b-4083-bcd5-6648175af588

Total jobs = 2

Launching Job 1 out of 2

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1453707805003_0017, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1453707805003_0017/

Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1453707805003_0017

Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 1

2016-01-26 01:00:01,669 Stage-3 map = 0%,  reduce = 0%

2016-01-26 01:01:02,246 Stage-3 map = 0%,  reduce = 0%

2016-01-26 01:01:19,747 Stage-3 map = 50%,  reduce = 0%, Cumulative CPU 7.05 sec

2016-01-26 01:01:30,063 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 16.68 sec

2016-01-26 01:02:05,779 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 21.56 sec

MapReduce Total cumulative CPU time: 21 seconds 560 msec

Ended Job = job_1453707805003_0017

Execution log at: /tmp/cloudera/cloudera_20160126005959_4d0903ad-996b-4083-bcd5-6648175af588.log

2016-01-26 01:02:27        Starting to launch local task to process map join;               maximum memory = 1013645312

2016-01-26 01:02:34        Dump the side-table for tag: 0 with group count: 10 into file: file:/tmp/cloudera/5dd2b621-8e59-429d-aa16-b062fbd11584/hive_2016-01-26_00-59-07_574_5762593899502030133-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile00–.hashtable

2016-01-26 01:02:34        Uploaded 1 File to: file:/tmp/cloudera/5dd2b621-8e59-429d-aa16-b062fbd11584/hive_2016-01-26_00-59-07_574_5762593899502030133-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile00–.hashtable (747 bytes)

2016-01-26 01:02:34        End of local task; Time Taken: 6.793 sec.

Execution completed successfully

MapredLocal task succeeded

Launching Job 2 out of 2

Number of reduce tasks is set to 0 since there’s no reduce operator

Starting Job = job_1453707805003_0018, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1453707805003_0018/

Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1453707805003_0018

Hadoop job information for Stage-5: number of mappers: 1; number of reducers: 0

2016-01-26 01:03:09,321 Stage-5 map = 0%,  reduce = 0%

2016-01-26 01:03:31,960 Stage-5 map = 100%,  reduce = 0%, Cumulative CPU 2.94 sec

MapReduce Total cumulative CPU time: 2 seconds 940 msec

Ended Job = job_1453707805003_0018

MapReduce Jobs Launched:

Stage-Stage-3: Map: 2  Reduce: 1   Cumulative CPU: 21.56 sec   HDFS Read: 17805 HDFS Write: 342 SUCCESS

Stage-Stage-5: Map: 1   Cumulative CPU: 2.94 sec   HDFS Read: 7296 HDFS Write: 243 SUCCESS

Total MapReduce CPU Time Spent: 24 seconds 500 msec

OK

c01         harit       1000       2016/01/27         null

c02         fahed    100         2016/01/28         null

c03         raj           2000       2016/01/26         null

c04         vikas      3000       2016/01/26         null

c05         hardy    8000       2016/01/26         null

c06         nitish     5000       2016/01/28         null

c08         sandy    1200.     2016/01/28         null

c09         deeak   3000                       2016/01/28

Time taken: 267.096 seconds, Fetched: 8 row(s)

hive> select * from employee;

OK

c01         harit       1000       2016/01/26         null

c02         fahed    100         2016/01/26         null

c03         raj           2000       2016/01/26         null

c04         vikas      3000       2016/01/26         null

c05         hardy    8000       2016/01/26         null

Time taken: 0.651 seconds, Fetched: 5 row(s)

hive> select * from employeeinc;

OK

c01         harit       1000       2016/01/27         null

c02         fahed    100         2016/01/28         null

c06         nitish     5000       2016/01/28         null

c08         sandy    1200.     2016/01/28         null

c09         deeak   3000                       2016/01/28

Time taken: 0.227 seconds, Fetched: 5 row(s)

 

 

 

 

Now,

 

We need to insert this view's data back into our employee table so that we get the output we want:

hive> insert overwrite table employee select * from custnew;

Query ID = cloudera_20160126011212_d50b5dbd-87f8-43c8-bacf-bffe80cc2c71

Total jobs = 2

Launching Job 1 out of 2

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1453707805003_0019, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1453707805003_0019/

Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1453707805003_0019

Hadoop job information for Stage-4: number of mappers: 2; number of reducers: 1

2016-01-26 01:13:25,005 Stage-4 map = 0%,  reduce = 0%

2016-01-26 01:14:25,151 Stage-4 map = 0%,  reduce = 0%

2016-01-26 01:14:40,245 Stage-4 map = 50%,  reduce = 0%, Cumulative CPU 11.89 sec

2016-01-26 01:14:41,384 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 14.91 sec

2016-01-26 01:15:15,618 Stage-4 map = 100%,  reduce = 67%, Cumulative CPU 19.13 sec

2016-01-26 01:15:20,786 Stage-4 map = 100%,  reduce = 100%, Cumulative CPU 22.51 sec

MapReduce Total cumulative CPU time: 22 seconds 510 msec

Ended Job = job_1453707805003_0019

Execution log at: /tmp/cloudera/cloudera_20160126011212_d50b5dbd-87f8-43c8-bacf-bffe80cc2c71.log

2016-01-26 01:15:49        Starting to launch local task to process map join;               maximum memory = 1013645312

2016-01-26 01:15:56        Dump the side-table for tag: 0 with group count: 10 into file: file:/tmp/cloudera/5dd2b621-8e59-429d-aa16-b062fbd11584/hive_2016-01-26_01-12-37_392_2818535913005964035-1/-local-10003/HashTable-Stage-6/MapJoin-mapfile10–.hashtable

2016-01-26 01:15:56        Uploaded 1 File to: file:/tmp/cloudera/5dd2b621-8e59-429d-aa16-b062fbd11584/hive_2016-01-26_01-12-37_392_2818535913005964035-1/-local-10003/HashTable-Stage-6/MapJoin-mapfile10–.hashtable (747 bytes)

2016-01-26 01:15:56        End of local task; Time Taken: 7.523 sec.

Execution completed successfully

MapredLocal task succeeded

Launching Job 2 out of 2

Number of reduce tasks is set to 0 since there’s no reduce operator

Starting Job = job_1453707805003_0020, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1453707805003_0020/

Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1453707805003_0020

Hadoop job information for Stage-6: number of mappers: 1; number of reducers: 0

2016-01-26 01:16:37,056 Stage-6 map = 0%,  reduce = 0%

2016-01-26 01:16:59,538 Stage-6 map = 100%,  reduce = 0%, Cumulative CPU 2.97 sec

MapReduce Total cumulative CPU time: 2 seconds 970 msec

Ended Job = job_1453707805003_0020

Loading data to table emp.employee

Table emp.employee stats: [numFiles=1, numRows=8, totalSize=243, rawDataSize=235]

MapReduce Jobs Launched:

Stage-Stage-4: Map: 2  Reduce: 1   Cumulative CPU: 22.51 sec   HDFS Read: 17883 HDFS Write: 342 SUCCESS

Stage-Stage-6: Map: 1   Cumulative CPU: 2.97 sec   HDFS Read: 7717 HDFS Write: 312 SUCCESS

Total MapReduce CPU Time Spent: 25 seconds 480 msec

OK

Time taken: 266.197 seconds

hive> select * from employee;

OK

c01         harit       1000       2016/01/27         null

c02         fahed    100         2016/01/28         null

c03         raj           2000       2016/01/26         null

c04         vikas      3000       2016/01/26         null

c05         hardy    8000       2016/01/26         null

c06         nitish     5000       2016/01/28         null

c08         sandy    1200.     2016/01/28         null

c09         deeak   3000                       2016/01/28

Time taken: 0.288 seconds, Fetched: 8 row(s)
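To repeat this refresh on a daily basis, the two recurring steps (reload the incremental table, then overwrite the main table from the view) can be kept in a small HiveQL script and run with hive -f. A minimal sketch, assuming a hypothetical file name refresh_employee.hql and the same file path as above:

-- refresh_employee.hql (hypothetical file name)
-- step 1: reload the incremental table with today's file
load data local inpath "/home/cloudera/Desktop/custinc.txt" overwrite into table employeeinc;
-- step 2: rebuild the main table from the consolidating view
insert overwrite table employee select * from custnew;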

 

Note :- By mistake, I created the tables as employee and employeeinc instead of customer and customerinc.
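If you want to fix the naming afterwards, the tables can simply be renamed with standard Hive DDL (a small sketch; the target names follow the note above). Note that the custnew view would then need to be recreated against the new table names:

alter table employee rename to customer;
alter table employeeinc rename to customerinc;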

Algo :- union the main and incremental tables, keep the latest validity_starttime per customer_id via a group-by join, and insert overwrite the result back into the main table.

 
