I am facing a strange issue. Most sources I have found say that when lastmodified is used, the old and new files are merged to remove duplicates. However, that is not happening in my case.
I used:
sqoop import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root -password password --table LoginRoles --hive-import --create-hive-table --hive-table LoginRoles --hive-delims-replacement " "
The table was created and the data was loaded correctly under /user/hive/warehouse:
LoginRoleId LoginRole CreatedDate ModifiedDate
1 admin1 2013-09-30 14:21:28 2013-09-30 16:03:39
2 admin2 2013-09-30 14:36:23 2013-09-30 15:53:19
3 admin3 2013-09-30 14:39:13 2013-09-30 14:39:13
4 admin5 2013-09-30 14:40:55 2013-09-30 14:40:55
- Next I ran the query below, which updated ModifiedDate to '2013-09-30 17:03:44':
update loginroles set ModifiedDate=now(),loginrole="admin4" where LoginRoleID=4;
- I then ran the job with
sqoop job --exec mymodified
where the job had been created as:
sqoop job --create mymodified -- import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root -password password --table LoginRoles --hive-import --hive-table LoginRoles --hive-delims-replacement " " --check-column ModifiedDate --incremental lastmodified --last-value '2013-09-30 16:03:39'
I now see a total of 5 rows in Hive, with LoginRoleId 4 appearing twice:
1 admin1 2013-09-30 14:21:28.0 2013-09-30 16:03:39.0
4 admin4 2013-09-30 14:40:55.0 2013-09-30 17:03:44.0
2 admin2 2013-09-30 14:36:23.0 2013-09-30 15:53:19.0
3 admin3 2013-09-30 14:39:13.0 2013-09-30 14:39:13.0
4 admin5 2013-09-30 14:40:55.0 2013-09-30 14:40:55.0
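To make the expectation concrete, here is a minimal Python sketch of the merge I was expecting. It is only an illustration of the idea, not how Sqoop implements it: the function name merge_by_key is made up, I am assuming LoginRoleId is the merge key, and I am assuming the incremental run delivers only the updated row.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def merge_by_key(old_rows, new_rows, key=0, ts=3):
    """Keep one row per key, preferring the row with the latest timestamp.

    Hypothetical helper for illustration; 'key' and 'ts' are column indexes.
    """
    merged = {}
    for row in list(old_rows) + list(new_rows):
        current = merged.get(row[key])
        # A later (or equal) ModifiedDate replaces the row already held.
        if current is None or (
            datetime.strptime(row[ts], FMT) >= datetime.strptime(current[ts], FMT)
        ):
            merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]

# Rows from the initial import (as listed above).
old_rows = [
    (1, "admin1", "2013-09-30 14:21:28", "2013-09-30 16:03:39"),
    (2, "admin2", "2013-09-30 14:36:23", "2013-09-30 15:53:19"),
    (3, "admin3", "2013-09-30 14:39:13", "2013-09-30 14:39:13"),
    (4, "admin5", "2013-09-30 14:40:55", "2013-09-30 14:40:55"),
]

# Assumption: the incremental run brings in only the updated row.
new_rows = [
    (4, "admin4", "2013-09-30 14:40:55", "2013-09-30 17:03:44"),
]

merged = merge_by_key(old_rows, new_rows)
for row in merged:
    print(*row)
```

With this kind of merge I would end up with 4 rows, and LoginRoleId 4 would carry only the admin4 values, instead of the 5 rows shown above.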
I am sure I am missing something important and subtle.
Sqoop version details:
Sqoop 1.4.3-cdh4.3.0
git commit id 7a52f9aa97cba43aae8b700f7e93f97dcdb0b21a
Compiled by jenkins on Mon May 27 20:33:21 PDT 2013