sql - Sqoop incremental import using a count difference

Question

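I want to use Sqoop to import all new rows of my table into a Hive table. The problem is that I have no column I can use for incremental updates.

So I tried counting all the rows of the table and storing that count in Hive together with a timestamp column.
Then I select the maximum of that number and compare it with the row count of my source table.

My question is: how can I use Sqoop to import the difference between my Hive table and the source table?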

score 0 · Accepted Answer

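The idea is to left join the two datasets on some or all columns and then keep the rows where the right side is null, so that only the new records to load remain.
You can follow these steps: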

1) The initial load (the previous day's data) is already in HDFS: relation A.
2) Import the current data into HDFS using Sqoop: relation B.
3) Use Pig to load the two HDFS directories into relations A and B, defining the schema for each.
4) Convert the rows to tuples and left outer join B with A on all columns.
5) Each row of the join result holds two tuples, (B, A). Keep only the rows where tuple A is null, (B, ), since those rows exist in the current import but not in the previous load.
6) Flatten what remains by tuple B to get the new or updated records.
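
As a rough illustration of the join-and-filter logic in steps 4 to 6, here is a minimal Python sketch that works on in-memory rows. It is not Pig and does not call Sqoop. The function name `new_rows` and the sample rows are hypothetical, and it assumes every row can be represented as a tuple of hashable column values.

```python
def new_rows(previous, current):
    """Left anti-join on all columns.

    previous: rows from the earlier load (relation A), as tuples.
    current:  rows from the fresh import (relation B), as tuples.
    Returns the rows of `current` that have no exact match in `previous`.
    """
    # Index the previous load so each lookup is a set membership test.
    seen = set(previous)
    # A row with no match is what the outer join would pair with a null A side.
    return [row for row in current if row not in seen]


# Hypothetical sample data: (id, name, amount)
relation_a = [(1, "alice", 10), (2, "bob", 20)]
relation_b = [(1, "alice", 10), (2, "bob", 25), (3, "carol", 30)]

delta = new_rows(relation_a, relation_b)
print(delta)
```

Because the comparison covers all columns, a changed row such as `(2, "bob", 25)` is reported alongside a brand new row, which matches the "new/updated records" wording in step 6. Exact duplicate rows are not distinguished by this sketch: a row that already exists in the previous load is never reported again, however many times it appears in the current import.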
