hadoop - How to incrementally update a table

Question

We are using Hive and have a data flow that looks like this:

 SOURCE -> Flume -> S3 Buckets -> Script -> Hive Table

We have a table that looks like this, truncated for brevity:

 CREATE TABLE core_table (
       unique_id string,
       `update` bigint,   -- quoted because UPDATE is a reserved keyword in recent Hive versions
       other_data string
 )

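We now also have an update table, core_update, with the same structure. This table may contain duplicate data (for example, a repeated unique_id with a larger value in the bigint column; such rows also appear later in the file).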
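
To make the duplicate rule concrete, here is a minimal sketch of picking one row per unique_id from core_update. It assumes that "latest" means the row with the largest `update` value, which is my reading of the description above rather than something the tables enforce. The alias `ranked` and the column `rn` are made up for illustration.

```sql
-- Sketch only: keep one row per unique_id from core_update,
-- assuming the row with the largest `update` value is the one to keep.
SELECT unique_id, `update`, other_data
FROM (
  SELECT
    unique_id,
    `update`,
    other_data,
    -- Number the rows of each unique_id from newest to oldest.
    ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY `update` DESC) AS rn
  FROM core_update
) ranked
WHERE rn = 1;
```

If two rows share both unique_id and `update`, ROW_NUMBER() keeps an arbitrary one of them, so a tie-breaking column would be needed in that case.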

Is there a good way to apply the contents of core_update to core_table, so that new unique_ids are added and the underlying data for existing ones is updated?

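-- Note: I am trying to avoid something that looks like MERGE -> DEDUP, because that process takes about 3 hours on a smaller data set and we have a very large one. So doing something similar to an insertion sort would be great.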

Algorithm 2: Updates into an unpartitioned table

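Step 1: Run the merge join query. Inputs: mainTable, the name of the staging table that holds the merged records (call it stagingTable3), the name of the unpartitioned staging table (call it stagingTable2), the table's primary key, and the table's fields. Build the merge join query: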

 insert overwrite table stagingTable3
   select each column in "List tableFields", adding the field name with alias A
   from mainTable with alias A
   apply the left outer join with stagingTable2 with alias B
   check for A.primaryKey = B.primaryKey
   and where B.primaryKey is null

Then union this with the data selected from stagingTable2.
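
One way to read Step 1 as concrete HiveQL is sketched below. It assumes the fields are the three columns of core_table shown earlier and that unique_id plays the role of primaryKey; the alias `u` is made up for illustration. It is an interpretation of the pseudocode, not a verified implementation.

```sql
-- Sketch only: stagingTable3 receives
--   (a) rows of mainTable that have no match in stagingTable2, plus
--   (b) every row of stagingTable2.
INSERT OVERWRITE TABLE stagingTable3
SELECT u.unique_id, u.`update`, u.other_data
FROM (
  -- (a) anti-join: main rows whose key does not appear in the updates
  SELECT A.unique_id, A.`update`, A.other_data
  FROM mainTable A
  LEFT OUTER JOIN stagingTable2 B
    ON A.unique_id = B.unique_id
  WHERE B.unique_id IS NULL
  UNION ALL
  -- (b) all update rows, both new keys and replacements
  SELECT B2.unique_id, B2.`update`, B2.other_data
  FROM stagingTable2 B2
) u;
```

Note that this only yields one row per key if stagingTable2 itself has no duplicate keys; any duplicates in it pass straight through the UNION ALL.
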
Step 2: Load the data into mainTable by overwriting it from stagingTable3, using the load query given below:
```
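-- The path is a placeholder: LOAD DATA takes the location of the data files, not a table name.
LOAD DATA INPATH '<path to stagingTable3 data>' OVERWRITE INTO TABLE mainTable;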
```

However, this still doesn't quite make sense to me (or work, at least as I interpret it).
