We are using Hive 3.1.x clusters on HDI 4.0: one is an LLAP cluster and the other is just Hive.
We created a managed table on both clusters, with 272409 rows.
Before the merge, on both clusters:
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  | min_lmd                | max_lmd                |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-26 23:42:17.0  |
+---------------------+------------+---------------------+------------------------+------------------------+
Based on the delta, we performed a merge operation (which updates 17 rows).
After the merge on the hive-llap cluster (before compaction):
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  | min_lmd                | max_lmd                |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272392              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+
After the merge on the hive-llap cluster (after compaction):
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  | min_lmd                | max_lmd                |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+
After the merge on the Hive-only cluster (deltas not compacted):
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  | min_lmd                | max_lmd                |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+
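This is the inconsistency we observed: before compaction, the hive-llap cluster reports a col_distinct_count of 272392, while the Hive-only cluster reports 272409 for the same data.
However, after compacting the table on hive-llap, the inconsistency disappears and both clusters return the same result.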
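One thing that may be relevant: the gap in the distinct count on hive-llap appears to match the number of rows the merge updates. Here is a quick arithmetic check in Python, using only the figures from the result tables above. The variable names are just for illustration, and this is not a diagnosis of the cause.

```python
# Figures copied from the result tables in this post.
total_rows = 272409          # col_count, identical on both clusters
distinct_llap_pre = 272392   # col_distinct_count on hive-llap before compaction
rows_updated = 17            # rows touched by the merge

# How many distinct values "go missing" on hive-llap before compaction?
missing_distinct = total_rows - distinct_llap_pre

print(missing_distinct)                  # 17
print(missing_distinct == rows_updated)  # True
```

So col_count stays at 272409 while the distinct count drops by exactly 17, which suggests (but does not prove) that the discrepancy is tied to the 17 updated rows rather than to the rest of the table.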
We thought it might be due to either caching or an LLAP issue, so we restarted the hive-server2 process, which should clear the cache. The issue still persists.
We also created a dummy table with the same schema on the Hive-only cluster and pointed its location to that of the LLAP table; that table produces the expected result.
We even queried the table from Spark using the **Qubole spark-acid reader** (a direct Hive managed table reader), which also produces the expected result.
This is very strange and peculiar. Can anyone help here?