
We are running Hive 3.1.x clusters on HDI 4.0: one is an LLAP cluster and the other is a plain Hive cluster (no LLAP).

We created a managed table on both clusters, with 272409 rows.

Before the merge, on both clusters:

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-26 23:42:17.0  |
+---------------------+------------+---------------------+------------------------+------------------------+
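The post does not show the query behind these result sets. A minimal sketch of an aggregate that would produce columns of this shape is given here; the table name `orders` and the columns `order_id` and `lmd` are hypothetical stand-ins, not names taken from the original.

```sql
-- Hypothetical names: orders (the managed table), order_id (the key column
-- being counted), lmd (the last-modified timestamp). Only the output column
-- aliases come from the result sets shown in the post.
SELECT order_created_date,
       COUNT(order_id)          AS col_count,
       COUNT(DISTINCT order_id) AS col_distinct_count,
       MIN(lmd)                 AS min_lmd,
       MAX(lmd)                 AS max_lmd
FROM orders
GROUP BY order_created_date;
```

Running the same statement on both clusters, before and after the merge, gives directly comparable rows.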

Based on the delta, we performed a merge operation (which updated 17 rows).
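The exact merge statement is not given. As a rough sketch only, an update-only Hive `MERGE` against an ACID managed table could look like the following; `orders`, `orders_delta`, `order_id`, and `lmd` are assumed names used for illustration.

```sql
-- Hypothetical names throughout; the post does not show the real statement.
-- orders       : the ACID managed target table
-- orders_delta : a staging table holding the changed rows
MERGE INTO orders AS t
USING orders_delta AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET lmd = s.lmd;  -- column on the left of SET is left unqualified
```

On an ACID table, an update of this kind is written as delta files alongside the base data, and readers combine the two until a compaction rewrites them.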

After the merge on the hive-llap cluster (before compaction):

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272392              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

After the merge on the hive-llap cluster (after compaction):

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

After the merge on the plain Hive cluster (without compacting the deltas):

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

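This is the inconsistency observed: before compaction, the hive-llap cluster reports a distinct count of 272392 against a row count of 272409, a difference of 17, which equals the number of rows updated by the merge. The plain Hive cluster reports 272409 for both counts without any compaction.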

However, after compacting the table on hive-llap, the inconsistency in the result set disappears and both clusters return the same result.
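For reference, a compaction can be requested manually with statements along these lines. This is a sketch under assumptions: the table name `orders` is hypothetical, and the post does not say whether a major or minor compaction was run or whether the table is partitioned.

```sql
-- Hypothetical table name. For a partitioned table, a PARTITION (...) clause
-- naming the partition to compact goes between the table name and COMPACT.
ALTER TABLE orders COMPACT 'major';

-- Compaction runs asynchronously; this lists queued, running, and finished requests.
SHOW COMPACTIONS;
```

A major compaction rewrites the base and delta files into a new base, so a query that runs after it completes no longer has to merge deltas at read time.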

We suspected either caching or an LLAP issue, so we restarted the HiveServer2 process to clear the cache. The issue persisted.

We also created a dummy table with the same schema on the plain Hive cluster and pointed its location at that of the LLAP table; querying it produces the expected result.

We even queried the table from Spark using the **Qubole spark-acid reader** (a direct reader for Hive managed tables), which also produces the expected result.

This is very strange. Can someone help here?


2 Answers


We ran into a similar issue on an HDInsight Hive LLAP cluster. Setting hive.llap.io.enabled to false resolved the problem.
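One way to try this without changing the cluster configuration is to set the property for the current session and re-run the check. The query below reuses hypothetical names (`orders`, `order_id`), and whether a session-level override is permitted depends on how the cluster is configured.

```sql
-- Disable the LLAP IO layer (and with it the LLAP data cache) for this session only.
SET hive.llap.io.enabled=false;

-- Re-run the comparison; orders and order_id are assumed names.
SELECT order_created_date,
       COUNT(order_id)          AS col_count,
       COUNT(DISTINCT order_id) AS col_distinct_count
FROM orders
GROUP BY order_created_date;
```

If the two counts agree with the setting off and disagree with it on, that points at the LLAP IO path rather than at the data on disk.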

answered 2020-08-14T06:58:16.387

Qubole does not support Hive LLAP yet. (However, we at Qubole are evaluating whether to support it in the future.)

answered 2020-08-04T16:12:18.847