hive - How to drop a database from the Hive metastore when the underlying HDFS cluster no longer exists

Question

I am using ephemeral GCP Dataproc clusters (Apache Spark 2.2.1, Apache Hadoop 2.8.4 and Apache Hive 2.1.1). These clusters all point to the same Hive Metastore (hosted on a Google Cloud SQL instance).

I created a database on one such cluster and set its location to an HDFS path of the form 'hdfs:///database_name', like so:

$ gcloud dataproc jobs submit hive \
    -e "create database db_name LOCATION 'hdfs:///db_name'" \
    --cluster=my-first-ephemeral-cluster --region=europe-west1

my-first-ephemeral-cluster was then deleted, and its associated HDFS was deleted along with it.

On all subsequent clusters, the following error has been popping up ever since:

u'java.net.UnknownHostException: my-first-ephemeral-cluster-m'

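This is probably because the Hive Metastore now has an entry for a location that no longer exists. Trying to drop the broken database does not work either: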

$ gcloud dataproc jobs submit hive \
    -e 'drop database db_name' \
    --cluster=my-second-ephemeral-cluster --region=europe-west1

Job [4462cb1d-88f2-4e2b-8a86-c342c0ce46ee] submitted.
Waiting for job output...
Connecting to jdbc:hive2://my-second-ephemeral-cluster-m:10000
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
18/11/03 13:40:04 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: java.net.UnknownHostException: my-first-ephemeral-cluster-m) (state=08S01,code=1)
Closing: 0: jdbc:hive2://my-second-ephemeral-cluster-m:10000

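The reason is that the host my-first-ephemeral-cluster-m is no longer valid. Since changing a database's location is not an option in the version of Hive I am using, I need a different workaround to drop this database.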

score 1 · Accepted Answer

https://cwiki.apache.org/confluence/display/Hive/Hive+MetaTool

The Hive MetaTool enables administrators to do bulk updates on the location fields in database, table, and partition records in the metastore.

(...) Example (...)
./hive --service metatool -updateLocation hdfs://localhost:9000 hdfs://namenode2:8020

But first, you need to know how the pseudo-HDFS path was saved in the Metastore in its "canonical" form, e.g. hdfs://my-first-ephemeral-cluster-m/db_name (assuming Google follows Hadoop conventions to some extent).
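As a rough sketch of how this could look for the cluster names in the question: the MetaTool's -listFSRoot option prints the filesystem roots currently recorded in the metastore, which is one way to check the canonical form, and -dryRun previews an -updateLocation without applying it. The old root below is only a guess at that canonical form, and the new root (the second cluster's master) is an assumption chosen for illustration. The script prints the commands for review instead of running them; they would need to be run on a node whose Hive configuration points at the shared metastore.

```shell
#!/usr/bin/env bash
# Sketch only: both roots are assumptions, confirm OLD_ROOT with -listFSRoot.
OLD_ROOT="hdfs://my-first-ephemeral-cluster-m"   # guessed canonical form of the dead root
NEW_ROOT="hdfs://my-second-ephemeral-cluster-m"  # assumed root of a cluster that still exists

# Build the metatool invocation as a string.
# Note the argument order expected by -updateLocation: <new-loc> <old-loc>.
build_update_cmd() {
  local new_root="$1" old_root="$2" dry_run="${3:-yes}"
  local cmd="hive --service metatool -updateLocation ${new_root} ${old_root}"
  if [ "$dry_run" = "yes" ]; then
    cmd="${cmd} -dryRun"
  fi
  printf '%s\n' "$cmd"
}

# Step 1: list the FS roots the metastore currently knows about.
echo "hive --service metatool -listFSRoot"
# Step 2: preview the rewrite without changing anything.
build_update_cmd "$NEW_ROOT" "$OLD_ROOT" yes
# Step 3: apply the rewrite once the preview looks right.
build_update_cmd "$NEW_ROOT" "$OLD_ROOT" no
```

Once the location points at a host that resolves, the drop database db_name job from the question should no longer have to look up the deleted master, though I have not verified this on Dataproc.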

score 0 · Accepted Answer

From my point of view, the correct way to delete the Hive metastore entry that causes the error is to drop the database just before you delete the cluster my-first-ephemeral-cluster, for example with a script that runs this sequence:

gcloud dataproc jobs submit hive -e 'drop database db_name' --cluster=my-first-ephemeral-cluster --region=europe-west1
gcloud dataproc clusters delete my-first-ephemeral-cluster --region=europe-west1

However, I found the Cloud SQL proxy instructions for setting up a shared Hive warehouse between different Dataproc clusters using Cloud Storage (instead of LOCATION 'hdfs:///db_name', which creates the Hive warehouse in the cluster's local HDFS). That could give you the behavior you are looking for.
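A minimal sketch of what the Cloud Storage variant of the create statement might look like. The bucket name my-hive-warehouse-bucket is hypothetical; substitute a bucket that every cluster can read and write. The helper only builds the HiveQL string, and the gcloud call that would submit it is left as a comment.

```shell
#!/usr/bin/env bash
# Hypothetical bucket name, not taken from the question.
WAREHOUSE_BUCKET="my-hive-warehouse-bucket"

# Build a CREATE DATABASE statement whose location lives in Cloud Storage,
# so it does not depend on any one cluster's HDFS.
create_db_statement() {
  local db_name="$1" bucket="$2"
  printf "create database %s LOCATION 'gs://%s/%s'\n" "$db_name" "$bucket" "$db_name"
}

create_db_statement db_name "$WAREHOUSE_BUCKET"

# To submit it (not executed here), reusing the cluster and region from the question:
# gcloud dataproc jobs submit hive \
#     -e "$(create_db_statement db_name "$WAREHOUSE_BUCKET")" \
#     --cluster=my-first-ephemeral-cluster --region=europe-west1
```

Because the location is a gs:// URI rather than an hdfs:// one, the metastore entry does not name a cluster master that disappears when the cluster is deleted.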

score 0 · Accepted Answer

I created a cluster with the same name in Dataproc in order to drop the schema that had been created with a location in HDFS.
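Spelled out, that workaround might look like the sequence below. This is a sketch, not the exact commands that were used: the cluster create call omits whatever initialization actions and metadata your clusters normally use to reach the shared metastore, and those would have to be added. With DRY_RUN left at its default of 1, the script only prints the commands.

```shell
#!/usr/bin/env bash
CLUSTER="my-first-ephemeral-cluster"   # must match the deleted cluster's name
REGION="europe-west1"
DRY_RUN="${DRY_RUN:-1}"                # set DRY_RUN=0 to actually execute

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# 1. Recreate a cluster with the old name so the master hostname resolves again.
#    Assumption: add the same metastore-related options as your other clusters.
run gcloud dataproc clusters create "$CLUSTER" --region="$REGION"

# 2. Drop the database while the hostname stored in the metastore is valid.
run gcloud dataproc jobs submit hive -e 'drop database db_name' \
    --cluster="$CLUSTER" --region="$REGION"

# 3. Delete the temporary cluster again.
run gcloud dataproc clusters delete "$CLUSTER" --region="$REGION" --quiet
```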
