apache-spark - How to use a different Hive Metastore for saveAsTable?

Question

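I am using Spark SQL with PySpark (Spark 1.6.1), and I need to load a table from one Hive metastore and write the resulting DataFrame to another Hive metastore.

How can I use two different metastores in a single Spark SQL script?

This is what my script looks like.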

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Hive metastore 1
sc1 = SparkContext()
hiveContext1 = HiveContext(sc1)
hiveContext1.setConf("hive.metastore.warehouse.dir", "tmp/Metastore1")

# Hive metastore 2
sc2 = SparkContext()
hiveContext2 = HiveContext(sc2)
hiveContext2.setConf("hive.metastore.warehouse.dir", "tmp/Metastore2")

# Reading from a table present in metastore 1
df_extract = hiveContext1.sql("select * from emp where emp_id =1")

# Need to write the result into a table in the other metastore (metastore 2)
df_extract.saveAsTable('targetdbname.target_table',mode='append',path='maprfs:///abc/datapath...')

score 1 · Accepted Answer

HotelsDotCom developed an application (WaggleDance) specifically for this: https://github.com/HotelsDotCom/waggle-dance. Using it as a proxy, you should be able to achieve what you are trying to do.
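
To make the proxy idea concrete, here is a minimal sketch, not taken from the answer, of how the single metastore setting Spark reads could be pointed at such a proxy. It assumes a proxy instance is already running and exposes a Thrift metastore endpoint; the host name and port are placeholders, and the helper names `metastore_uri` and `render_hive_site` are made up for illustration. Only the `hive.metastore.uris` property name is standard Hive configuration.

```python
import xml.etree.ElementTree as ET


def metastore_uri(host, port):
    """Build the Thrift URI that Hive clients use to reach a metastore."""
    return "thrift://{}:{}".format(host, port)


def render_hive_site(uri):
    """Render a minimal hive-site.xml that sets hive.metastore.uris."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = uri
    return ET.tostring(configuration, encoding="unicode")


# Placeholder host and port for the proxy (assumptions); replace with real values.
PROXY_URI = metastore_uri("waggle-dance.example.com", 48869)
HIVE_SITE_XML = render_hive_site(PROXY_URI)
```

The rendered XML would go into Spark's conf directory as hive-site.xml, so that the one HiveContext in the script talks to the proxy rather than to either metastore directly. How the tables of each underlying metastore are addressed then depends on how the proxy itself is configured.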

score 0 · Accepted Answer

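TL;DR It is not possible to use one Hive metastore (for some tables) and another (for other tables).

Since Spark SQL supports a single Hive metastore (in SharedState) regardless of the number of SparkSessions, reading from and writing to different Hive metastores is technically impossible.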
