apache-spark - How to connect to a remote Hive server from Spark

Question

I am running Spark locally and want to access Hive tables that live in a remote Hadoop cluster.

I can access the Hive tables by launching beeline under SPARK_HOME:

[ml@master spark-2.0.0]$./bin/beeline 
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>

How can I access the remote Hive tables programmatically from Spark?

score 23 · Accepted Answer

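JDBC is not required

Spark connects to the Hive metastore directly, not through HiveServer2. To configure this:

Put hive-site.xml on your classpath, and set hive.metastore.uris to the location where your Hive metastore is hosted. See also: How to connect to a Hive metastore programmatically in SparkSQL?
Import org.apache.spark.sql.hive.HiveContext, as it can run SQL queries over Hive tables.
Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Verify that it works with sqlContext.sql("show tables")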
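
For the first step, a minimal sketch of generating a hive-site.xml that carries only hive.metastore.uris. The helper name build_hive_site is hypothetical, and the thrift://remote_hive:9083 address is an assumption: the host is taken from the question and 9083 is the port used in another answer here, so substitute your own metastore address.

```python
import xml.etree.ElementTree as ET


def build_hive_site(metastore_uri):
    """Return a minimal hive-site.xml document as a string.

    Only hive.metastore.uris is set; a real deployment may need more properties.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uri
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Assumed address: replace host and port with your metastore's.
    print(build_hive_site("thrift://remote_hive:9083"))
```

Saving the printed document as hive-site.xml in the conf directory under SPARK_HOME is one way to get it onto Spark's classpath.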

SparkSQL on Hive tables

Conclusion: if you must go the JDBC way

Have a look at "Connecting Apache Spark and Apache Hive remotely".

Note that beeline also connects through JDBC, as your own log shows:

[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000

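So please have a look at this interesting article, which describes three approaches:

Method 1: Pull the table into Spark using JDBC
Method 2: Use Spark JdbcRDD with the HiveServer2 JDBC driver
Method 3: Fetch the dataset on the client side, then create the RDD manually

Currently the HiveServer2 driver does not allow us to use the "Sparkling" methods 1 and 2, so we can only rely on method 3.

Below is an example code snippet that achieves this:

Load data from one Hadoop cluster (aka "remote") into another (where my Spark lives, aka "domestic") over a HiveServer2 JDBC connection.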

import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList

case class StatsRec (
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)

// url, user and password are your HiveServer2 JDBC settings,
// e.g. url = "jdbc:hive2://remote_hive:10000"; sc is the SparkContext (as in spark-shell).
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
                   .executeQuery("SELECT * FROM stats_201512301914")
// Collect every row on the client (driver) side.
val fetchedRes = MutableList[StatsRec]()
while (res.next()) {
  val rec = StatsRec(res.getString("first_name"),
     res.getString("last_name"),
     Timestamp.valueOf(res.getString("action_dtm")),
     res.getLong("size"),
     res.getLong("size_p"),
     res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()
// Turn the fetched rows into an RDD and keep it cached.
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()

// Basically we are done. To check the loaded data:

println(rddStatsDelta.count)
rddStatsDelta.take(10).foreach(println)

score 1 · Accepted Answer

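After providing the hive-site.xml configuration to Spark and starting the Hive Metastore service,

two things need to be configured in the Spark session when connecting to Hive:

Because Spark SQL connects to the Hive metastore using thrift, we need to provide the thrift server URI when creating the Spark session.
The Hive metastore warehouse, which is the directory where Spark SQL persists tables. Use the property "spark.sql.warehouse.dir", which corresponds to "hive.metastore.warehouse.dir" (the latter is deprecated as of Spark 2.0).

Something like: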

    SparkSession spark=SparkSession.builder().appName("Spark_SQL_5_Save To Hive").enableHiveSupport().getOrCreate();
    spark.sparkContext().conf().set("spark.sql.warehouse.dir", "/user/hive/warehouse");
    spark.sparkContext().conf().set("hive.metastore.uris", "thrift://localhost:9083");
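
The same two settings can also be supplied on the builder before the session is created, which is the form the Spark documentation uses for spark.sql.warehouse.dir. A minimal PySpark sketch, assuming PySpark is installed; the helper names hive_session_conf and build_session are hypothetical, and the default port and warehouse path simply repeat the values used above.

```python
def hive_session_conf(metastore_host, metastore_port=9083,
                      warehouse_dir="/user/hive/warehouse"):
    """Collect the two settings discussed above into one dict."""
    return {
        # Thrift URI of the Hive metastore service.
        "hive.metastore.uris": "thrift://{}:{}".format(metastore_host, metastore_port),
        # Directory where Spark SQL persists managed tables.
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_session(app_name, conf):
    """Create a Hive-enabled SparkSession with the given settings."""
    # Imported here so the conf helper can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Example (needs a reachable metastore):
# spark = build_session("Spark_SQL_5_Save To Hive", hive_session_conf("localhost"))
# spark.sql("show tables").show()
```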

Hope this helps!

score 0 · Accepted Answer

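According to the documentation:

Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.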

That deprecation only concerns the warehouse directory. The metastore address is still given by hive.metastore.uris (Spark has no spark.sql.uris property), and it can be set on the SparkSession builder:

    from pyspark.sql import SparkSession
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("hive.metastore.uris", "thrift://<remote_ip>:9083") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sql("show tables").show()
