
I can create a Hive context programmatically on Spark 1.6.0 with the following:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
val sc = new SparkContext(conf)
val hc = new HiveContext(sc)
val actualRecordCountHC = hc.sql("select count(*) from hiveorc_replica.appointment")

This works fine for me. I want to create a Hive context the same way on Spark 2.3.0, but when I run the program it throws the following error:

org.apache.spark.sql.AnalysisException: 
Table or view not found: `hiveorc_replica`.`appointment`; line 1 pos 21;
'Aggregate [unresolvedalias(count(1), None)]
    'UnresolvedRelation `hiveorc_replica`.`appointment`

I know HiveContext(sc) is deprecated in 2.3.0, but the same statements do return results when run as commands in spark-shell. Also, I want the program to work on both Spark versions. Can somebody suggest a way to query the Hive tables directly, without using the Hive database file name?

Below is the hive-site.xml I use for the remote connection:

<?xml version="1.0" encoding="UTF-8"?>

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://fqdn:9083</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>300</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value>20971520</value>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/var/log/hive/operation_logs</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>67108864</value>
  </property>
  <property>
    <name>hive.exec.copyfile.maxsize</name>
    <value>33554432</value>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1099</value>
  </property>
  <property>
    <name>hive.vectorized.groupby.checkinterval</name>
    <value>4096</value>
  </property>
  <property>
    <name>hive.vectorized.groupby.flush.percent</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.vectorized.execution.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.vectorized.execution.reduce.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.merge.mapfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.mapredfiles</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.cbo.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion</name>
    <value>minimal</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion.threshold</name>
    <value>268435456</value>
  </property>
  <property>
    <name>hive.limit.pushdown.memory.usage</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hive.merge.sparkfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.smallfiles.avgsize</name>
    <value>16777216</value>
  </property>
  <property>
    <name>hive.merge.size.per.task</name>
    <value>268435456</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication.min.reducer</name>
    <value>4</value>
  </property>
  <property>
    <name>hive.map.aggr</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.map.aggr.hash.percentmemory</name>
    <value>0.5</value>
  </property>
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>mr</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>268435456</value>
  </property>
  <property>
    <name>spark.driver.memory</name>
    <value>268435456</value>
  </property>
  <property>
    <name>spark.executor.cores</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.yarn.driver.memoryOverhead</name>
    <value>26</value>
  </property>
  <property>
    <name>spark.yarn.executor.memoryOverhead</name>
    <value>26</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.initialExecutors</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.minExecutors</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.maxExecutors</name>
    <value>2147483647</value>
  </property>
  <property>
    <name>hive.metastore.execute.setugi</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>fqdn</value>
  </property>
  <property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
  </property>
  <property>
    <name>hive.zookeeper.namespace</name>
    <value>hive_zookeeper_namespace_CD-HIVE-WAyDdBlP</value>
  </property>
  <property>
    <name>hive.cluster.delegation.token.store.class</name>
    <value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/_HOST@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/_HOST@EXAMPLE.COM</value>
  </property>
  <property>
    <name>spark.shuffle.service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>LDAP</value>
  </property>
</configuration>

Here, fqdn is replaced at runtime with the host's HDFS FQDN, and this works perfectly on Spark 1.6.0.


1 Answer


In Spark 2.x.x you need to call enableHiveSupport() when creating the SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
      .appName("Example")
      .master("local")
      .config("hive.metastore.uris", "thrift://B:PortNumber")
      .enableHiveSupport()                // <---- This line here
      .getOrCreate()
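
For example, once Hive support is enabled you should be able to run the same count query from the question through the session (a minimal sketch; hiveorc_replica.appointment is simply the table from the question, and the metastore URI configured above is assumed to serve it):

// Query the Hive table through the SparkSession instead of a HiveContext.
val actualRecordCount = spark.sql("select count(*) from hiveorc_replica.appointment")
actualRecordCount.show()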

If you want it to be generic, I think you just need to create the SparkContext and HiveContext separately depending on the version:

if (sparkVersion.startsWith("1.")) {
  // Spark 1.x: create the SparkContext and HiveContext the old way
} else {
  // Spark 2.x: create a SparkSession with enableHiveSupport(),
  // then get the SparkContext (and SQLContext) from it
}

Here you can find how to determine the Spark version programmatically.
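
As a rough sketch of that version check (sc here is assumed to be an existing SparkContext; its version method returns the running Spark version string, e.g. "1.6.0" or "2.3.0"):

import org.apache.spark.SparkContext

// Branch on the major version reported by the running SparkContext.
def isSpark2OrLater(sc: SparkContext): Boolean =
  sc.version.split("\\.").head.toInt >= 2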

Answered 2018-05-18T07:03:51.953