Questions tagged [spark-hive]



68 questions

0 votes

2 answers

185 views

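hive - Inconsistent result sets between hive and hive-llap

We are running Hive 3.1.x clusters on HDI 4.0: one is an LLAP cluster and the other is a plain Hive cluster.

We created a managed table on both clusters, with 272409 rows.

Before the merge, on both clusters

Based on the delta, we performed a merge operation (which updates 17 rows).

After the merge on the hive-llap cluster (before compaction)

After the merge on the hive-llap cluster (after compaction)

After the merge on the plain Hive cluster (without compacting the deltas)

This is the observed inconsistency

However, after compacting the table on hive-llap, the inconsistency disappears and both clusters return the same result.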

We thought it might be caused by either caching or an LLAP issue, so we restarted the hive-server2 process, which clears the cache. The issue still persists.

We also created a dummy table with the same schema on the plain Hive cluster and pointed its location to that of the LLAP table; that table produces the expected result.

We even queried the table from Spark using the **Qubole spark-acid reader** (a direct reader for Hive managed tables), which also produces the expected result.

This is very strange and peculiar. Can anyone help here?

2020-07-30T17:51:45.897

0 votes

0 answers

202 views

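java - Spark job fails in Oozie when Hive support is enabled

I am trying to schedule an Oozie workflow with a Spark action that has Hive support enabled. As a plain Spark job without Hive support, the action runs fine. After adding Hive support, I can still run the Spark job through spark-submit, but when I try to run it in Oozie, it fails.

Below is the code that creates the Spark session:

Below are the dependencies:

Below is the Oozie workflow action:

Do I need to add anything to, or remove anything from, the share-lib directory?

--- Edited --- The above error occurs if I do not add hive to the global properties. If we add hive to the global properties

then a different exception is thrown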

java apache-spark hive oozie spark-hive

2020-08-17T11:09:45.977

0 votes

1 answer

271 views

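apache-spark - Spark saveAsTable with the location at the root of an s3 bucket causes NullPointerException

I am using Spark 3.0.1 and my partitioned table is stored in s3. Please find the description of the problem here.

Creating the table

The code that causes the NullPointerException on the second run

When the Hive metastore is empty everything works fine, but the problem occurs when Spark reaches getCustomPartitionLocations in InsertIntoHadoopFsRelationCommand (for example, on the second run).

It actually calls the following method from org.apache.hadoop.fs.Path:

But getParent() returns null when we are at the root, which causes the NullPointerException. The only option I am currently considering is to override this method to do the following:

Has anyone run into problems when the LOCATION of a Spark Hive table is at the root level? Is there any workaround? Is there a known open issue?

My runtime does not allow me to override the Path class and fix the suffix method, and I cannot move my data out of the root of the bucket because it has been there for 2 years.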
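The failure mode described above (a parent lookup that returns null at the bucket root, followed by an unguarded dereference) can be illustrated without any Hadoop dependency. The sketch below is not the Hadoop `Path` implementation and not the override the question refers to; `RootSuffixSketch`, `parentOf`, `nameOf`, `naiveSuffix` and `safeSuffix` are hypothetical names, the bucket `my-bucket` is made up, and the suffix shape `/col=value` is an assumption chosen for the example.

```java
import java.net.URI;

/** Hypothetical stand-in for a path type whose parent is null at the root. */
final class RootSuffixSketch {

    /** Path component with trailing slashes removed ("" or "/" means root). */
    private static String pathOf(URI location) {
        String path = location.getPath() == null ? "" : location.getPath();
        while (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path;
    }

    private static String prefixOf(URI location) {
        return location.getScheme() + "://" + location.getAuthority();
    }

    static boolean isRoot(URI location) {
        String path = pathOf(location);
        return path.isEmpty() || path.equals("/");
    }

    /** Returns the parent location, or null at the root (the behaviour described for getParent()). */
    static String parentOf(URI location) {
        if (isRoot(location)) {
            return null;
        }
        String path = pathOf(location);
        int lastSlash = path.lastIndexOf('/');
        return prefixOf(location) + (lastSlash <= 0 ? "/" : path.substring(0, lastSlash));
    }

    /** Last path segment; empty at the root. */
    static String nameOf(URI location) {
        String path = pathOf(location);
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Appends a suffix via parent + name; throws NullPointerException at the root. */
    static String naiveSuffix(URI location, String suffix) {
        String parent = parentOf(location);
        String child = nameOf(location) + suffix;
        // parent is null at the root, so this dereference fails there.
        return parent.endsWith("/") ? parent + child : parent + "/" + child;
    }

    /** Same as naiveSuffix, but handles the root explicitly instead of dereferencing null. */
    static String safeSuffix(URI location, String suffix) {
        if (isRoot(location)) {
            String trimmed = suffix.startsWith("/") ? suffix.substring(1) : suffix;
            return prefixOf(location) + "/" + trimmed;
        }
        return naiveSuffix(location, suffix);
    }

    public static void main(String[] args) {
        URI table = URI.create("s3://my-bucket/table");
        URI root = URI.create("s3://my-bucket/");
        System.out.println(safeSuffix(table, "/dt=2020-10-09"));
        System.out.println(safeSuffix(root, "/dt=2020-10-09"));
    }
}
```

The guard only helps where the calling code can be changed; as the question notes, that is not possible when the method lives in a library class the runtime does not let you replace.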

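The problem arises because I am migrating from Spark 2.1.0 to Spark 3.0.1, and the check for custom partition locations was introduced in Spark 2.2.0 (https://github.com/apache/spark/pull/16460).

The whole context helps to understand the problem, but basically you can reproduce it easily with

For reference, the hadoop-common version is 2.7.4. Please find the full stack trace here

Thanks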

apache-spark hadoop hive hadoop2 spark-hive

2020-10-09T21:39:52.380

0 votes

1 answer

89 views

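java - Exception when connecting to a Dataproc Hive server using Java and Spark from Eclipse

I am trying to access a Hive server running in GCP (Dataproc) from my local machine (Eclipse) using Java and Spark. However, I get the following error when starting the application. I tried to track down the problem but could not solve it.

Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.

at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:870)
at com.hadoop.Application.main(Application.java:22)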
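An error like this generally means the Hive integration classes are not on the runtime classpath of the process that calls enableHiveSupport(). A quick way to check is to probe for the classes by name before building the session. The sketch below uses only the JDK; `HiveClasspathCheck` and `missingClasses` are hypothetical names, and the two class names in `ASSUMED_HIVE_CLASSES` are an assumption about what Spark looks for, so adjust them for the Spark version in use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reports which of the given classes cannot be loaded. */
final class HiveClasspathCheck {

    // Assumption: these are the classes whose absence triggers the error above.
    static final String[] ASSUMED_HIVE_CLASSES = {
        "org.apache.spark.sql.hive.HiveSessionStateBuilder",
        "org.apache.hadoop.hive.conf.HiveConf"
    };

    /** Returns the subset of classNames that is not loadable from the current classpath. */
    static List<String> missingClasses(String... classNames) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = HiveClasspathCheck.class.getClassLoader();
        }
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only check presence, do not run static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(ASSUMED_HIVE_CLASSES);
        if (missing.isEmpty()) {
            System.out.println("All probed Hive classes are on the classpath.");
        } else {
            System.out.println("Not on the classpath: " + missing);
        }
    }
}
```

If the probe reports missing classes when run from the same Eclipse launch configuration, the usual things to inspect are whether the spark-hive artifact is declared in the pom at all and whether its scope excludes it from the runtime classpath.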

Pom.xml：

java apache-spark google-cloud-platform google-cloud-dataproc spark-hive

2021-06-18T16:22:03.377

0 votes

1 answer

77 views