问题标签 [spark-structured-streaming]

问问题

For questions regarding programming in ECMAScript (JavaScript/JS) and its various dialects/implementations (excluding ActionScript). Note JavaScript is NOT the same as Java! Please include all relevant tags on your question; e.g., [node.js], [jquery], [json], [reactjs], [angular], [ember.js], [vue.js], [typescript], [svelte], etc.

1999 问题

0 投票

1 回答

1609 浏览

apache-spark - SPARK 结构化流中的 StructField 是否存在错误

当我尝试这个时：

然后我得到一些错误：

AssertionError: dataType 应该是 DataType

我在 ./pyspark/sql/types.py 的第 403 行搜索源代码：

但是StringType基于AtomicType而不是DataType

那么有错误吗？

2017-01-06T08:47:57.993

0 投票

1 回答

451 浏览

scala - MQTT 结构化流

我正在尝试设置 Spark Streaming 以读取 MQTT 源，但是当我收到第二条消息时它会启动异常。

我有以下代码：

当我收到第二条消息时，我观察到以下异常：

有人遇到过这个问题吗？

scala apache-spark mqtt spark-structured-streaming

2017-01-16T16:21:38.553

0 投票

1 回答

821 浏览

apache-spark - structured streaming 2.1.0 kafka driver works on YARN with --packages but having trouble on standalone cluster mode

Currently we are testing structured streaming Kafka drivers. We submit on YARN(2.7.3) with --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0', without problems. However when we try to launch on spark standalone with deploy mode=cluster, we get the

error even though the launch command has added the Kafka jars to -Dspark.jars (see below) and subsequent log further states these jars have been successfully added.

All 10 jars exist in /home/spark/.ivy2 on all nodes. I manually checked to see that KafkaSourceProvider class does exist in the org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar. I further confirmed there are no issues with the jars by launching the driver in YARN without the --packages option and manually adding all 10 jars with --jars option. The nodes run Scala 2.11.8.

Any insights appreciated.

The automatically added jars by spark-submit:
/li>
Spark info messages which appears to have loaded these jars:
/li>
The error message:
/li>

apache-spark spark-structured-streaming

2017-01-26T23:52:28.817

0 投票

5 回答

5513 浏览

apache-spark - Apache Spark（结构化流）：S3 检查点支持

来自 spark 结构化流文档：“此检查点位置必须是 HDFS 兼容文件系统中的路径，并且可以DataStreamWriter在启动查询时设置为选项。”

果然，将检查点设置为 s3 路径会引发：

这里有几个问题：

为什么不支持 s3 作为检查点目录（常规 spark 流支持此）？是什么让文件系统“符合 HDFS”？
我临时使用 HDFS（因为集群可以随时启动或关闭）并使用 s3 作为保存所有数据的地方 - 在这种设置中存储结构化流数据的检查点数据的建议是什么？

apache-spark spark-structured-streaming

2017-02-02T15:53:48.183

0 投票

1 回答

269 浏览

apache-spark - ApacheSpark 流上的 ApacheBahir 结构化流连接器的架构问题

我正在尝试将 Apache Spark 结构化流连接到 MQTT 主题（在本例中为 IBM Bluemix 上的 IBM Watson IoT Platform）。

我正在创建结构化流，如下所示：

到目前为止一切顺利，在 REPL 中我得到了这个 df 对象，如下所示：

但是，如果我开始使用这一行从流中读取：

我收到以下错误：

我的直觉说架构有问题，所以我添加了一个：

但这无济于事，有什么想法吗？

apache-spark mqtt spark-structured-streaming watson-iot apache-bahir

2017-02-03T06:37:03.413

0 投票

1 回答

6271 浏览

scala - 为什么在流数据集上使用缓存失败并出现“AnalysisException：必须使用 writeStream.start() 执行带有流源的查询”？

在 Spark 2.1.0 中执行此示例时出现错误。如果没有该.cache选项，它会按预期工作，但有.cache选项我得到：

任何的想法？

scala apache-spark apache-spark-sql apache-spark-2.0 spark-structured-streaming

2017-02-06T07:07:00.487

0 投票

1 回答

1765 浏览

apache-spark - Spark Structured Stream 仅从 Kafka 的一个分区获取消息

我遇到了一种情况，即 spark 只能从 Kafka 2-patition 主题的一个分区流式传输和获取消息。

我的主题： C:\bigdata\kafka_2.11-0.10.1.1\bin\windows>kafka-topics --create --zookeeper localhost:2181 --partitions 2 --replication-factor 1 --topic test4

卡夫卡制作人：

Spark结构化流：

当我运行生产者在文件中发送 100 行时，查询仅返回 51 行。我阅读了 spark 的调试日志，发现如下：

我不知道为什么 test4-1 总是重置为最接近的偏移量。

如果有人知道如何从所有分区获取所有消息，我将不胜感激。谢谢，

apache-spark apache-kafka spark-structured-streaming

2017-02-15T00:59:03.423

0 投票

1 回答

1882 浏览

scala - Spark结构化流ForeachWriter无法获取sparkContext

我正在使用Spark 结构化流从Kafka 队列中读取JSON 数据，但我需要将JSON 数据写入Elasticsearch。

但是，我无法将sparkContextJSONForeachWriter转换为 RDD。它抛出 NPE。

如何SparkContext进入 Writer 将 JSON 转换为 RDD？

scala apache-spark spark-structured-streaming elasticsearch-spark

2017-02-22T14:41:11.970

0 投票

2 回答

1540 浏览

scala - 如何为 Spark 结构化流编写 ElasticsearchSink

我正在使用 Spark 结构化流处理来自 Kafka 队列的大量数据并进行一些繁重的 ML 计算，但我需要将结果写入 Elasticsearch。

我尝试使用ForeachWriter但无法SparkContext在其中使用，另一种选择可能是HTTP Post在ForeachWriter.

现在，我正在考虑编写我自己的 ElasticsearchSink。

是否有任何文档可以为 Spark 结构化流创建接收器？

scala apache-spark elasticsearch spark-structured-streaming

2017-02-23T22:00:07.877

0 投票

1 回答

1199 浏览

scala - 为什么内存接收器在附加模式下什么都不写？

我使用 Spark 的结构化流传输来自 Kafka 的消息。然后聚合数据，并以附加模式写入内存接收器。但是，当我尝试查询内存时，它什么也没返回。下面是代码：

结果总是：

如果我使用outputMode = complete，那么我可以获得聚合数据。但这不是我的选择，因为要求是使用附加模式。

代码有问题吗？谢谢，

scala apache-spark spark-structured-streaming

2017-02-28T04:12:58.697

1 2 3 4 5 6 7 8 9 10

问题标签 [spark-structured-streaming]

Reference