pyspark - PySpark streaming: process 1 record per trigger

Question

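I am attempting the Databricks Developer Foundation Capstone, but I cannot seem to pass the streaming exercise.

I am supposed to read a stream of JSON data, transform it, and append it to a table.

I created the DataFrame like this: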

df = (spark.readStream.schema(DDLSchema).option("maxFilesPerTrigger", 1).json(stream_path))

Then I transform it with orders_df = df.select(...) and write the result:

ordersQuery = (orders_df.writeStream
               .outputMode("append")
               .format("delta")
               .partitionBy('submitted_yyyy_mm')
               .queryName(orders_table)
               .trigger(processingTime="1 second")
               .option("checkpointLocation", orders_checkpoint_path)
               .table(orders_table))

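The transformations and everything else work fine, but the final check keeps failing. It says:

Expected the first 20 triggers to process 1 record per trigger | FAILED

I have googled the problem but could not find an answer anywhere.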

score 0 · Accepted Answer

Deleting the checkpoint path will solve the problem. Try running:

dbutils.fs.rm(orders_checkpoint_path, True)
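After removing the checkpoint and restarting the query, it can help to look at how many rows each trigger actually processed. The sketch below is an assumption about how such a check might look, not the capstone's actual validation code. It takes a list of dicts shaped like the entries of ordersQuery.recentProgress (each with a numInputRows field) and reports which of the first 20 triggers did not process exactly one record. The helper name triggers_with_wrong_count and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def triggers_with_wrong_count(progress: List[Dict[str, Any]],
                              expected: int = 1,
                              first_n: int = 20) -> List[int]:
    """Return the positions, among the first `first_n` progress entries,
    whose numInputRows differs from `expected`."""
    return [
        i for i, entry in enumerate(progress[:first_n])
        if entry.get("numInputRows") != expected
    ]


# Hypothetical entries shaped like items of StreamingQuery.recentProgress.
sample = [
    {"batchId": 0, "numInputRows": 1},
    {"batchId": 1, "numInputRows": 0},  # this trigger processed nothing
    {"batchId": 2, "numInputRows": 1},
]
print(triggers_with_wrong_count(sample))  # prints [1]
```

On the cluster this could be called as triggers_with_wrong_count(ordersQuery.recentProgress). An empty list would mean every inspected trigger processed one record.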

Answered 2021-10-24T13:19:26.400
