palantir-foundry - How do I handle "Could not execute broadcast in 300 secs"?

Question

I am trying to get a build working, and one of its stages intermittently fails with the following error:

Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1

How should I deal with this error?

score 3 · Accepted Answer

First, let's talk about what this error means.

From the official Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables):

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

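In my experience, broadcast timeouts usually occur when one of the input datasets is poorly partitioned. Rather than disabling broadcast, I recommend looking at how your datasets are partitioned and making sure they are partitioned correctly.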
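To make "poorly partitioned" a little more concrete, here is a minimal sketch that flags skew from per-partition row counts. It assumes you have already collected those counts (in PySpark, grouping by `spark_partition_id()` is one way to get them). The function names and the threshold of 5 are illustrative choices, not something Spark or Foundry defines.

```python
from statistics import mean


def skew_ratio(rows_per_partition):
    """Return the largest partition size divided by the mean partition size.

    A value near 1.0 means the partitions are roughly even; a large value
    means a few partitions hold most of the rows.
    """
    if not rows_per_partition:
        raise ValueError("need at least one partition")
    average = mean(rows_per_partition)
    if average == 0:
        return 1.0  # every partition is empty, so nothing is skewed
    return max(rows_per_partition) / average


def looks_skewed(rows_per_partition, threshold=5.0):
    # The threshold is an arbitrary cut-off chosen for this sketch.
    return skew_ratio(rows_per_partition) > threshold
```

If one partition holds nearly all the rows, the ratio is high and repartitioning is probably worth trying before touching the broadcast settings.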

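The rule of thumb I use is to take the size of the dataset in MB, divide it by 100, and set the number of partitions to that number. Since the default HDFS block size is 128 MB, we want to split the files into pieces of roughly that size, but because they do not split perfectly, we can divide by a smaller number to get more partitions.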

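The main point is that very small datasets (roughly under 125 MB) should sit in a single partition, because the network overhead of spreading them across more partitions is too high. Hope this helps.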
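As a rough sketch of the rule of thumb above: divide the dataset size in MB by 100, and keep anything under about 125 MB in one partition. The function name, the rounding up, and the default arguments are assumptions made for illustration.

```python
import math


def suggested_partitions(size_mb, target_mb=100, single_partition_below_mb=125):
    """Suggest a partition count from the dataset size in MB."""
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    if size_mb < single_partition_below_mb:
        return 1  # small datasets stay in a single partition
    # Rounding up is a choice made here; the rule of thumb does not specify it.
    return math.ceil(size_mb / target_mb)


for size in (40, 125, 1000, 2550):
    print(size, "MB ->", suggested_partitions(size), "partition(s)")
```

In PySpark, the resulting number would typically be passed to `DataFrame.repartition(n)` on the dataset in question.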
