azure - spark.conf.set with SparkR

Question

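I have a Databricks cluster running on Azure and want to read/write data from/to Azure Data Lake Storage using SparkR / sparklyr. Therefore I configured the two resources.

Now I have to provide the Spark environment with the necessary configurations to authenticate against the Data Lake Storage.

Setting the configurations using the PySpark API works: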

    spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("dfs.adls.oauth2.client.id", "****")
    spark.conf.set("dfs.adls.oauth2.credential", "****")
    spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/****/oauth2/token")

In the end, SparkR / sparklyr should be used. Here I cannot figure out where to put the spark.conf.set calls. I would have guessed something like:

    sparkR.session(
    sparkConfig = list(spark.driver.memory = "2g",
    spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential"),
    spark.conf.set("dfs.adls.oauth2.client.id", "****"),
    spark.conf.set("dfs.adls.oauth2.credential", "****"),
    spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/****/oauth2/token")
    ))

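It would be awesome if one of the experts using the SparkR API could help me here. Thanks!

Edit: The answer by user10791349 is correct and works. Another solution is to mount an external data source, which is best practice. This can currently only be done using Scala or Python, but the mounted data source can then be used from the SparkR API.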
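
For reference, here is a minimal Python sketch of the mount approach mentioned in the edit. It assumes an ADLS Gen1 account (the `adl://` scheme matches the `dfs.adls.*` keys above) and the `dbutils` object that Databricks notebooks provide. The helper names `build_adls_configs` and `mount_adls`, the account name, and the mount point are hypothetical and only serve the illustration.

```python
def build_adls_configs(client_id, credential, tenant_id):
    """Return the OAuth settings from the question as a plain dict."""
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": client_id,
        "dfs.adls.oauth2.credential": credential,
        "dfs.adls.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_adls(dbutils, account_name, mount_point, configs):
    """Mount the store; `dbutils` is passed in from the Databricks notebook."""
    # Assumption: an ADLS Gen1 account, addressed with the adl:// scheme.
    source = f"adl://{account_name}.azuredatalakestore.net/"
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
```

In a Python notebook cell this would be called as `mount_adls(dbutils, "<account>", "/mnt/<name>", build_adls_configs(...))`. Paths under the mount point should then be readable from SparkR without setting the credentials again.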

score 3 · Accepted Answer

sparkConfig should be

"named list of Spark configuration to set on worker nodes."

So the correct format is

    sparkR.session(
      ... # All other options
      sparkConfig = list(
        spark.driver.memory = "2g",
        dfs.adls.oauth2.access.token.provider.type = "ClientCredential",
        dfs.adls.oauth2.client.id = "****",
        dfs.adls.oauth2.credential = "****",
        dfs.adls.oauth2.refresh.url = "https://login.microsoftonline.com/****/oauth2/token"
      )
    )

Please keep in mind that many configurations are recognized only if there is no active session.
