pyspark - Azure Databricks DataFrame write results in a job aborted error

Question

I am trying to write data to a CSV file and store the file on Azure Data Lake Gen2, and I am getting a job aborted error message. This same code used to work fine.

Error message:

org.apache.spark.SparkException: Job aborted.

Code:

import requests
from pyspark.sql import Row

# Fetch JSON records from the API
response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()

# Build a DataFrame from the records and write it out
df = spark.createDataFrame([Row(**i) for i in data])
df.write.format(source).mode("overwrite").save(path)  # error line

score 2 · Accepted Answer

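I have summarized the solution below.

To access Azure Data Lake Gen2 from Azure Databricks, there are two options.

1. Mount Azure Data Lake Gen2 as a file system in Azure Databricks. Once it is mounted, you can read and write files using a path under /mnt/<>. The mount code only needs to be run once.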

a. Create a service principal and assign it the Storage Blob Data Contributor role, scoped to the Data Lake Storage Gen2 storage account

 az login

 az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
--scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>

b. Code

 configs = {"fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<appId>",
  "fs.azure.account.oauth2.client.secret": "<clientSecret>",
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
  "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

 dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point = "/mnt/flightdata",
    extra_configs = configs)
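Because the mount only needs to be created once, re-running the cell above against an existing mount point will fail. A minimal sketch of a guard follows, assuming `dbutils.fs.mounts()` returns entries that expose a `mountPoint` attribute, as it does in Databricks notebooks. The helper name `mount_if_absent` is hypothetical and not part of the original answer; `dbutils` is passed in as an argument so the function can also be exercised outside a notebook.

```python
def mount_if_absent(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created, False if one was already present.
    `mount_if_absent` is an illustrative helper, not a Databricks API.
    """
    # Collect the mount points that are currently registered.
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True
```

In a notebook this would be called as `mount_if_absent(dbutils, "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1", "/mnt/flightdata", configs)`, which makes the cell safe to run repeatedly.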

2. Access the storage directly using the storage account access key.

We can add spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-name>") to our script. Then we can read and write files using the path abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.

For example

 from pyspark.sql.types import StringType

 # Authenticate to the storage account with its access key
 spark.conf.set(
   "fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")

 # Build a small single-column DataFrame and write it as a single part file under result_csv
 df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
 df.show()
 df.coalesce(1).write.format('csv').option('header', True).mode('overwrite').save('abfss://test@testadls05.dfs.core.windows.net/result_csv')
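The configuration key and the abfss path in the example above both follow fixed patterns built from the storage account and file system names, and a typo in either is easy to make. The sketch below builds both strings from those names. The helper names `account_key_conf` and `abfss_path` are hypothetical and only illustrate the patterns quoted in this answer.

```python
def account_key_conf(storage_account):
    """Return the Spark conf key that holds the access key for a storage account."""
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


def abfss_path(file_system, storage_account, relative_path=""):
    """Return the abfss:// URI for a path inside a file system (container)."""
    base = f"abfss://{file_system}@{storage_account}.dfs.core.windows.net/"
    # Drop any leading slash so the result never contains a double slash.
    return base + relative_path.lstrip("/")


print(account_key_conf("testadls05"))
# fs.azure.account.key.testadls05.dfs.core.windows.net
print(abfss_path("test", "testadls05", "result_csv"))
# abfss://test@testadls05.dfs.core.windows.net/result_csv
```

With these, the example reduces to `spark.conf.set(account_key_conf("testadls05"), "<account access key>")` followed by a write to `abfss_path("test", "testadls05", "result_csv")`.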

