I am trying to connect to BigQuery from the latest Databricks runtime (7.1+, Spark 3.0), using PySpark as the scripting/base language.
We run the PySpark script below to fetch data from a BigQuery table into Databricks:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('parentProject', 'google-project-ID')
    .config('spark.jars', 'jarlocation.jar')
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "file path")
    .option("parentProject", "google-project-ID")
    .option("project", "Dataset-Name")
    .option("table", "dataset.schema.tablename")
    .load()
)
After running the script, when we try to view the data, it comes back in a nested format:
{"visitId":"1607519947"},
{"visitStartTime":"1607519947"},
{"date":"20201209"},
{"totals":{"visits": 1, "hits": 1, "pageviews": 1, "timeOnSite": null, "bounces": 1, "transactions": null, "transactionRevenue": null, "newVisits": 1, "screenviews": null, "uniqueScreenviews": null, "timeOnScreen": null, "totalTransactionRevenue": null, "sessionQualityDim": 0}},
{"hits": [{"hitNumber": 1, "time": 0, "hour": 14, "minute": 19, "isExit": true, "referer": null,
"page": {"pagePath": "/nieuws/Post-hoc-analyse-naar-KRd-bij-18-maanden-na-randomisatie", "hostname": "www.amgenhematologie.nl", "pagePathLevel4": ""},
"transaction": {"transactionId": null, "transactionRevenue": null, "transactionTax": null, "transactionShipping": null, "affiliation": null},
"item": {"transactionId": null, "productName": null, "productCategory": null, "productSku": null, "itemQuantity": null, "itemRevenue": null, "currencyCode": "(not set)", "localItemRevenue": null},
"eventInfo": null,
"product": [],
"promotion": [],
"promotionActionInfo": null, "refund": null,
"eCommerceAction": {"action_type": "0", "step": 1, "option": null},
"experiment": [],
"publisher": null,
"customVariables": [],
"customDimensions": [],
"customMetrics": [],
"type": "PAGE",
"social": {"socialInteractionNetwork": null, "socialInteractionAction": null, "socialInteractions": null, "socialInteractionTarget": null, "socialNetwork": "(not set)", "uniqueSocialInteractions": null, "hasSocialSourceReferral": "No", "socialInteractionNetworkAction": " : "},
"dataSource": "web",
"publisher_infos": []}]}
The above is sample data in the nested format.
Here, the first 3 columns, visitId, visitStartTime, and date, are plain top-level columns.
The 4th column, totals, is nested and needs to be un-nested so that fields like totals.visits, totals.hits, etc. become separate column headers with their values, just like the first 3 columns.
The same applies to the 5th column, hits: it holds an array of nested dictionaries, and each field inside those dictionaries should likewise be un-nested into its own column, as I described for the 4th column above.
Is there a way to un-nest the data in PySpark while reading it directly from BigQuery?
Any help would be appreciated. Thanks in advance!