I am trying to connect to BigQuery from the latest Databricks runtime (7.1+, Spark 3.0), using PySpark as the scripting/base language.
We run the PySpark script below to fetch data from a BigQuery table into Databricks:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('parentProject', 'google-project-ID')
    .config('spark.jars', 'jarlocation.jar')
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "file path")
    .option("parentProject", "google-project-ID")
    .option("project", "Dataset-Name")
    .option("table", "dataset.schema.tablename")
    .load()
)
After running the script, when we try to view the data, it comes back in a nested format:
{"visitId":"1607519947"},
{"visitStartTime":"1607519947"},
{"date":"20201209"},
{"totals":{"visits": 1, "hits": 1, "pageviews": 1, "timeOnSite": null, "bounces": 1, "transactions": null, "transactionRevenue": null, "newVisits": 1, "screenviews": null, "uniqueScreenviews": null, "timeOnScreen": null, "totalTransactionRevenue": null, "sessionQualityDim": 0}},
{"hits": [{"hitNumber": 1, "time": 0, "hour": 14, "minute": 19, "isExit": true, "referer": null,
"page": {"pagePath": "/nieuws/Post-hoc-analyse-naar-KRd-bij-18-maanden-na-randomisatie", "hostname": "www.amgenhematologie.nl", "pagePathLevel4": ""},
"transaction": {"transactionId": null, "transactionRevenue": null, "transactionTax": null, "transactionShipping": null, "affiliation": null},
"item": {"transactionId": null, "productName": null, "productCategory": null, "productSku": null, "itemQuantity": null, "itemRevenue": null, "currencyCode": "(not set)", "localItemRevenue": null},
"eventInfo": null,
"product": [],
"promotion": [],
"promotionActionInfo": null, "refund": null,
"eCommerceAction": {"action_type": "0", "step": 1, "option": null},
"experiment": [],
"publisher": null,
"customVariables": [],
"customDimensions": [],
"customMetrics": [],
"type": "PAGE",
"social": {"socialInteractionNetwork": null, "socialInteractionAction": null, "socialInteractions": null, "socialInteractionTarget": null, "socialNetwork": "(not set)", "uniqueSocialInteractions": null, "hasSocialSourceReferral": "No", "socialInteractionNetworkAction": " : "},
"dataSource": "web",
"publisher_infos": []}]}
The above is sample data in the nested format.
Here, the first 3 columns, visitId, visitStartTime, and date, are plain top-level columns.
The 4th column, totals, is nested and needs to be un-nested so that fields like totals.visits, totals.hits, etc. become separate column headers with their values, just like the first 3 columns.
The same applies to the 5th column, hits: it holds an array of nested dictionaries, and each field inside those dictionaries should likewise be un-nested into its own column, as I described for the 4th column above.
Is there a way to un-nest the data in PySpark while reading it directly from BigQuery?
Any help would be appreciated. Thanks in advance!