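I am trying to compute statistics for a table and then read the statistics for individual columns, but every statistic comes back as NULL for every column. I'm not sure what mistake I might be making here.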
ordersSchemaDDL = "orderid Int, ordertime Timestamp, custid Int, Status String"
orders_df = spark.read \
    .format("csv") \
    .option("header", True) \
    .schema(ordersSchemaDDL) \
    .option("mode", "DROPMALFORMED") \
    .option("path", "orders.csv") \
    .load()
spark.sql("create database if not exists saveAsTable")
spark.sql("ANALYZE TABLE saveAsTable.orders_bucketed COMPUTE STATISTICS;")
spark.sql("DESCRIBE EXTENDED saveAsTable.orders_bucketed orderid;").show(truncate=False)
The orders table: as we can see, it contains plenty of data.
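+-------+-------------------+------+---------------+
|orderid|ordertime          |custid|Status         |
+-------+-------------------+------+---------------+
|1      |2013-07-25 00:00:00|11599 |CLOSED         |
|2      |2013-07-25 00:00:00|256   |PENDING_PAYMENT|
|3      |2013-07-25 00:00:00|12111 |COMPLETE       |
|4      |2013-07-25 00:00:00|8827  |CLOSED         |
|5      |2013-07-25 00:00:00|11318 |COMPLETE       |
|6      |2013-07-25 00:00:00|7130  |COMPLETE       |
+-------+-------------------+------+---------------+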
Statistics Output:
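+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |orderid   |
|data_type     |int       |
|comment       |NULL      |
|min           |NULL      |
|max           |NULL      |
|num_nulls     |NULL      |
|distinct_count|NULL      |
|avg_col_len   |NULL      |
|max_col_len   |NULL      |
|histogram     |NULL      |
+--------------+----------+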
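
A possible explanation, offered as a sketch rather than a confirmed diagnosis: ANALYZE TABLE ... COMPUTE STATISTICS on its own collects only table-level statistics (size in bytes and row count), while per-column values such as min, max and distinct_count are collected only when the statement also names columns with FOR COLUMNS. The helper below, compute_column_stats, is a hypothetical name introduced for illustration; it assumes the table already exists in the metastore and that the spark argument is an active SparkSession.

```python
def compute_column_stats(spark, table, columns):
    """Issue a table-level ANALYZE, then a column-level one.

    `spark` is assumed to be an active SparkSession (anything with a
    .sql(str) method works); `table` and `columns` are supplied by the
    caller. Returns the SQL statements that were issued, in order.
    """
    statements = [
        # Table-level statistics only: size in bytes and row count.
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
        # Column-level statistics for the listed columns.
        f"ANALYZE TABLE {table} COMPUTE STATISTICS "
        f"FOR COLUMNS {', '.join(columns)}",
    ]
    for stmt in statements:
        spark.sql(stmt)
    return statements
```

With a live session, calling compute_column_stats(spark, "saveAsTable.orders_bucketed", ["orderid"]) and then re-running the DESCRIBE EXTENDED query above should show whether the column statistics get populated. On Spark 3.0 and later, FOR ALL COLUMNS can be used instead of listing column names.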