
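I'm trying to compute statistics and then retrieve the statistics for individual columns, but every statistic comes back as NULL for every column. I'm not sure what mistake I might be making here.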

ordersSchemaDDL = "orderid Int, ordertime Timestamp, custid Int, Status String"

orders_df = spark.read \
    .format("csv") \
    .option("header",True) \
    .schema(ordersSchemaDDL) \
    .option("mode","DROPMALFORMED") \
    .option("path","orders.csv") \
    .load()

spark.sql("create database if not exists saveAsTable")

spark.sql("ANALYZE TABLE saveAsTable.orders_bucketed COMPUTE STATISTICS;")
spark.sql("DESCRIBE EXTENDED saveAsTable.orders_bucketed orderid;").show(truncate=False)

Orders table: we can see that it contains plenty of data

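    +-------+-------------------+------+---------------+
    |orderid|          ordertime|custid|         Status|
    +-------+-------------------+------+---------------+
    |      1|2013-07-25 00:00:00| 11599|         CLOSED|
    |      2|2013-07-25 00:00:00|   256|PENDING_PAYMENT|
    |      3|2013-07-25 00:00:00| 12111|       COMPLETE|
    |      4|2013-07-25 00:00:00|  8827|         CLOSED|
    |      5|2013-07-25 00:00:00| 11318|       COMPLETE|
    |      6|2013-07-25 00:00:00|  7130|       COMPLETE|
    +-------+-------------------+------+---------------+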
    



  Statistics Output:
   info_name     info_value

    col_name       orderid   
    data_type      int       
    comment        NULL      
    min            NULL      
    max            NULL      
    num_nulls      NULL      
    distinct_count NULL      
    avg_col_len    NULL      
    max_col_len    NULL      
    histogram      NULL      
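
One thing that may explain the NULLs: in Spark SQL, a plain ANALYZE TABLE ... COMPUTE STATISTICS collects table-level statistics only (row count and size in bytes), while the per-column values shown by DESCRIBE EXTENDED (min, max, num_nulls, distinct_count and so on) are filled in only after a FOR COLUMNS or FOR ALL COLUMNS variant has been run. Below is a minimal sketch of that variant. The helper names column_stats_sql and compute_column_stats are hypothetical and exist only for illustration; the sketch assumes the table saveAsTable.orders_bucketed has already been written and that spark is an active SparkSession.

```python
def column_stats_sql(table, columns=None):
    """Build an ANALYZE TABLE statement that collects column-level statistics.

    Hypothetical helper for illustration, not part of the Spark API.
    """
    base = f"ANALYZE TABLE {table} COMPUTE STATISTICS"
    if columns:
        # Collect statistics only for the named columns.
        return f"{base} FOR COLUMNS {', '.join(columns)}"
    # FOR ALL COLUMNS is available in Spark 3.0 and later.
    return f"{base} FOR ALL COLUMNS"


def compute_column_stats(spark, table, columns=None):
    """Run the statement on an existing SparkSession and return the SQL used."""
    stmt = column_stats_sql(table, columns)
    spark.sql(stmt)
    return stmt
```

With this in place, calling compute_column_stats(spark, "saveAsTable.orders_bucketed", ["orderid"]) and then re-running the DESCRIBE EXTENDED saveAsTable.orders_bucketed orderid query should show populated values, provided the table actually holds the data.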