I am trying to read Excel files from COS with Spark, like this:

    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    def readExcelData(filePath: String, spark: SparkSession): DataFrame =
      spark.read
        .format("com.crealytics.spark.excel")
        .option("path", filePath)
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "false")
        .option("addColorColumns", "false")
        .load()

    def readAllFiles(spark: SparkSession): DataFrame = {
      import spark.implicits._

      val objLst: Seq[String] = ??? // contains the list of file paths
      val schema = StructType(
        StructField("col1", StringType, true) ::
          StructField("col2", StringType, true) ::
          StructField("col3", StringType, true) ::
          StructField("col4", StringType, true) :: Nil
      )

      // Start from an empty DataFrame with the target schema,
      // then union each file's data into it.
      var initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      for (file <- objLst) {
        initialDF = initialDF.union(
          readExcelData(file, spark).select($"col1", $"col2", $"col3", $"col4"))
      }
      initialDF
    }

In this code, I first create an empty DataFrame, then read all the Excel files (by iterating over the file paths) and merge the data with union.
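The same merge can also be written as a fold over the path list instead of a mutable loop; here is a minimal sketch reusing `readExcelData` from above (the wrapper name `readAllFilesFolded` is just illustrative):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    // Illustrative variant: read each path, project the four columns,
    // and fold the resulting DataFrames together with union.
    def readAllFilesFolded(objLst: Seq[String], spark: SparkSession): DataFrame =
      objLst
        .map(path => readExcelData(path, spark)
          .select(col("col1"), col("col2"), col("col3"), col("col4")))
        .reduce(_ union _) // assumes objLst is non-empty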

It throws an error like this:

    java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
        at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)

The spark-excel version is 0.10.2.
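For what it's worth, `InputStreamStatistics` only exists in commons-compress 1.17 and later, so this exception usually indicates that an older commons-compress is on the classpath than the one Apache POI expects. A minimal sketch of pinning the dependency in sbt, assuming an sbt build (the 1.18 version number is illustrative):

    // build.sbt — illustrative only: force a commons-compress new enough
    // to implement InputStreamStatistics (introduced in 1.17)
    dependencyOverrides += "org.apache.commons" % "commons-compress" % "1.18"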

1 Answer

Try removing the .show() from your original statement and converting it to a DataFrame first.

    def readExcel(file: String): DataFrame =
      spark.read
        .format("com.crealytics.spark.excel")
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "false")
        .option("addColorColumns", "false")
        .load(file) // pass the path into load(); the original never used `file`

    val data = readExcel("path to your excel file")

    data.show()
answered 2019-10-03T15:01:55.157