apache-spark - Spark logs "min key = null, max key = null" while reading an ORC file

Question

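I am trying to use Spark to join a DataFrame (say, 100 records) with an ORC file containing 100 million records (this may grow to 4-5 billion, at 25 bytes per record). The ORC file was also created with the Spark hiveContext API.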

ORC file creation code

//fsdtRdd is JavaRDD, fsdtSchema is StructType schema
DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd,fsdtSchema);
fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");

ORC file reading code

HiveContext hiveContext = new HiveContext(sparkContext);
DataFrame orcFileData= hiveContext.read().orc("orcFileToRead");
// allRecords is dataframe
// "left_outer" is the join type string Spark accepts; "left_outer_join" is not a recognized value
DataFrame processDf = allRecords.join(orcFileData,allRecords.col("id").equalTo(orcFileData.col("id").as("ID")),"left_outer");
processDf.show();

Spark log while reading (local run)

Input split: file:/C:/spark/orcFileToRead/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348
min key = null, max key = null
Reading ORC rows from file:/C:/spark/orcFileToRead/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc with {include: [true, true, true], offset: 0, length: 9223372036854775807}
Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 56,PROCESS_LOCAL, 2220 bytes)
Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
Running task 56.0 in stage 2.0 (TID 60)

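Although the Spark job completes successfully, I suspect it is not using ORC's index capability and is therefore scanning every block of ORC data before proceeding.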
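For context, the index capability in question rests on min/max statistics that ORC keeps per stripe and row group: a reader that is handed a predicate can skip any chunk whose value range cannot contain a match. The following is only a toy model of that idea in plain Java, not ORC's actual reader; `StripeSkipDemo`, `Stripe`, and `lookup` are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: this is NOT ORC's reader, just the min/max pruning idea.
class StripeSkipDemo {

    // Hypothetical stand-in for one stripe plus its min/max statistics on "id".
    static final class Stripe {
        final long minId;
        final long maxId;
        final long[] ids;

        Stripe(long... ids) {
            this.ids = ids;
            this.minId = Arrays.stream(ids).min().getAsLong();
            this.maxId = Arrays.stream(ids).max().getAsLong();
        }

        // False means the stripe can be skipped: no wanted id lies in [minId, maxId].
        boolean mayContain(List<Long> wanted) {
            for (long w : wanted) {
                if (w >= minId && w <= maxId) return true;
            }
            return false;
        }
    }

    // Returns the matching ids; scanned[0] is incremented once per stripe actually read.
    static List<Long> lookup(List<Stripe> stripes, List<Long> wanted, int[] scanned) {
        List<Long> hits = new ArrayList<>();
        for (Stripe s : stripes) {
            if (!s.mayContain(wanted)) continue; // pruned using statistics alone
            scanned[0]++;
            for (long id : s.ids) {
                if (wanted.contains(id)) hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = Arrays.asList(
            new Stripe(1, 2, 3), new Stripe(100, 150, 200), new Stripe(1000, 1500));
        int[] scanned = new int[1];
        List<Long> hits = lookup(stripes, Arrays.asList(150L, 1200L), scanned);
        // prints "[150] found, stripes read: 2"
        System.out.println(hits + " found, stripes read: " + scanned[0]);
    }
}
```

Note that the third stripe is still read even though it holds no match, because 1200 falls inside its range. Pruning of this kind only pays off when the key values are reasonably clustered within each chunk; if ids are spread randomly across the file, every range tends to cover every lookup key.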

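Questions

-- Is this normal behavior, or do I have to set some configuration before saving the data in ORC format?

-- If it is normal, what is the best way to do the join so that non-matching records are discarded at the disk level (perhaps by loading only the index data of the ORC file)?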
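One direction that may be worth trying is sketched below, not as a confirmed fix. It assumes the Spark 1.x DataFrame API used in the question, and that `allRecords` is small enough (about 100 rows) to collect its ids to the driver. ORC predicate pushdown is controlled by `spark.sql.orc.filterPushdown`, which is off by default in Spark 1.x. A join condition alone gives the reader no constant predicate to push, so the sketch adds an explicit `isin` filter on the ORC side. The class and method names (`OrcLookupJoinSketch`, `joinWithPushdown`) are hypothetical, and whether any stripes are actually skipped depends on the Spark version and on how `id` values are laid out in the files.

```java
import java.util.List;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

// Sketch against the Spark 1.x DataFrame API used in the question (hypothetical class name).
class OrcLookupJoinSketch {

    static DataFrame joinWithPushdown(HiveContext hiveContext, DataFrame allRecords) {
        // Allow filters on ORC columns to be handed to the ORC reader.
        hiveContext.setConf("spark.sql.orc.filterPushdown", "true");

        // Assumption: allRecords is tiny, so collecting its distinct ids is cheap.
        List<Row> idRows = allRecords.select("id").distinct().collectAsList();
        Object[] ids = new Object[idRows.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = idRows.get(i).get(0);
        }

        // Explicit predicate on the large side; the reader may use it to skip data.
        DataFrame orcFileData = hiveContext.read().orc("orcFileToRead");
        DataFrame candidates = orcFileData.filter(orcFileData.col("id").isin(ids));

        // The left outer join result is unchanged: ORC rows whose id is not in
        // allRecords could never have matched a left-side row anyway.
        return allRecords.join(candidates,
                allRecords.col("id").equalTo(candidates.col("id")), "left_outer");
    }
}
```

Even if nothing is pruned at the file level, the filter shrinks the right side before the join, so the shuffle is limited to candidate rows. Sorting or clustering the data by `id` before writing the ORC file would tighten the per-stripe min/max ranges and give the pushed-down predicate a better chance of skipping data.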
