apache-spark - What does each part of a Parquet file name written by Apache Hudi represent?

Question

Apache Hudi names each parquet file it writes like this:

0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet

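I'm trying to understand what each part of the file name represents. Here is my current understanding, but I'd appreciate confirmation or clarification from anyone who knows.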

0743209d-51cb-4233-a7cd-5bb712fba1ff = file group/file name

-0 = file chunk

20211117172738 = timestamp of the batch

I'm not sure what the following part represents:

21-64-5300 = ?

score 0 · Accepted Answer

Here is what I found:

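Hudi data file name format, using 0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet as the example:
The first part (0743209d-51cb-4233-a7cd-5bb712fba1ff-0) is the file ID, a unique identifier of the file group.
The next part (21-64-5300) is the write token.
The last part (20211117172738) is the commit time, passed as instantTime in the code.
The write token helps detect failed Spark writes.

The parts are joined by this method: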

public static String makeDataFileName(String instantTime, String writeToken, String fileId, String fileExtension) {
  return String.format("%s_%s_%s%s", fileId, writeToken, instantTime, fileExtension);
}
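
To check the format against the file name from the question, here is a minimal sketch that splits a name back into the four arguments of makeDataFileName and then rebuilds it. The class HudiFileNameParts and its parse method are hypothetical helpers written for this answer, not part of Hudi. The regex assumes the file ID and write token contain no underscores. As far as I can tell from the Hudi source, the three numbers in the write token are the Spark task partition ID, stage ID, and task attempt ID, but treat that as an assumption and verify it against your Hudi version.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Hudi API.
public class HudiFileNameParts {
    // Assumed layout: <fileId>_<writeToken>_<instantTime><fileExtension>
    private static final Pattern NAME =
        Pattern.compile("^(.+)_([^_]+)_([^_.]+)(\\.[^_]+)$");

    public final String fileId;
    public final String writeToken;
    public final String instantTime;
    public final String fileExtension;

    private HudiFileNameParts(String fileId, String writeToken,
                              String instantTime, String fileExtension) {
        this.fileId = fileId;
        this.writeToken = writeToken;
        this.instantTime = instantTime;
        this.fileExtension = fileExtension;
    }

    // Splits a data file name into its parts; throws if the name does not match.
    public static HudiFileNameParts parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return new HudiFileNameParts(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    // Same format string as the method quoted in the answer above.
    public static String makeDataFileName(String instantTime, String writeToken,
                                          String fileId, String fileExtension) {
        return String.format("%s_%s_%s%s", fileId, writeToken, instantTime, fileExtension);
    }

    public static void main(String[] args) {
        String name =
            "0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet";
        HudiFileNameParts p = parse(name);
        System.out.println("fileId        = " + p.fileId);
        System.out.println("writeToken    = " + p.writeToken);
        System.out.println("instantTime   = " + p.instantTime);
        System.out.println("fileExtension = " + p.fileExtension);
        // Rebuilding from the parts should give back the original name.
        String rebuilt =
            makeDataFileName(p.instantTime, p.writeToken, p.fileId, p.fileExtension);
        System.out.println("round trip ok = " + rebuilt.equals(name));
    }
}
```

Note that the trailing -0 from the question ends up inside fileId rather than being a separate field, because the format string only inserts underscores between the parts.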
