apache-beam - TextIO.Read().From() vs TextIO.ReadFiles() over withHintMatchesManyFiles()

Question

In my use case, I get a set of matching file patterns from Kafka:

PCollection<String> filepatterns = p.apply(KafkaIO.read()...);

Here, each pattern can match up to 300+ files.

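Q1. How can I use TextIO.Read() to match data from a PCollection, given that withHintMatchesManyFiles() is available only on TextIO.Read() and not on TextIO.ReadFiles()?

Q2. If I go through the FileIO.Match -> FileIO.ReadMatch() -> TextIO.ReadFiles() path, where withHintMatchesManyFiles() is not available, how will that affect read performance?

Q3. Is there any other optimized path for the use case above?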

score 1 · Accepted Answer

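Yes, you can't use withHintMatchesManyFiles() with TextIO.ReadFiles() out of the box. In fact, TextIO.Read().withHintMatchesManyFiles() is itself implemented with FileIO transforms + TextIO.ReadFiles() (see details). That way, FileIO.readMatches() should distribute the files to be read across the workers.

So I think you can use the same approach when reading file names from a Kafka topic.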
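A minimal sketch of that approach, assuming the Kafka records have already been mapped to a PCollection<String> of file patterns (the KafkaIO configuration is elided in the question and is not reproduced here). The class and method names are hypothetical, chosen only for illustration:

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper class, not part of Beam.
public class ReadLinesFromPatterns {

  /** Expands each incoming file pattern and reads the matched files line by line. */
  public static PCollection<String> readLines(PCollection<String> filepatterns) {
    return filepatterns
        // Expand every pattern into metadata for the files it matches.
        .apply("MatchPatterns", FileIO.matchAll())
        // Turn each match into a readable file handle.
        .apply("OpenMatches", FileIO.readMatches())
        // Read each file as lines of text.
        .apply("ReadLines", TextIO.readFiles());
  }
}
```

It would be applied as `ReadLinesFromPatterns.readLines(filepatterns)` on the collection from the question. Whether this performs as well as the hinted TextIO.Read() for a given runner is something to measure rather than assume.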

score 0 · Accepted Answer

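How can I use TextIO.Read() to match data from a PCollection, given that withHintMatchesManyFiles() is available only on TextIO.Read() and not on TextIO.ReadFiles()?

My understanding of Apache Beam, and of PTransforms in particular, is quite limited, but TextIO.read() creates a root PTransform, which can only be used at the very beginning of a pipeline. In other words, TextIO.Read cannot be applied after any other kind of PTransform.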
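For contrast, here is a small sketch of how the hinted read is applied. TextIO.read() is applied to the Pipeline itself, and the file pattern is supplied when the pipeline is built, which is why it cannot consume patterns arriving in a PCollection. The class name, method name, and the example path are hypothetical:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper class, not part of Beam.
public class ReadFromFixedPattern {

  /** Reads all files matching a pattern that is known at pipeline construction time. */
  public static PCollection<String> readLines(Pipeline pipeline, String filepattern) {
    // TextIO.read() is a root transform: it is applied to the Pipeline, not to a PCollection.
    return pipeline.apply(
        "ReadLines", TextIO.read().from(filepattern).withHintMatchesManyFiles());
  }
}
```

A call such as `ReadFromFixedPattern.readLines(p, "/tmp/input/*.txt")` (an assumed path) only fits cases where the pattern does not come from upstream data.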
