
I am trying to use NiFi to process large CSV files (potentially billions of records each) using HDF 1.2. I've implemented my flow, and everything is working fine for small files.

The problem is that if I try to push the file size to 100MB (1M records) I get a java.lang.OutOfMemoryError: GC overhead limit exceeded from the SplitText processor responsible for splitting the file into single records. I've searched for that error, and it basically means that the garbage collector is running for too long without reclaiming much heap space. I suspect this means that too many flow files are being generated too fast.

How can I solve this? I've tried changing NiFi's configuration for the max heap space and other memory-related properties, but nothing seems to work.

Right now I've added an intermediate SplitText with a line count of 1K, which lets me avoid the error, but I don't see this as a solid solution: when the incoming file becomes potentially much larger than that, I'm afraid I'll get the same behavior from the processor.
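For clarity, the current two-stage setup is roughly the following (the property name is the one exposed by the SplitText processor; the values are the ones I'm using now):

SplitText (first pass)  - Line Split Count: 1000
SplitText (second pass) - Line Split Count: 1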

Any suggestion is welcome! Thank you


2 Answers


The reason for the error is that when splitting 1M lines with a line count of 1, you are creating 1M flow files, which equate to 1M Java objects. Overall, the approach of using two SplitText processors is common and avoids creating all of those objects at the same time. You could probably use an even larger split size on the first split, maybe 10k. For a billion records I wonder if a third level would make sense, splitting from 1B down to maybe 10M, then from 10M to 10K, then from 10K to 1, but I would have to experiment with it.
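As a rough sketch, that three-level chain for a billion-line file might look like this (the split sizes are just the ballpark figures mentioned above, not tuned values):

SplitText #1 - Line Split Count: 10000000   (1B lines -> ~100 flow files of 10M lines each)
SplitText #2 - Line Split Count: 10000      (each 10M-line flow file -> 1,000 flow files of 10K lines)
SplitText #3 - Line Split Count: 1          (each 10K-line flow file -> 10,000 single-record flow files)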

Some other things to consider are increasing the default heap size from 512MB, which you may have already done, and also figuring out whether you really need to split all the way down to 1 line. It's hard to say without knowing anything else about the flow, but in many cases, if you want to deliver each line somewhere, you could have a single processor that reads in a large delimited file and streams each line to the destination. This is how PutKafka and PutSplunk work, for example; they can take a file with 1M lines and stream each line to the target.
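On the heap size point: the default heap is configured in conf/bootstrap.conf. A minimal sketch of the change (the java.arg indices may differ slightly between NiFi/HDF versions, and the 4g value is just an example to size against your available memory):

# conf/bootstrap.conf - defaults shipped with NiFi
java.arg.2=-Xms512m
java.arg.3=-Xmx512m

# example: raise the heap before restarting NiFi
java.arg.2=-Xms1g
java.arg.3=-Xmx4g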

answered 2016-07-29T12:44:37.200

I ran into a similar error while using the GetMongo processor in Apache NiFi. I changed the configuration to:

Limit: 100
Batch Size: 10

After that, the error went away.

answered 2021-10-12T11:34:12.167