I am trying to use NiFi (HDF 1.2) to process large CSV files, each potentially containing billions of records. I've implemented my flow, and everything works fine for small files.
The problem is that as soon as I push the file size to around 100 MB (~1M records), I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
from the SplitText processor responsible for splitting the file into single records. From what I've read, this error basically means the garbage collector is running for too long without reclaiming much heap space. I suspect this means too many flow files are being generated too quickly.
How can I solve this? I've tried changing NiFi's max heap size and other memory-related properties in the configuration, but nothing seems to help.
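For example, I've bumped the JVM heap in conf/bootstrap.conf (these are just the values I tried, assuming the standard java.arg entries for min/max heap):

    # conf/bootstrap.conf - heap settings I experimented with
    java.arg.2=-Xms4096m
    java.arg.3=-Xmx4096m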
For now I've added an intermediate SplitText with a line count of 1K, which lets me avoid the error, but I don't see this as a solid solution: once the incoming files grow much larger, I'm afraid I'll get the same behavior from that processor as well.
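In case it helps, the workaround currently looks roughly like this (the ingest and downstream processors are placeholders for my actual flow; Line Split Count is the SplitText property I'm tuning, and the 1000 value is just what happened to work for me):

    [ingest]
       -> SplitText (intermediate split)
            Line Split Count: 1000
       -> SplitText (per-record split)
            Line Split Count: 1
       -> [per-record processing]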
Any suggestion is welcome! Thank you.