我将大小约为 650GB 的 BigQuery 数据集导出到 GCS 上的 Avro 文件,并运行数据流程序来处理这些 Avro 文件。但是,即使只处理一个大小约为 1.31GB 的 Avro 文件,也会遇到 OutOfMemoryError 异常。
我收到以下错误消息,似乎异常源于 AvroIO 和 Avro 库:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:260)
at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:341)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
at com.google.cloud.dataflow.sdk.runners.worker.AvroReader$AvroFileIterator.next(AvroReader.java:143)
at com.google.cloud.dataflow.sdk.runners.worker.AvroReader$AvroFileIterator.next(AvroReader.java:113)
at com.google.cloud.dataflow.sdk.util.ReaderUtils.readElemsFromReader(ReaderUtils.java:37)
at com.google.cloud.dataflow.sdk.io.AvroIO.evaluateReadHelper(AvroIO.java:638)
at com.google.cloud.dataflow.sdk.io.AvroIO.access$000(AvroIO.java:118)
at com.google.cloud.dataflow.sdk.io.AvroIO$Read$Bound$1.evaluate(AvroIO.java:294)
at com.google.cloud.dataflow.sdk.io.AvroIO$Read$Bound$1.evaluate(AvroIO.java:290)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:611)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:584)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:328)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:145)
at com.htc.studio.bdi.dataflow.ActTranGenerator.main(ActTranGenerator.java:224)
对这个例外有什么建议吗?
谢谢!