I'm trying to use Spark AQE to dynamically coalesce shuffle partitions before writing. By default, Spark creates too many small files. However, the AQE feature claims that enabling it will optimize this by merging small shuffle partitions into bigger ones, which should also reduce the number of output files. This is critical for AWS S3 users like me, because having too many small files makes later reads slow and request-heavy.
Here is my Spark configuration:
[('spark.executor.extraJavaOptions', '-XX:+UseG1GC'),
('spark.executor.id', 'driver'),
('spark.driver.extraJavaOptions', '-XX:+UseG1GC'),
('spark.driver.memory', '16g'),
('spark.sql.adaptive.enabled', 'true'),
('spark.app.name', 'pyspark-shell'),
('spark.sql.adaptive.coalescePartitions.minPartitionNum', '5'),
('spark.app.startTime', '1614929855179'),
('spark.sql.adaptive.coalescePartitions.enabled', 'true'),
('spark.driver.port', '34447'),
('spark.executor.memory', '16g'),
('spark.driver.host', '2b7345ffcf3e'),
('spark.rdd.compress', 'true'),
('spark.serializer.objectStreamReset', '100'),
('spark.master', 'local[*]'),
('spark.submit.pyFiles', ''),
('spark.submit.deployMode', 'client'),
('spark.app.id', 'local-1614929856024'),
('spark.ui.showConsoleProgress', 'true')]
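For completeness, this is roughly how the session is created (the AQE options match the dump above; everything else is left at its default):

from pyspark.sql import SparkSession

# Build a local session with AQE and partition coalescing enabled,
# matching the configuration values listed above.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "16g")
    .config("spark.executor.memory", "16g")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "5")
    .getOrCreate()
)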
The required AQE parameters are all enabled, and I also see AdaptiveSparkPlan isFinalPlan=true
in the execution plan. When I run a small job (read a CSV, do some calculations, do a join, and write to Parquet), it still generates many small files in the Parquet folder. Am I missing something, or is this feature not doing what it promises?
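For reference, here is a simplified sketch of the job; the bucket paths, column names, and the calculation itself are just placeholders, not my real data:

# Read two CSV inputs (paths are placeholders).
df = spark.read.csv("s3a://my-bucket/input/facts.csv", header=True, inferSchema=True)
dim = spark.read.csv("s3a://my-bucket/input/dim.csv", header=True, inferSchema=True)

# Some calculations followed by a join, which introduces a shuffle.
result = (
    df.withColumn("amount_adjusted", df["amount"] * 1.1)
      .join(dim, on="key", how="inner")
)

# AQE should coalesce the post-shuffle partitions here, but the output
# folder still ends up containing many small Parquet files.
result.write.mode("overwrite").parquet("s3a://my-bucket/output/result/")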