amazon-s3 - How to improve query performance on S3 data from Athena

Question

I have data stored in S3, partitioned in Hive format like this:

bucket/year=2017/month=3/date=1/filename.json
bucket/year=2017/month=3/date=2/filename1.json
bucket/year=2017/month=3/date=3/filename2.json

Each partition has about 1,000,000 records. I have created a table and partitions in Athena for this data.

Now I run this query from Athena:

select count(*) from mts_data_1 where year='2017' and month='3' and date='1'

This query takes 1800 seconds to scan 1,000,000 records.

So my question is: how can I improve the performance of this query?

score 1 · Accepted Answer

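I think the problem is that Athena has to read so many files from S3. 250 MB is not much data, but 1,000,000 files is a lot of files. Athena query performance will improve dramatically if you reduce the number of files, and compressing the aggregated files will help further. How many files do you need for one day's partition? Even at one-minute resolution, you would need fewer than 1,500 files. If the current query time is around 30 minutes, you could easily start with far fewer files than that.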
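As a rough illustration of the aggregation step, here is a minimal sketch that merges many small JSON files into a few gzip-compressed, newline-delimited JSON files. It works on a local directory and assumes each source file holds a single JSON object; the function name `compact_partition`, the `part-NNNNN.json.gz` naming, and the `records_per_file` default are made up for the example, and copying files from and to S3 is left out.

```python
import gzip
import json
from pathlib import Path


def compact_partition(src_dir: Path, dst_dir: Path, records_per_file: int = 100_000) -> int:
    """Merge small JSON files into gzip-compressed, newline-delimited JSON files.

    Assumes every *.json file in src_dir contains exactly one JSON object.
    Returns the number of output files written to dst_dir.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    out = None        # currently open output file, if any
    written = 0       # records written to the current output file
    file_count = 0    # output files created so far

    for src in sorted(src_dir.glob("*.json")):
        record = json.loads(src.read_text(encoding="utf-8"))
        # Start a new output file when none is open or the current one is full.
        if out is None or written >= records_per_file:
            if out is not None:
                out.close()
            file_count += 1
            out = gzip.open(dst_dir / f"part-{file_count:05d}.json.gz", "wt", encoding="utf-8")
            written = 0
        # One record per line, which is the layout a JSON SerDe expects.
        out.write(json.dumps(record) + "\n")
        written += 1

    if out is not None:
        out.close()
    return file_count
```

With the default of 100,000 records per file, a 1,000,000-record partition would end up as 10 files instead of 1,000,000. For comparison, one file per minute gives 24 × 60 = 1,440 files per day, which is where the "fewer than 1,500" figure comes from.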

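There are many options for aggregating and compressing records:

AWS's Kinesis Firehose is a fairly simple way to solve this kind of problem.
Streaming data processing tools like Apache NiFi offer richer options for transformation, aggregation, and compression. I wrote a blog post on using Apache NiFi to stream data to S3 for Athena that covers these same issues.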
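Whichever tool you pick, the core idea is the same: buffer incoming records and write them out as one batch once a size or time limit is reached. The sketch below shows that idea only; it is not how Firehose or NiFi is implemented, and the `RecordBuffer` class, its parameters, and the `flush` callback are hypothetical names for the example.

```python
import time
from typing import Callable, List, Optional


class RecordBuffer:
    """Collect records and hand them to `flush` as one batch.

    A batch is flushed when it reaches max_records, or when a new record
    arrives and the batch is at least max_age_seconds old. There is no
    background timer, so call close() to write out the final batch.
    """

    def __init__(
        self,
        flush: Callable[[List[str]], None],
        max_records: int = 10_000,
        max_age_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._records: List[str] = []
        self._opened_at: Optional[float] = None

    def add(self, record: str) -> None:
        now = self._clock()
        # Write out a batch that has grown too old before starting a new one.
        if self._records and now - self._opened_at >= self._max_age:
            self.close()
        if not self._records:
            self._opened_at = now
        self._records.append(record)
        # Write out a batch that has reached the size limit.
        if len(self._records) >= self._max_records:
            self.close()

    def close(self) -> None:
        """Flush whatever is buffered, if anything."""
        if self._records:
            self._flush(self._records)
            self._records = []
            self._opened_at = None
```

In practice the `flush` callback would compress the batch and upload it to the right partition prefix in S3, so each partition receives a handful of larger objects rather than one object per record.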