
We currently generate a daily CSV export that we upload to an S3 bucket, into the following structure:

<report-name>
|-- reportDate-<date-stamp>
    |-- part0.csv.gz
    |-- part1.csv.gz

We want to be able to run reports partitioned by daily export.
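For reference, this is roughly the partitioned external table we have in mind for that layout (the schema, table, column, and bucket names below are just placeholders):

create external table spectrum.daily_report (
    col1 varchar(100),
    col2 varchar(100)
)
partitioned by (reportdate date)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/my-report-name/';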

According to this page, you can partition data in Redshift Spectrum by a key which is based on the source S3 folder where your Spectrum table sources its data. However, from the example, it looks like you need an ALTER statement for each partition:

alter table spectrum.sales_part
add partition(saledate='2008-01-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';

alter table spectrum.sales_part
add partition(saledate='2008-02-01') 
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';

Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?


2 Answers


Solution 1:

You can create up to 20,000 partitions per table, so you can write a one-time script that adds partitions for all of your future s3 partition folders in advance (up to the 20k limit).

For example:

You can even add a partition for the folder s3://bucket/ticket/spectrum/sales_partition/saledate=2017-12/ before that folder exists:

alter table spectrum.sales_part
add partition(saledate='2017-12-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
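Recent Redshift releases also accept multiple PARTITION clauses in a single ALTER TABLE ... ADD statement, so a pre-generated script can batch them roughly like this (the dates are just an illustration; if your cluster version rejects the batched form, issue one statement per partition):

alter table spectrum.sales_part add if not exists
partition(saledate='2018-01-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-01/'
partition(saledate='2018-02-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-02/'
partition(saledate='2018-03-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-03/';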

Solution 2:

https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/

Answered 2017-11-10T07:17:22.327

Another, more precise approach: create a Lambda job that is triggered by the S3 bucket's ObjectCreated notifications and then runs SQL to add the partition:

alter table tblname ADD IF NOT EXISTS PARTITION (partition clause) LOCATION 's3://mybucket/location';
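Concretely, for the per-day folders described in the question, the Lambda would parse the date out of the object key in the ObjectCreated event and run something along these lines (the table, column, and bucket names are placeholders):

alter table spectrum.daily_report
add if not exists partition(reportdate='2018-12-24')
location 's3://my-bucket/my-report-name/reportDate-2018-12-24/';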

Answered 2018-12-24T12:39:58.790