10

I have thousands of individual JSON files (each corresponding to one table row) stored in S3 with the following path: s3://my-bucket/<date>/dataXX.json

When I create my table in DDL, is it possible to have the data partitioned by the <date> present in the S3 path (or at least to add that value as a new column)?

Thanks

4 Answers

11

Sadly this is not supported in Athena. For partitions to be picked up from folders automatically, the folders must be named in the key=value form.

e.g. s3://my-bucket/{columnname}={columnvalue}/data.json

In your case, you can still use partitioning if you add those partitions manually to the table.

e.g. ALTER TABLE tablename ADD PARTITION (datecolumn='2017-01-01') location 's3://my-bucket/2017-01-01/'
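
With many date folders, several partitions can be registered in one statement. This is a minimal sketch, assuming the table was created with PARTITIONED BY (datecolumn string) and that a second folder, 2017-01-02, exists (that folder is hypothetical):

```sql
-- Register two date folders at once.
-- IF NOT EXISTS keeps the statement safe to re-run.
ALTER TABLE tablename ADD IF NOT EXISTS
  PARTITION (datecolumn = '2017-01-01') LOCATION 's3://my-bucket/2017-01-01/'
  PARTITION (datecolumn = '2017-01-02') LOCATION 's3://my-bucket/2017-01-02/';

-- Run as a separate query: filtering on the partition column
-- limits the scan to the matching folder.
SELECT * FROM tablename WHERE datecolumn = '2017-01-01';
```

Each new date folder still needs its own ADD PARTITION, so this approach suits layouts where new folders appear at a predictable pace.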

The AWS docs have some good examples on that topic.

AWS Athena Partitioning

answered 2017-03-01T10:04:30.843
3

It is now possible to do this with partition projection and storage.location.template, which partitions by some part of your path. Be sure NOT to include the new column in the column list, as it is added automatically. There are a lot of options you can look up to tune this for your date example. I used "id" to show the simplest version I could think of.

CREATE EXTERNAL TABLE `some_table`(
  `col1` bigint)
PARTITIONED BY (
  `id` string
  )
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://path/bucket/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'projection.enabled'='true', 
  'projection.id.type' = 'injected',
  'storage.location.template'='s3://path/bucket/${id}/'
  )

official docs: https://docs.amazonaws.cn/en_us/athena/latest/ug/partition-projection-dynamic-id-partitioning.html
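
For the date-based layout in the question, a date-typed projection may be a closer fit than injected, since an injected column has to be pinned to specific values in the WHERE clause of every query. The following is a sketch only: the table name some_table_by_date, the partition column dt, the single col1 column, the yyyy-MM-dd folder format, and the 2017-01-01 start of the range are all assumptions, not details from the question.

```sql
CREATE EXTERNAL TABLE some_table_by_date (
  col1 bigint
)
-- dt is not listed among the columns above; it is derived from the path.
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- Treat dt as a date and enumerate one partition per day.
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2017-01-01,NOW',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  -- Map each dt value onto the existing folder layout.
  'storage.location.template' = 's3://my-bucket/${dt}/'
);
```

With a table like this, a query such as SELECT * FROM some_table_by_date WHERE dt BETWEEN '2017-01-01' AND '2017-01-31' should read only the folders in that range, without any ALTER TABLE or MSCK REPAIR TABLE step.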

answered 2021-06-15T20:22:03.697
1

It's not necessary to do this manually. Set up a Glue crawler and it will pick up the folder (in the prefix) as a partition, provided all the folders in the path have the same structure and all the data has the same schema.

But it will name the partition partition_0. You can go into "Edit schema" and change the name of this partition to date or whatever you like.

Make sure you go into your Glue crawler and, under "Configuration options", select "Add new columns only". Otherwise the next crawler run will reset the partition name back to partition_0.

answered 2018-11-09T00:28:24.197
0

You need to name each S3 folder in the key=value form, with dt as the key (for example, dt=2017-01-01).

When setting up the table in Athena, specify dt as the partition column.

After that, run MSCK REPAIR TABLE <your table name>; on Athena
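
Those steps might look like the following sketch. The table name my_table, the single col1 column, and the folder layout s3://my-bucket/dt=2017-01-01/ are assumptions for illustration.

```sql
-- Folders are assumed to be laid out as s3://my-bucket/dt=<value>/
CREATE EXTERNAL TABLE my_table (
  col1 bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/';

-- Run as a separate query: scan the table location and
-- register every dt=<value> folder as a partition.
MSCK REPAIR TABLE my_table;

-- Run as a separate query: list the partitions that were found.
SHOW PARTITIONS my_table;
```

MSCK REPAIR TABLE has to be run again whenever new dt=<value> folders are added.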

answered 2020-04-16T01:54:08.567