hadoop - Hive: load HDFS files from multiple partitions into a table

Question

I have some files in HDFS that are partitioned on two levels, laid out as follows:

/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=1.0/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=1.0/data.parquet

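I would like to load these into a Hive table as elegantly as possible. I know the typical solution is to first load all the data into a non-partitioned table and then move everything into the final table using dynamic partitioning.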

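However, my files do not contain the datekey and coeff values in the actual data. The values appear only in the file path, because that is how the data is partitioned. So how would I keep track of them when loading the files into the intermediate table?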

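One workaround is to run a separate load data inpath query for each combination of coeff and datekey. That needs no intermediate table, but it would be cumbersome and probably not optimal.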

Is there a better way to do this?

score 1 · Accepted Answer

The typical solution is to build an external partitioned table on top of the HDFS directory:

create external table table_name (
    column1 datatype,
    column2 datatype,
    ...
    columnN datatype
)
-- partition columns are taken from the directory names, not from the Parquet files
partitioned by (datekey int,
                coeff float)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/datascience.db/simulations';

After that, recover all partitions. This command scans the table location and creates the partitions in the Hive metastore:

MSCK REPAIR TABLE table_name;
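
To check what the repair found, or to register a partition by hand if MSCK REPAIR TABLE is not an option, something like the following should work. This is a minimal sketch: table_name is the placeholder from the DDL above, and the partition values and path are taken from the first directory in the question.

```sql
-- List the partitions currently registered in the metastore for this table.
SHOW PARTITIONS table_name;

-- Alternative to MSCK: register a single partition explicitly.
-- IF NOT EXISTS makes the statement safe to re-run.
ALTER TABLE table_name ADD IF NOT EXISTS
PARTITION (datekey = 20210506, coeff = 0.5)
LOCATION '/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5';
```

Because the table is external, neither statement moves or copies the Parquet files. Only the metadata changes.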

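Now you can query the table columns together with the partition columns and do whatever you need with them: use the table as is, or load it into another table with insert .. select .., and so on: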

select 
    column1, 
    column2,
    ...
    columnN,
    --partition columns
    datekey,
    coeff
from table_name
where datekey = 20210506
;
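
If the data should end up in a separate managed table, the insert .. select .. route could look roughly like this. It is a sketch under assumptions: simulations_final is a hypothetical target name, and column1 and column2 with type double merely stand in for the real data columns, which the question does not list.

```sql
-- Hypothetical target table; replace the columns and types with the real ones.
CREATE TABLE IF NOT EXISTS simulations_final (
    column1 double,
    column2 double
)
PARTITIONED BY (datekey int, coeff float)
STORED AS PARQUET;

-- Allow Hive to derive every partition value from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns must come last in the select list,
-- in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE simulations_final PARTITION (datekey, coeff)
SELECT column1, column2, datekey, coeff
FROM table_name;
```

A single statement then writes all datekey/coeff combinations, so no per-partition load data inpath query is needed.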
