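How do I insert overwrite a partition in Hive only if the partition does not already exist?
Just as the title says. I'm working on something that always needs to rewrite a Hive table. The table has multiple partitions, and when I rerun my code after making changes, I only want to insert the new partitions without changing the existing ones.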
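You can left join the source against the list of existing partitions and add an IS NULL condition, so that only the rows that did not join are kept. You can also use NOT EXISTS (in Hive it generates the same plan as the left join). For example: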
insert overwrite table target_table partition (partition_key)
select col1, ... coln, s.partition_key
from source s
left join (select distinct partition_key --existing partitions
from target_table
) t on s.partition_key=t.partition_key
where t.partition_key is NULL; --partition does not exist in the target yet
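The NOT EXISTS form mentioned above could be sketched as follows. This is a minimal sketch rather than a tested query: col1 and col2 are placeholders for the elided column list, and it assumes that dynamic partitioning is allowed in nonstrict mode and that the Hive version supports correlated NOT EXISTS subqueries in the WHERE clause (0.13 or later).

```sql
-- Assumption: a fully dynamic partition spec needs these two settings.
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert overwrite table target_table partition (partition_key)
select
    s.col1,            -- placeholder column
    s.col2,            -- placeholder column
    s.partition_key    -- dynamic partition column goes last
from source s
where not exists (
    -- correlated subquery: does the target already hold this partition key?
    select t.partition_key
    from target_table t
    where t.partition_key = s.partition_key
);
```

Only partitions that receive rows are overwritten, so source rows whose partition key already exists in target_table are dropped by the filter and the existing partitions stay as they are.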
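One option is to left join the source dataset, using the partition columns as the join key, with the distinct partition column values from the target table, and then filter out the partitions the two have in common. You get the idea; your Hive query should look like this: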
insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
src.column1,
src.column2,
....,
src.columnN,
src.partition_column1,
src.partition_column2,
....,
src.partition_columnN
from
source src
left join
(
select distinct
partition_column1,
partition_column2,
....,
partition_columnN
from
target_table
)
tgt
on src.partition_column1 = tgt.partition_column1
and src.partition_column2 = tgt.partition_column2
....
and src.partition_columnN = tgt.partition_columnN
where
tgt.partition_column1 is null
or tgt.partition_column2 is null
....
or tgt.partition_columnN is null;
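Here is a simple demonstration of this logic:
Let's create two tables named orders and orders_source. The orders table will be the target table, and orders_source the source table. For simplicity, I used the same schema for both tables.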
CREATE TABLE `orders`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
CREATE TABLE `orders_source`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
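Since both tables share one schema, the second definition could also be derived from the first instead of being written out again. A sketch using Hive's CREATE TABLE ... LIKE, which copies the table definition without copying any data; treat it as an alternative to the second statement above, not something to run in addition to it:

```sql
-- Alternative to the explicit orders_source DDL: clone the definition of orders.
-- Only the definition is copied; no rows or partitions come along.
create table if not exists orders_source like orders;

-- Inspect the result to confirm the partition columns and storage settings.
describe formatted orders_source;
```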
Next, insert some sample records to test the logic:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table orders partition (state, order_date)
select
orde.id,
orde.customer_id,
orde.shipper_id,
orde.state,
orde.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10241 as id,
20481 as customer_id,
30721 as shipper_id,
'GA' as state,
'2019-09-01' as order_date
)
orde;
insert overwrite table orders_source partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10242 as id,
20482 as customer_id,
30722 as shipper_id,
'CA' as state,
'2019-09-02' as order_date
union all
select
10243 as id,
20483 as customer_id,
30723 as shipper_id,
'FL' as state,
'2019-09-02' as order_date
union all
select
10244 as id,
20484 as customer_id,
30724 as shipper_id,
'TX' as state,
'2019-09-02' as order_date
)
orso;
Now, before running the actual business logic, let's check the data we inserted into both tables:
hive (default)> select * from orders_source;
OK
orders_source.id orders_source.customer_id orders_source.shipper_id orders_source.state orders_source.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10241 20481 30721 GA 2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)
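Because the logic works at the partition level, it can also help to list which partitions each table holds at this point. A small sketch using SHOW PARTITIONS, which prints one line per partition in key=value/key=value form:

```sql
-- Partitions currently present in the target table.
show partitions orders;

-- Partitions present in the source table; any that are missing from the
-- list above are the ones the insert is expected to add.
show partitions orders_source;
```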
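Next, run the logic that picks the records from the source table that are to be inserted into the target table. You can preview them with the following query: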
hive (default)> select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
OK
orso.id orso.customer_id orso.shipper_id orso.state orso.order_date
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
10242 20482 30722 CA 2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)
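As the result above shows, only the three rows whose partitions are missing from orders are returned; the row in the existing CA / 2019-09-01 partition is filtered out.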
Finally, insert the records into the target table by issuing the following query:
insert overwrite table orders partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
Now, let's verify the data in the target table after the insert operation.
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10241 20481 30721 GA 2019-09-01
10244 20484 30724 TX 2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)
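To check that a rerun would leave the existing partitions alone, the same anti-join can be counted once more. A sketch, assuming the tables are in the state shown above: every partition in orders_source now exists in orders, so the count should come back as 0 and a second insert overwrite would have no rows to write.

```sql
select
    count(*) as rows_still_to_insert   -- alias is illustrative
from
    orders_source orso
left join
    (
        select distinct
            state,
            order_date
        from
            orders
    )
    orde
    on orso.state = orde.state
    and orso.order_date = orde.order_date
where
    -- one null check is enough here: an unmatched source row has
    -- every orde column set to null by the left join
    orde.state is null;
```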
That's it. You're all set!