
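How do I INSERT OVERWRITE partitions in Hive only if they do not already exist?

As the title says. I am working on something that always needs to rewrite a Hive table. I have tables with multiple partitions, and when I re-run my code after a change, I want to insert only the new partitions without changing the existing ones.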


2 Answers


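You can left join the source with the list of existing partitions and add an IS NULL condition, so that only the rows that did not join are kept. NOT EXISTS also works (in Hive it generates the same plan as the left join). The left join version looks like this: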

   insert overwrite table target_table partition (partition_key)
    select col1, ... coln, s.partition_key
      from source s 
           left join (select distinct partition_key --existing partitions
                       from target_table
                     ) t on s.partition_key=t.partition_key
     where t.partition_key is NULL; --keep only partitions that do not exist in the target
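
The NOT EXISTS variant mentioned above is not shown, so here is a minimal sketch of how it might look. The names col1 and col2 are hypothetical placeholders for the real non-partition columns, and the two set statements assume dynamic partitioning is not already enabled in your session.

```sql
-- Assumption: dynamic partitioning has to be switched on for this session.
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

-- col1 and col2 are hypothetical placeholders for the real columns.
insert overwrite table target_table partition (partition_key)
select s.col1, s.col2, s.partition_key
  from source s
 where not exists (select 1                 -- correlated subquery: true only
                     from target_table t    -- when the target has no row in
                    where t.partition_key = s.partition_key); -- this partition
```

Because the partitions are dynamic, only the partitions that actually receive rows are overwritten, which is why the existing ones stay untouched.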
Answered 2019-09-06T06:07:42.640

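One option is to left join the source dataset, using the partition columns as the join keys, with the distinct partition columns of the target table, and then filter out the partitions the two have in common. Your Hive query should look like this: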

insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
   src.column1,
   src.column2,
   ....,
   src.columnN,
   src.partition_column1,
   src.partition_column2,
   ....,
   src.partition_columnN
from
   source src 
   left join
      (
         select distinct
            partition_column1,
            partition_column2,
            ....,
            partition_columnN
         from
            target_table
      )
      tgt 
      on src.partition_column1 = tgt.partition_column1 
      and src.partition_column2 = tgt.partition_column2
      and ...
      and src.partition_columnN = tgt.partition_columnN 
where
   tgt.partition_column1 is null 
   or tgt.partition_column2 is null
   or ...
   or tgt.partition_columnN is null;

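A simple demonstration of this logic follows.

Let's create two tables named orders and orders_source. The orders table will be the target table and orders_source the source table. For simplicity, I used the same schema for both tables.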

CREATE TABLE `orders`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');

CREATE TABLE `orders_source`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');
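
Since both tables share one schema, the second definition could probably be written more briefly with CREATE TABLE ... LIKE, which copies the definition of an existing table. This is a sketch of an alternative to the second statement above, not something to run in addition to it, and it is worth checking the result because I have not verified which table properties are carried over on every Hive version.

```sql
-- Alternative to the explicit orders_source DDL: clone the definition of orders.
create table if not exists orders_source like orders;

-- Inspect the columns, partition keys, storage format and table properties
-- of the new table to confirm they match what you expect.
describe formatted orders_source;
```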

Next, insert some sample records to test the logic:

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert overwrite table orders partition (state, order_date) 
select
   orde.id,
   orde.customer_id,
   orde.shipper_id,
   orde.state,
   orde.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10241 as id,
         20481 as customer_id,
         30721 as shipper_id,
         'GA' as state,
         '2019-09-01' as order_date
   )
   orde;

insert overwrite table orders_source partition (state, order_date) 
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10242 as id,
         20482 as customer_id,
         30722 as shipper_id,
         'CA' as state,
         '2019-09-02' as order_date 
      union all
      select
         10243 as id,
         20483 as customer_id,
         30723 as shipper_id,
         'FL' as state,
         '2019-09-02' as order_date 
      union all
      select
         10244 as id,
         20484 as customer_id,
         30724 as shipper_id,
         'TX' as state,
         '2019-09-02' as order_date
   )
   orso;

Now, before running the actual business logic, let's check the data we inserted into both tables:

hive (default)> select * from orders_source;
OK
orders_source.id    orders_source.customer_id   orders_source.shipper_id    orders_source.state orders_source.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)

hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10241   20481   30721   GA  2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)

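Next, run the logic that selects from the source table the records to be inserted into the target table. You can preview them with the following query: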

hive (default)> select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;
OK
orso.id orso.customer_id    orso.shipper_id orso.state  orso.order_date
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
10242   20482   30722   CA  2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)

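As the result above shows, only the three rows whose (state, order_date) partition is missing from orders are selected. The CA / 2019-09-01 row is filtered out because that partition already exists in the target.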
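
If you would rather preview which partitions the insert is going to create, instead of the individual rows, the same anti-join can be grouped by the partition columns. This is only a sketch against the demo tables; the row_count alias is my own naming.

```sql
-- List each partition that exists in orders_source but not in orders,
-- together with the number of rows that would be written to it.
select
   orso.state,
   orso.order_date,
   count(*) as row_count
from
   orders_source orso
   left join
      (
         select distinct
            state,
            order_date
         from
            orders
      )
      orde
      on orso.state = orde.state
      and orso.order_date = orde.order_date
where
   orde.state is null   -- a matched row always has a non-null join key
group by
   orso.state,
   orso.order_date;
```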

Finally, insert the records into the target table by issuing the following query:

insert overwrite table orders partition (state, order_date)
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;

Now, let's verify the data in the target table after the insert operation:

hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10241   20481   30721   GA  2019-09-01
10244   20484   30724   TX  2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)
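
Another way to check the outcome is at the partition level rather than the row level. A small sketch, assuming the demo table above:

```sql
-- List every partition that now exists in the target table.
show partitions orders;

-- Optionally, count rows per partition to confirm that the pre-existing
-- partitions still hold the rows they had before the insert.
select state, order_date, count(*) as row_count
from orders
group by state, order_date;
```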

That's it. You're all set!

Answered 2019-09-06T07:54:25.960