This is the first table in Hive. It is the MAIN table against which the comparisons need to be made.
CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable1
(
BUYER_ID BIGINT,
ITEM_ID BIGINT,
CREATED_TIME STRING
)
This is the data in the first table above:
**BUYER_ID** | **ITEM_ID** | **CREATED_TIME**
--------------+------------------+-------------------------
1015826235 | 220003038067 | *2001-11-03 19:40:21*
1015826235 | 300003861266 | 2001-11-08 18:19:59
1015826235 | 140002997245 | 2003-08-22 09:23:17
1015826235 | *210002448035* | 2001-11-11 22:21:11
This is the second table in Hive. It also contains information about the items we are purchasing.
CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)
This is the data in the second table above (TestingTable2):
**USER_ID** | **PURCHASED_ITEM**
1015826235 | [{"product_id":220003038067,"timestamps":"1004941621"}, {"product_id":300003861266,"timestamps":"1005268799"}, {"product_id":140002997245,"timestamps":"1061569397"}, {"product_id":200002448035,"timestamps":"1005542471"}]
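The `timestamps` values in TestingTable2 are epoch seconds stored as strings, while `CREATED_TIME` in TestingTable1 is a formatted date string, so any comparison depends on how the date string is converted (Hive's `unix_timestamp` uses the default time zone). As a rough illustration in plain Python rather than HiveQL, the sketch below converts the sample `CREATED_TIME` values assuming a fixed UTC-7 offset. That offset is only inferred from the sample rows, and `to_epoch` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset, inferred from the sample rows only.
ASSUMED_TZ = timezone(timedelta(hours=-7))


def to_epoch(created_time: str) -> int:
    """Convert 'yyyy-MM-dd HH:mm:ss' text to epoch seconds under ASSUMED_TZ."""
    parsed = datetime.strptime(created_time, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=ASSUMED_TZ).timestamp())


# (CREATED_TIME from TestingTable1, timestamps string from TestingTable2),
# paired by position in the sample data.
samples = [
    ("2001-11-03 19:40:21", "1004941621"),
    ("2001-11-08 18:19:59", "1005268799"),
    ("2003-08-22 09:23:17", "1061569397"),
    ("2001-11-11 22:21:11", "1005542471"),
]

# The string column must be cast to a number before comparing.
matches = [to_epoch(created) == int(ts) for created, ts in samples]
for (created, ts), same in zip(samples, matches):
    print(created, ts, "match" if same else "MISMATCH")
```

Under that assumed offset, only the first sample pair disagrees, which is consistent with the value highlighted in the first table.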
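I want to compare TestingTable2 with TestingTable1 so that the following scenario is satisfied.
For a given BUYER_ID (USER_ID), find the PRODUCT_ID and TIMESTAMPS in TestingTable2 that do not match the ITEM_ID and CREATED_TIME in TestingTable1.
So, if you look at the TestingTable2 data, the (last) ITEM_ID 210002448035 from TestingTable1 does not match the TestingTable2 PRODUCT_ID 200002448035, and similarly for the timestamps. So I want to show the following result using a HiveQL query.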
**BUYER_ID** | **ITEM_ID** | **CREATED_TIME** | **PRODUCT_ID** | **TIMESTAMPS**
--------------+------------------+--------------------------------+------------------------+----------------------
1015826235 | *210002448035* | 2001-11-11 22:21:11 | 200002448035 | 1005542471
1015826235 | 220003038067 | *2001-11-03 19:40:21* | 220003038067 | 1004941621
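The rule behind this expected result can be modelled outside Hive to check it against the sample data. The following is a minimal Python sketch, not HiveQL. It pairs every TestingTable1 row with every `PURCHASED_ITEM` entry for the same buyer and keeps the pairs where exactly one of the two fields (item or timestamp) matches. The fixed UTC-7 offset and the names `to_epoch` and `mismatches` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-7 offset, inferred from the sample data only.
ASSUMED_TZ = timezone(timedelta(hours=-7))


def to_epoch(created_time):
    """Convert 'yyyy-MM-dd HH:mm:ss' text to epoch seconds under ASSUMED_TZ."""
    parsed = datetime.strptime(created_time, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=ASSUMED_TZ).timestamp())


# TestingTable1 rows: (BUYER_ID, ITEM_ID, CREATED_TIME)
table1 = [
    (1015826235, 220003038067, "2001-11-03 19:40:21"),
    (1015826235, 300003861266, "2001-11-08 18:19:59"),
    (1015826235, 140002997245, "2003-08-22 09:23:17"),
    (1015826235, 210002448035, "2001-11-11 22:21:11"),
]

# TestingTable2: USER_ID -> list of (PRODUCT_ID, TIMESTAMPS)
table2 = {
    1015826235: [
        (220003038067, "1004941621"),
        (300003861266, "1005268799"),
        (140002997245, "1061569397"),
        (200002448035, "1005542471"),
    ],
}


def mismatches(table1, table2):
    """Return pairs where the item or the timestamp matches, but not both."""
    rows = []
    for buyer_id, item_id, created_time in table1:
        for product_id, ts in table2.get(buyer_id, []):
            same_item = item_id == product_id
            same_time = to_epoch(created_time) == int(ts)
            if same_item != same_time:  # exactly one of the two matches
                rows.append((buyer_id, item_id, created_time, product_id, ts))
    return rows


result = mismatches(table1, table2)
for row in result:
    print(row)
```

Pairs in which neither field matches are treated as unrelated and dropped. That is a modelling choice in this sketch, not something stated in the question.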
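Can anyone help me with this? I am new to HiveQL, so I am running into a lot of problems.
Update:
I have written this query, but it is not working the way I want.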
select *
from (
    select *
    from (
        select user_id,
               prod_and_ts.product_id as product_id,
               prod_and_ts.timestamps as timestamps
        from testingtable2
        LATERAL VIEW explode(purchased_item) exploded_table as prod_and_ts
    ) prod_and_ts
    LEFT OUTER JOIN testingtable1
    ON (prod_and_ts.user_id = testingtable1.buyer_id
        AND testingtable1.item_id = prod_and_ts.product_id
        AND prod_and_ts.timestamps = UNIX_TIMESTAMP(testingtable1.created_time))
    where testingtable1.buyer_id IS NULL
) set_a
LEFT OUTER JOIN testingtable1
ON (set_a.user_id = testingtable1.buyer_id
    AND (set_a.product_id = testingtable1.item_id
         OR set_a.timestamps = UNIX_TIMESTAMP(testingtable1.created_time)));
Another update:
Following user1166147's comment, I wrote my query based on his. In Hive, I believe INNER JOIN is written simply as JOIN.
Here is my query:
select *
from (
    select t2.buyer_id, t2.item_id, t2.created_time as created_time, subq.user_id, subq.product_id, subq.timestamps as timestamps
    from (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
          from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq
    JOIN testingtable1 t2
      on t2.buyer_id = subq.user_id AND subq.timestamps = unix_timestamp(t2.created_time)
    WHERE (subq.product_id <> t2.item_id)
    union all
    select t2.buyer_id, t2.item_id as item_id, t2.created_time, subq.user_id, subq.product_id as product_id, subq.timestamps
    from (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
          from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq
    JOIN testingtable1 t2
      on t2.buyer_id = subq.user_id and subq.product_id = t2.item_id
    WHERE (subq.timestamps <> unix_timestamp(t2.created_time))
) unionall;
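After running the above query, I get zero results.
Final update:
My bad. I did not have accurate data in the tables, which is why I was not getting any results. Yes, the query above does work.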