
I am trying to JOIN lookup data from a user changelog-style table onto an events table by matching ID.

The tables look like this:

project_events

Schema

timestamp       TIMESTAMP
event_id        STRING  
user_id         STRING
data            STRING  

Sample data

| timestamp                   | event_id  | user_id | data             |
|-----------------------------|-----------|---------|------------------|
| 2020-08-22 17:01:18.807 UTC | hHZuTE8Y= | ABC123  | {"some":"json" } |
| 2020-08-20 16:57:28.022 UTC | tF5Gky8Q= | ZXY432  | {"foo":"item" }  |
| 2020-08-15 16:44:25.607 UTC | 1dOU8pOo= | ABC123  | {"bar":"val" }   |

users_changelog

Schema

timestamp       TIMESTAMP
event_id        STRING  
operation       STRING  
user_id         STRING
data            STRING  

Sample data

| timestamp                   | event_id  | operation | user_id | data                |
|-----------------------------|-----------|-----------|---------|---------------------|
| 2020-08-30 12:50:59.036 UTC | mGdNKy+o= | DELETE    | ABC123  | {"name":"removed" } |
| 2020-08-20 16:50:59.036 UTC | mGdNKy+o= | UPDATE    | ABC123  | {"name":"final" }   |
| 2020-08-05 20:45:36.936 UTC | mIICo9LY= | UPDATE    | ZXY432  | {"name":"asdf" }    |
| 2020-08-05 20:45:21.023 UTC | nEDKyCks= | UPDATE    | ABC123  | {"name":"other" }   |
| 2020-08-03 12:40:49.036 UTC | GxnbUqQ0= | CREATE    | ABC123  | {"name":"initial" } |
| 1970-01-01 00:00:00 UTC     | 1y+6fVWo= | IMPORT    | ZXY432  | {"name":"test" }    |

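Note: operation can be "CREATE", "UPDATE", "DELETE", or "IMPORT". Because a user can be updated many times, there can be multiple rows with the same user_id.

The goal is to show each event's event_id and data alongside the data from the most recent users_changelog operation for the matching user_id, that is, the latest changelog row at or before the event's timestamp. With the sample data, the expected result would be: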

| event_id  | event_data       | user_id | user_data         |
|-----------|------------------|---------|-------------------|
| hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |
| 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |

I tried the following, but it produces duplicate rows (one for every row in the changelog table with a matching ID):

SELECT
  events.event_id as event_id,
  events.data as event_data,
  users.user_id as user_id,
  users.data as user_data
FROM my_project.my_dataset.project_events as events
LEFT JOIN my_project.my_dataset.users_changelog as users
ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
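
To make the duplication concrete, here is a minimal sketch that runs the same join condition and counts how many changelog rows each event matches. It assumes SQLite (through Python's `sqlite3`) as a local stand-in for BigQuery, drops the project and dataset prefix from the table names, and stores timestamps as text without the `UTC` suffix; none of that comes from the original setup.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery so the sketch runs locally.
# Timestamps are stored as ISO-style text, which compares chronologically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_events (timestamp TEXT, event_id TEXT, user_id TEXT, data TEXT);
CREATE TABLE users_changelog (timestamp TEXT, event_id TEXT, operation TEXT,
                              user_id TEXT, data TEXT);
INSERT INTO project_events VALUES
  ('2020-08-22 17:01:18.807', 'hHZuTE8Y=', 'ABC123', '{"some":"json" }'),
  ('2020-08-20 16:57:28.022', 'tF5Gky8Q=', 'ZXY432', '{"foo":"item" }'),
  ('2020-08-15 16:44:25.607', '1dOU8pOo=', 'ABC123', '{"bar":"val" }');
INSERT INTO users_changelog VALUES
  ('2020-08-30 12:50:59.036', 'mGdNKy+o=', 'DELETE', 'ABC123', '{"name":"removed" }'),
  ('2020-08-20 16:50:59.036', 'mGdNKy+o=', 'UPDATE', 'ABC123', '{"name":"final" }'),
  ('2020-08-05 20:45:36.936', 'mIICo9LY=', 'UPDATE', 'ZXY432', '{"name":"asdf" }'),
  ('2020-08-05 20:45:21.023', 'nEDKyCks=', 'UPDATE', 'ABC123', '{"name":"other" }'),
  ('2020-08-03 12:40:49.036', 'GxnbUqQ0=', 'CREATE', 'ABC123', '{"name":"initial" }'),
  ('1970-01-01 00:00:00',     '1y+6fVWo=', 'IMPORT', 'ZXY432', '{"name":"test" }');
""")

# Same join condition as the attempted query, but counting the changelog rows
# that each event is paired with instead of selecting them.
match_counts = dict(conn.execute("""
SELECT events.event_id, COUNT(users.event_id) AS matches
FROM project_events AS events
LEFT JOIN users_changelog AS users
  ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
GROUP BY events.event_id, events.timestamp
ORDER BY events.timestamp DESC
""").fetchall())

print(match_counts)  # {'hHZuTE8Y=': 3, 'tF5Gky8Q=': 2, '1dOU8pOo=': 2}
```

Each event is paired with every earlier changelog row for its user, so the three events come back as seven rows. What is missing is a step that keeps only the newest match per event.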

2 Answers


Below is for BigQuery Standard SQL

#standardSQL
SELECT event_id, data AS event_data, user_id, 
  ( SELECT data
    FROM UNNEST(arr) rec
    WHERE rec.timestamp <= t.timestamp
    ORDER BY rec.timestamp DESC
    LIMIT 1
  ) AS user_data
FROM (
  SELECT
    ANY_VALUE(events).*,
    ARRAY_AGG(STRUCT(users.data, users.timestamp)) arr
  FROM `my_project.my_dataset.project_events` AS events
  LEFT JOIN `my_project.my_dataset.users_changelog` AS users
  ON events.user_id = users.user_id 
  GROUP BY FORMAT('%t', events)
) t    

Applied to the sample data from your question, the output is:

Row event_id        event_data          user_id     user_data    
1   hHZuTE8Y=       {"some":"json" }    ABC123      {"name":"final" }    
2   tF5Gky8Q=       {"foo":"item" }     ZXY432      {"name":"asdf" }     
3   1dOU8pOo=       {"bar":"val" }      ABC123      {"name":"other" }    
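
The same "newest changelog row at or before the event" lookup can also be sketched as a plain correlated scalar subquery, with no array aggregation. This is only an illustration and not the BigQuery query above: it assumes SQLite through Python's `sqlite3`, unprefixed table names, timestamps stored as text without the `UTC` suffix, and `<=` as in the question's join condition.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; timestamps are sortable text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_events (timestamp TEXT, event_id TEXT, user_id TEXT, data TEXT);
CREATE TABLE users_changelog (timestamp TEXT, event_id TEXT, operation TEXT,
                              user_id TEXT, data TEXT);
INSERT INTO project_events VALUES
  ('2020-08-22 17:01:18.807', 'hHZuTE8Y=', 'ABC123', '{"some":"json" }'),
  ('2020-08-20 16:57:28.022', 'tF5Gky8Q=', 'ZXY432', '{"foo":"item" }'),
  ('2020-08-15 16:44:25.607', '1dOU8pOo=', 'ABC123', '{"bar":"val" }');
INSERT INTO users_changelog VALUES
  ('2020-08-30 12:50:59.036', 'mGdNKy+o=', 'DELETE', 'ABC123', '{"name":"removed" }'),
  ('2020-08-20 16:50:59.036', 'mGdNKy+o=', 'UPDATE', 'ABC123', '{"name":"final" }'),
  ('2020-08-05 20:45:36.936', 'mIICo9LY=', 'UPDATE', 'ZXY432', '{"name":"asdf" }'),
  ('2020-08-05 20:45:21.023', 'nEDKyCks=', 'UPDATE', 'ABC123', '{"name":"other" }'),
  ('2020-08-03 12:40:49.036', 'GxnbUqQ0=', 'CREATE', 'ABC123', '{"name":"initial" }'),
  ('1970-01-01 00:00:00',     '1y+6fVWo=', 'IMPORT', 'ZXY432', '{"name":"test" }');
""")

# For every event, the subquery picks the single newest changelog row for the
# same user whose timestamp is at or before the event's timestamp.
result = conn.execute("""
SELECT e.event_id,
       e.data AS event_data,
       e.user_id,
       (SELECT u.data
        FROM users_changelog AS u
        WHERE u.user_id = e.user_id
          AND u.timestamp <= e.timestamp
        ORDER BY u.timestamp DESC
        LIMIT 1) AS user_data
FROM project_events AS e
ORDER BY e.timestamp DESC
""").fetchall()

for row in result:
    print(row)
```

Because the subquery returns at most one value per event, the result always has exactly one row per event, and `user_data` is NULL when no earlier changelog row exists.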
Answered 2020-09-03T21:34:35.783

Using SQL Server, I took the ROW_NUMBER() route to get your target output:

SELECT event_id,
       event_data,
       user_id,
       user_data
FROM (
      SELECT 
        events.event_id as event_id,
        events.data as event_data,
        users.user_id as user_id,
        users.data as user_data,
        ROW_NUMBER() OVER (PARTITION BY users.user_id, events.event_id ORDER BY users.timestamp desc) AS Count_by_User
      FROM #TEMP1 as events
      LEFT JOIN #TEMP2 as users
            ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
) as a 
WHERE Count_by_User = 1

Output:

event_id    event_data          user_id  user_data
1dOU8pOo=   {"bar":"val" }      ABC123  {"name":"other" }  
hHZuTE8Y=   {"some":"json" }    ABC123  {"name":"final" }  
tF5Gky8Q=   {"foo":"item" }     ZXY432  {"name":"asdf" }   
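
For anyone without a SQL Server instance, the ROW_NUMBER() idea can be tried locally with a rough sketch like the one below. It assumes SQLite 3.25 or newer (for window functions) through Python's `sqlite3`, and it adds one hypothetical event, `noHist01=` for user `NEW999`, which is not part of the question's data, to show what happens when an event has no changelog history.

```python
import sqlite3

# Assumption: SQLite (3.25+ for window functions) stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_events (timestamp TEXT, event_id TEXT, user_id TEXT, data TEXT);
CREATE TABLE users_changelog (timestamp TEXT, event_id TEXT, operation TEXT,
                              user_id TEXT, data TEXT);
INSERT INTO project_events VALUES
  ('2020-08-25 09:00:00.000', 'noHist01=', 'NEW999', '{"x":"y" }'),  -- hypothetical row
  ('2020-08-22 17:01:18.807', 'hHZuTE8Y=', 'ABC123', '{"some":"json" }'),
  ('2020-08-20 16:57:28.022', 'tF5Gky8Q=', 'ZXY432', '{"foo":"item" }'),
  ('2020-08-15 16:44:25.607', '1dOU8pOo=', 'ABC123', '{"bar":"val" }');
INSERT INTO users_changelog VALUES
  ('2020-08-30 12:50:59.036', 'mGdNKy+o=', 'DELETE', 'ABC123', '{"name":"removed" }'),
  ('2020-08-20 16:50:59.036', 'mGdNKy+o=', 'UPDATE', 'ABC123', '{"name":"final" }'),
  ('2020-08-05 20:45:36.936', 'mIICo9LY=', 'UPDATE', 'ZXY432', '{"name":"asdf" }'),
  ('2020-08-05 20:45:21.023', 'nEDKyCks=', 'UPDATE', 'ABC123', '{"name":"other" }'),
  ('2020-08-03 12:40:49.036', 'GxnbUqQ0=', 'CREATE', 'ABC123', '{"name":"initial" }'),
  ('1970-01-01 00:00:00',     '1y+6fVWo=', 'IMPORT', 'ZXY432', '{"name":"test" }');
""")

# Rank each event's matching changelog rows from newest to oldest, then keep
# only rank 1. user_id is taken from the events side so it stays populated
# even when the LEFT JOIN finds no changelog row.
latest = conn.execute("""
SELECT event_id, user_id, user_data
FROM (
  SELECT e.event_id,
         e.user_id,
         e.timestamp AS event_ts,
         u.data AS user_data,
         ROW_NUMBER() OVER (PARTITION BY e.event_id
                            ORDER BY u.timestamp DESC) AS rn
  FROM project_events AS e
  LEFT JOIN users_changelog AS u
    ON e.user_id = u.user_id AND u.timestamp <= e.timestamp
) AS ranked
WHERE rn = 1
ORDER BY event_ts DESC
""").fetchall()

for row in latest:
    print(row)
```

The event without history still appears once, with `user_data` as NULL, because the LEFT JOIN produces a single unmatched row that receives row number 1.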

Here is the code I used to generate the test tables (in case anyone else wants to verify):

create table #TEMP1
(timestamp  VARCHAR(max), event_id  VARCHAR(max), user_id VARCHAR(max) , data VARCHAR(max))
INSERT INTO #TEMP1 (timestamp, event_id, user_id , data)
VALUES
('2020-08-22 17:01:18.807 UTC' , 'hHZuTE8Y=' , 'ABC123'  , '{"some":"json" }' ),
('2020-08-20 16:57:28.022 UTC' , 'tF5Gky8Q=' , 'ZXY432'  , '{"foo":"item" } ' ),
('2020-08-15 16:44:25.607 UTC' , '1dOU8pOo=' , 'ABC123'  , '{"bar":"val" }  ' )


create table #TEMP2
(timestamp  VARCHAR(max), event_id  VARCHAR(max), operation VARCHAR(MAX), user_id VARCHAR(max) , data VARCHAR(max))

INSERT INTO #TEMP2 (timestamp, event_id, operation, user_id , data)
VALUES
('2020-08-30 12:50:59.036 UTC' , 'mGdNKy+o=' , 'DELETE'    , 'ABC123'  , '{"name":"removed" }'),
('2020-08-20 16:50:59.036 UTC' , 'mGdNKy+o=' , 'UPDATE'    , 'ABC123'  , '{"name":"final" }  '),
('2020-08-05 20:45:36.936 UTC' , 'mIICo9LY=' , 'UPDATE'    , 'ZXY432'  , '{"name":"asdf" }   '),
('2020-08-05 20:45:21.023 UTC' , 'nEDKyCks=' , 'UPDATE'    , 'ABC123'  , '{"name":"other" }  '),
('2020-08-03 12:40:49.036 UTC' , 'GxnbUqQ0=' , 'CREATE'    , 'ABC123'  , '{"name":"initial" }'),
('1970-01-01 00:00:00 UTC'     , '1y+6fVWo=' , 'IMPORT'    , 'ZXY432'  , '{"name":"test" }   ')
Answered 2020-09-03T21:59:47.650