I want to store millions of time series, where each point in time of every time series is labeled with an arbitrary set of tags. It appears I should use a JSON array of tags in Snowflake:

CREATE TABLE timeseries (obj_id INT, ts DATE, tags VARIANT, val INT)
INSERT INTO timeseries (obj_id, ts, tags, val) VALUES (442243, '2017-01-01', parse_json('["red", "small", "cheap"]'), 1)
INSERT INTO timeseries (obj_id, ts, tags, val) VALUES (673124, '2017-01-01', parse_json('["red", "small", "expensive"]'), 2)
INSERT INTO timeseries (obj_id, ts, tags, val) VALUES (773235, '2017-01-01', parse_json('["black", "small", "cheap"]'), 3)

Now I want to see an average of all time-series labeled with "small" AND "cheap", e.g.

SELECT ts, AVG(val)
FROM timeseries
WHERE "small" IN tags AND "cheap" IN tags
GROUP BY ts

which would return:

ts, avg(val)
2017-01-01, 2

What is the right Snowflake syntax/schema/approach to achieve this? Note that I do NOT want to use FLATTEN and explode the rows; I just want to filter out all the rows that are not both 'cheap' and 'small'.

1 Answer

You can use the ARRAY type directly instead of JSON, for example:

CREATE TABLE ts2 (obj_id INT, ts DATE, tags ARRAY, val INT);
INSERT INTO ts2 (obj_id, ts, tags, val) select 442243, '2017-01-01', ARRAY_CONSTRUCT('red', 'small', 'cheap'), 1;
INSERT INTO ts2 (obj_id, ts, tags, val) select 673124, '2017-01-01', ARRAY_CONSTRUCT('red', 'small', 'expensive'), 2;
INSERT INTO ts2 (obj_id, ts, tags, val) select 773235, '2017-01-01', ARRAY_CONSTRUCT('black', 'small', 'cheap'), 3;

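The VALUES clause cannot use functions such as ARRAY_CONSTRUCT, but INSERT ... SELECT works. (You can also do this with JSON and the VARIANT type, but then you need to label the values with key names and use PARSE_JSON in the insert.)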

Then, to return only the rows that contain both of the tags you chose, use a query like this:

select 
  obj_id,
  tags
from ts2
where ARRAY_CONTAINS('small'::variant, tags)
  and ARRAY_CONTAINS('cheap'::variant, tags)
; 
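
The same filter can be combined with the aggregation from the question. The following is a minimal sketch, assuming the ts2 table and the three rows inserted above; the avg_val alias is only an illustrative name:

```sql
-- Average val per date, keeping only rows whose tags array
-- contains both 'small' and 'cheap'. No FLATTEN is needed,
-- so each source row is evaluated once and never exploded.
select
  ts,
  avg(val) as avg_val
from ts2
where ARRAY_CONTAINS('small'::variant, tags)
  and ARRAY_CONTAINS('cheap'::variant, tags)
group by ts
;
```

With the sample rows, only objects 442243 (val 1) and 773235 (val 3) carry both tags, so this should return a single row for 2017-01-01 with an average of 2, matching the result the question asks for.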
Answered 2018-01-04T21:42:42.457