I want to store millions of time series, where each point in time of every time series is labeled with an arbitrary set of tags. It appears I should store the tags as a JSON array in a VARIANT column in Snowflake:
CREATE TABLE timeseries (obj_id INT, ts DATE, tags VARIANT, val INT);
-- PARSE_JSON is not accepted inside a VALUES clause, so each row is inserted via SELECT
INSERT INTO timeseries (obj_id, ts, tags, val) SELECT 442243, '2017-01-01', parse_json('["red", "small", "cheap"]'), 1;
INSERT INTO timeseries (obj_id, ts, tags, val) SELECT 673124, '2017-01-01', parse_json('["red", "small", "expensive"]'), 2;
INSERT INTO timeseries (obj_id, ts, tags, val) SELECT 773235, '2017-01-01', parse_json('["black", "small", "cheap"]'), 3;
Now I want to compute, per timestamp, the average val over all time series labeled with both "small" AND "cheap", e.g. with something like this pseudo-query:
SELECT ts, AVG(val)
FROM timeseries
WHERE "small" IN tags AND "cheap" IN tags
GROUP BY ts
which would return:
ts, avg(val)
2017-01-01, 2
What is the right Snowflake syntax/schema/approach to achieve this? Note that I do NOT want to use FLATTEN and explode the rows; I just want to filter out all rows that are not tagged with both 'cheap' and 'small'.
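For concreteness, here is a sketch of the kind of filter being asked for, using Snowflake's ARRAY_CONTAINS function. It assumes ARRAY_CONTAINS accepts the VARIANT column directly when it holds a JSON array; if it does not, casting with tags::ARRAY may be needed. The alias avg_val is only illustrative.

```sql
-- Keep only rows whose tags array contains both values; no FLATTEN involved.
-- Assumption: tags (VARIANT holding a JSON array) is accepted as the array argument.
SELECT ts, AVG(val) AS avg_val
FROM timeseries
WHERE ARRAY_CONTAINS('small'::VARIANT, tags)
  AND ARRAY_CONTAINS('cheap'::VARIANT, tags)
GROUP BY ts;
```

The search value is cast to VARIANT because ARRAY_CONTAINS compares VARIANT elements, so a plain string literal would not match the function's signature.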