
I'm investigating the feasibility of using BigQuery to store sensor data in time series. The intent is to store the data in BQ and process it in Pandas... so far so good... Pandas can interpret a TIMESTAMP field index and create a Series.

An additional requirement is that the data support arbitrary tags as key/value pairs (e.g. job_id=1234, task_id=5678). BigQuery can support this nicely with REPEATED fields of type RECORD:

                   {'fields':
                       [
                           {
                               "mode": "NULLABLE",
                               "name": "timestamp",
                               "type": "TIMESTAMP"
                           },
                           {
                               "mode": "REPEATED",
                               "name": "tag",
                               "type": "RECORD",
                               "fields":
                               [
                                    {
                                        "name":"name",
                                        "type":"STRING"
                                    },
                                    {
                                        "name":"value",
                                        "type":"STRING"
                                    }
                               ]
                           },
                           {
                               "mode": "NULLABLE",
                               "name": "measurement_1",
                               "type": "FLOAT"
                           },
                           {
                               "mode": "NULLABLE",
                               "name": "measurement_2",
                               "type": "FLOAT"
                           },
                           {
                               "mode": "NULLABLE",
                               "name": "measurement_3",
                               "type": "FLOAT"
                           }
                       ]
                   }
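
To make the shape concrete, here is a rough sketch of what one row could look like under this kind of schema. It assumes the measurement fields sit at the top level of the row (which is what the queries in this thread imply), and every value in it is made up for illustration. The helper names `to_ndjson` and `tags_as_dict` are hypothetical, not part of any BigQuery client.

```python
import json

# Hypothetical sample row; all values are invented for illustration.
row = {
    "timestamp": "2016-09-24 01:44:57",
    "tag": [
        {"name": "job_id", "value": "1234"},
        {"name": "task_id", "value": "5678"},
    ],
    "measurement_1": 0.5,
    "measurement_2": 1.5,
    "measurement_3": 2.5,
}


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)


def tags_as_dict(r):
    """Collapse the repeated tag records into a plain dict for lookups."""
    return {t["name"]: t["value"] for t in r["tag"]}
```

Each tag is its own element of the repeated `tag` field, so a row can carry any number of key/value pairs without a schema change.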

This works great for storing the data, and it even works great for querying if I only need to filter on a single key/value combination:

SELECT measurement_1 FROM measurements 
WHERE tag.name = 'job_id' AND tag.value = '1234'

However, I also need to be able to combine sets of tags in query expressions, and I can't seem to make this work. For example, this query returns no results:

SELECT measurement_1 FROM measurements 
WHERE tag.name = 'job_id' AND tag.value = '1234'
      AND tag.name = 'task_id' AND tag.value = '5678'

Questions: Is it possible to formulate a query to do what I want using this schema? What is the recommended way to attach this type of variable data to an otherwise fixed schema in BigQuery?

Thanks for any help or suggestions!

Note: If you're thinking this looks like a great fit for InfluxDB, it's because that's what I've been using thus far. The seemingly insurmountable issue is the amount of series cardinality in my data set, so I'm looking for alternatives.


2 Answers


BigQuery Legacy SQL

SELECT measurement_1 FROM measurements 
OMIT RECORD IF
  SUM((tag.name = 'job_id' AND tag.value = '1234')
   OR (tag.name = 'task_id' AND tag.value = '5678')) < 2

BigQuery Standard SQL

SELECT measurement_1 FROM measurements 
WHERE (
  SELECT COUNT(1) FROM UNNEST(tag) 
  WHERE ((name = 'job_id' AND value = '1234')
      OR (name = 'task_id' AND value = '5678'))
) >= 2
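
The counting idea behind both queries can be sketched in plain Python over in-memory rows. This is only an illustration of the logic, not BigQuery code: the sample rows and the helper names `count_matching_tags` and `filter_rows` are made up, and the measurement is assumed to sit at the top level of each row. It also shows why the original `AND` query comes back empty: no single tag element can be both `job_id` and `task_id`.

```python
def count_matching_tags(row, wanted):
    """Count tag elements in one row whose (name, value) pair is in `wanted`."""
    return sum((t["name"], t["value"]) in wanted for t in row["tag"])


def filter_rows(rows, wanted):
    """Keep rows in which at least len(wanted) tag elements match."""
    return [r for r in rows if count_matching_tags(r, wanted) >= len(wanted)]


def tags(job_id, task_id):
    """Build the repeated tag list for a hypothetical sample row."""
    return [{"name": "job_id", "value": job_id},
            {"name": "task_id", "value": task_id}]


# Hypothetical sample rows.
rows = [
    {"tag": tags("1234", "5678"), "measurement_1": 1.0},
    {"tag": tags("1234", "9999"), "measurement_1": 2.0},
    {"tag": tags("0000", "5678"), "measurement_1": 3.0},
]

wanted = {("job_id", "1234"), ("task_id", "5678")}
selected = [r["measurement_1"] for r in filter_rows(rows, wanted)]

# The naive version tests all four conditions against the same tag element,
# which can never succeed.
naive = [
    r["measurement_1"] for r in rows
    if any(t["name"] == "job_id" and t["value"] == "1234"
           and t["name"] == "task_id" and t["value"] == "5678"
           for t in r["tag"])
]
```

Note that the count-based check assumes a row does not repeat the same name/value pair; a duplicated tag would inflate the count and could let a row through with only one of the two tags present.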
Answered 2016-09-24T01:44:57.220

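REPEATED fields are a good way to store data series, collections, and the like.
To pull a single value of interest out of a repeated field, I would use the following template: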

SELECT 
    MAX( IF( filter criteria,  value_to_pull, null)) WITHIN RECORD AS some_name
FROM <table>

In your case, it would look like this:

SELECT
  MAX(IF(tag.name = 'job_id' AND tag.value = '1234', measurement_1, NULL)) WITHIN RECORD AS job_1234_measurement_1,
  MAX(IF(tag.name = 'task_id' AND tag.value = '5678', measurement_1, NULL)) WITHIN RECORD AS task_5678_measurement_1
FROM measurements
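
As a rough illustration of what this template computes per record, here is a plain-Python sketch. It is not BigQuery code: `max_if_within_record` is a hypothetical helper, the sample rows are invented, and the measurement is assumed to be a top-level field of the row.

```python
def max_if_within_record(row, name, value, field):
    """Mimic MAX(IF(cond, field, NULL)) WITHIN RECORD for a single row.

    Returns row[field] when some tag element matches (name, value),
    and None (the NULL case) otherwise.
    """
    hits = [
        row[field]
        for t in row["tag"]
        if t["name"] == name and t["value"] == value and row[field] is not None
    ]
    return max(hits) if hits else None


# Hypothetical sample rows.
rows = [
    {"tag": [{"name": "job_id", "value": "1234"},
             {"name": "task_id", "value": "5678"}],
     "measurement_1": 1.0},
    {"tag": [{"name": "job_id", "value": "1234"},
             {"name": "task_id", "value": "9999"}],
     "measurement_1": 2.0},
]

# One output row per input row, with one column per tag condition.
pivoted = [
    {
        "job_1234_measurement_1":
            max_if_within_record(r, "job_id", "1234", "measurement_1"),
        "task_5678_measurement_1":
            max_if_within_record(r, "task_id", "5678", "measurement_1"),
    }
    for r in rows
]
```

Unlike a filter, this keeps every row and leaves a NULL in the columns whose tag is absent, so rows carrying both tags are the ones where both columns are non-NULL.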
Answered 2016-09-25T04:15:17.503