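We are planning to store time-series sensor data in Cassandra. Each sensor can report multiple data points at each sample time point, and I would like to store all the data points for a given device together.
One idea I had is to create a column for every type of data we might ever collect: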
CREATE TABLE ddata (
deviceID int,
day timestamp,
timepoint timestamp,
aparentPower int,
actualPower int,
actualEnergy int,
temperature float,
humidity float,
ppmCO2 int,
etc, etc, etc...
PRIMARY KEY ((deviceID,day),timepoint)
) WITH
clustering order by (timepoint DESC);
insert into ddata (deviceID,day,timepoint,temperature,humidity) values (1000001,'2013-09-02','2013-09-02 00:00:04',93,97.3);
deviceid | day | timepoint | actualenergy | actualpower | aparentpower | event | humidity | ppmco2 | temperature
----------+--------------------------+--------------------------+--------------+-------------+--------------+-------+----------+--------+-------------
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 | null | null | null | null | 97.3 | null | 93
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 | null | null | null | null | null | null | 92
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 | null | null | null | null | null | null | 91
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 | null | null | null | null | null | null | 90
Another idea is to use a map collection that holds whichever data points a given device happens to report:
CREATE TABLE ddata (
deviceID int,
day timestamp,
timepoint timestamp,
feeds map<text,int>,
PRIMARY KEY ((deviceID,day),timepoint)
) WITH
clustering order by (timepoint DESC);
insert into ddata (deviceID,day,timepoint,feeds) values (1000001,'2013-09-01','2013-09-01 00:00:04',{'temp':73,'humidity':99});
deviceid | day | timepoint | event | feeds
----------+--------------------------+--------------------------+------------+----------------------------------------------------------
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 | null | {'humidity': 97, 'temp': 93}
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 | null | {'temp': 92}
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 | null | {'temp': 91}
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 | null | {'temp': 90}
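Both layouts share the same primary key, so I would expect to read them back the same way, one device-day partition at a time. A sketch of the kind of query I have in mind, assuming that access pattern (the LIMIT value is arbitrary):

```cql
-- Read one (deviceID, day) partition.
-- Rows come back newest first because of CLUSTERING ORDER BY (timepoint DESC).
SELECT * FROM ddata
WHERE deviceID = 1000001 AND day = '2013-09-02'
LIMIT 4;
```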
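What are people's thoughts on these two options?
- From what I can see, the first option allows better typing of the different data types (int vs. float), but makes the table a bit ugly.
- Would performance be better if I avoided collection types?
- Is there anything to worry about in continually adding extra columns as new sensor data types are added?
- What other factors should I consider?
- What other data modeling ideas do people have for this scenario?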
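To make the question about adding columns concrete: with the first option, each new sensor data type would mean a schema change along these lines. The column name `pressure` is made up for illustration and is not one of our actual feeds.

```cql
-- Hypothetical new sensor type; `pressure` is an illustrative name only.
ALTER TABLE ddata ADD pressure float;
```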
Thanks, Chris