indexing - How to efficiently perform range queries on two columns in Cassandra?

Question

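I want to save millions of locations into a Cassandra ColumnFamily and then run range queries on that data.

For example:

Attributes: LocationName, latitude, longitude

Query: SELECT LocationName FROM ColumnFamily WHERE latitude > 10 AND latitude < 20 AND longitude > 30 AND longitude < 40;

What structure and indexes should I use so that the query is efficient?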

score 0 · Accepted Answer

Let's pretend you are going to grow into the billions (I will cover the millions case below). If you were using something like PlayOrm on Cassandra (or you can do this yourself instead of using PlayOrm), you would need to partition by something. Let's say you choose to partition by longitude, so that anything >= 20 and < 30 is in partition 20 and anything >= 30 and < 40 is in partition 30. Then, in PlayOrm, you use its scalable SQL to write the same query you wrote, but you need to query the proper partitions, which in some cases means multiple partitions unless you limit your result set size.

In PlayOrm, or in your own data model, it would look like this (no other tables needed):

Location {
  key: [locationID] {
    LonBottom: [partitionKey]
    Lat: [exact_lat] <- @NoSqlIndexed
    Lon: [exact_lon] <- @NoSqlIndexed
    ...
  }
  ...
}
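As a rough illustration of the partitioning rule described above, here is a minimal Python sketch that maps a longitude to its partition and lists the partitions a range query has to touch. The 10-degree width comes from the 20/30 example; the function names are hypothetical and are not part of PlayOrm or any Cassandra driver.

```python
import math
from typing import List

# Assumption: each partition spans 10 degrees of longitude, as in the
# "partition 20" / "partition 30" example above.
PARTITION_WIDTH = 10


def partition_key(lon: float) -> int:
    """Return the lower bound of the partition that contains lon."""
    return math.floor(lon / PARTITION_WIDTH) * PARTITION_WIDTH


def partitions_for_range(lon_min: float, lon_max: float) -> List[int]:
    """Partitions touched by the query lon_min < longitude < lon_max."""
    first = partition_key(lon_min)
    last = partition_key(lon_max)
    # The upper bound is exclusive: if lon_max sits exactly on a partition
    # boundary, that last partition cannot hold any matching row.
    if lon_max == last and last > first:
        last -= PARTITION_WIDTH
    return list(range(first, last + PARTITION_WIDTH, PARTITION_WIDTH))


print(partitions_for_range(30, 40))  # [30]
print(partitions_for_range(25, 45))  # [20, 30, 40]
```

The query from the question (longitude between 30 and 40) stays inside a single partition, while a wider range has to fan out over several.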

That said, if you are in the millions, you would not need partitions, so just remove the LonBottom column above and do no partitioning. Of course, at that scale it is worth asking why you would use NoSQL at all: millions of rows is not that big, and an RDBMS can easily handle it.

If you want to do it yourself, in the millions case there are two rows, one for Lat and one for Lon (the wide row pattern), that hold the indexed values of lat and lon to query. For the billions case, it would be two rows per partition, since each partition gets its own index so that no single index grows too large.

An indexing row is simple to create. It is simply rowkey = "index name", and each column name is a compound name made of the longitude and the row key of the location. There is NO value for each column, just a compound name (so that each column name is unique).

So your row might look like:

longindex = 32.rowkey1, 32.rowkey45, 32.rowkey56, 33.rowkey87, 33.rowkey89

where 32 and 33 are longitudes and the rowkeys are pointing to the locations.
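To make the index row concrete, here is a small in-memory Python sketch. It models the wide row as a sorted list of (longitude, row key) compound names and answers a range query with a slice, which is roughly what a column slice on the index row would do. This is only an illustration under those assumptions; the names are hypothetical and no Cassandra client is involved.

```python
import bisect

# Assumption: the "longindex" wide row is modelled as a sorted list of
# compound column names (longitude, location_row_key) with no values.
long_index = []


def index_location(index, lon, row_key):
    """Insert a compound column name, keeping the row sorted."""
    bisect.insort(index, (lon, row_key))


def slice_longitude(index, lon_min, lon_max):
    """Return the row keys of locations with lon_min < longitude < lon_max."""
    lons = [lon for lon, _ in index]
    start = bisect.bisect_right(lons, lon_min)  # first entry with lon > lon_min
    stop = bisect.bisect_left(lons, lon_max)    # first entry with lon >= lon_max
    return [row_key for _, row_key in index[start:stop]]


for lon, key in [(33, "rowkey87"), (32, "rowkey1"), (32, "rowkey56"),
                 (33, "rowkey89"), (32, "rowkey45")]:
    index_location(long_index, lon, key)

print(slice_longitude(long_index, 31, 33))  # ['rowkey1', 'rowkey45', 'rowkey56']
```

The same structure would be kept for latitude, and the row keys returned by the two slices can then be intersected to answer a query on both columns.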

score 0 · Accepted Answer

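Depending on the granularity you need in your queries (and how variable that granularity is), one way to handle this is to divide your map into a grid, where every location falls within a grid square with defined lat/lon boundaries. You can then do an initial query for the grid square IDs, followed by a query for the locations within those squares, with a representation like this: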

GridSquareLat {
  key: [very_coarse_lat_value] {
    [square_lat_boundary]:[GridSquareIDList]
    [square_lat_boundary]:[GridSquareIDList]
  }
  ...
}

GridSquareLon {
  key: [very_coarse_lon_value] {
    [square_lon_boundary]:[GridSquareIDList]
    [square_lon_boundary]:[GridSquareIDList]
  }
  ...
}

Location {
  key: [locationID] {
    GridSquareID: [GridSquareID]  <-- put a secondary index on this col
    Lat: [exact_lat]
    Lon: [exact_lon]
    ...
  }
  ...
}

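You can then give Cassandra the GridSquareLat/Lon keys that represent the very coarse lat/lon values, along with a column slice range that reduces the returned columns to only those squares within your boundaries. You will get two lists, one of grid square IDs for lat and one for lon. The intersection of these lists is the set of grid squares in your range.

To get the locations in these squares, query the Location CF, filtering on GridSquareID (using the secondary index, which will be efficient as long as your total number of squares is reasonable). You now have a reasonably sized list of locations from only a few very efficient queries, and you can easily reduce it to the exact list in your application.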
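The whole flow can be sketched in memory with plain Python dictionaries standing in for the three column families. The 1-degree square size, the 10-degree coarse row span, the "lat:lon" square ID format, and all function names are assumptions made for this illustration; a real implementation would issue column slices and a secondary-index query against Cassandra instead.

```python
import math
from collections import defaultdict

CELL = 1     # grid square size in degrees (assumed)
COARSE = 10  # degrees covered by one GridSquareLat/Lon row (assumed)

# coarse value -> {square boundary -> set of grid square IDs}
grid_square_lat = defaultdict(lambda: defaultdict(set))
grid_square_lon = defaultdict(lambda: defaultdict(set))
# Stand-in for the secondary index on Location.GridSquareID
locations_by_square = defaultdict(list)


def add_location(name, lat, lon):
    lat_b = math.floor(lat / CELL) * CELL
    lon_b = math.floor(lon / CELL) * CELL
    square_id = f"{lat_b}:{lon_b}"  # hypothetical ID format
    grid_square_lat[math.floor(lat_b / COARSE) * COARSE][lat_b].add(square_id)
    grid_square_lon[math.floor(lon_b / COARSE) * COARSE][lon_b].add(square_id)
    locations_by_square[square_id].append((name, lat, lon))


def squares_in(index, lo, hi):
    """IDs of squares overlapping (lo, hi), reading only the coarse rows needed."""
    ids = set()
    coarse = math.floor(lo / COARSE) * COARSE
    while coarse < hi:
        for boundary, squares in index.get(coarse, {}).items():
            if boundary + CELL > lo and boundary < hi:  # the "column slice"
                ids |= squares
        coarse += COARSE
    return ids


def query(lat_min, lat_max, lon_min, lon_max):
    # Intersect the two ID lists, then filter exact coordinates in the app.
    candidates = (squares_in(grid_square_lat, lat_min, lat_max)
                  & squares_in(grid_square_lon, lon_min, lon_max))
    return sorted(name
                  for sq in candidates
                  for name, lat, lon in locations_by_square[sq]
                  if lat_min < lat < lat_max and lon_min < lon < lon_max)


add_location("A", 15.5, 35.2)
add_location("B", 15.5, 45.0)  # longitude out of range
add_location("C", 25.0, 35.0)  # latitude out of range
add_location("D", 19.9, 39.9)
add_location("E", 10.0, 35.0)  # on the boundary, excluded by the strict >
print(query(10, 20, 30, 40))   # ['A', 'D']
```

Location E shows why the final in-application filter is still needed: its square overlaps the range, but its exact latitude does not satisfy the strict inequality.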
