python - Cassandra query - Cannot execute this query as it might involve data filtering and thus may have unpredictable performance

Question

I have the following Cassandra model:

class Automobile(Model):
    manufacturer = columns.Text(primary_key=True)
    year = columns.Integer(index=True)
    model = columns.Text(index=True)
    price = columns.Decimal(index=True)

I need the following queries:

q = Automobile.objects.filter(manufacturer='Tesla')
q = Automobile.objects.filter(year='something')
q = Automobile.objects.filter(model='something')
q = Automobile.objects.filter(price='something')

These all work fine until I want to filter on multiple columns, i.e. when I try

q = Automobile.objects.filter(manufacturer='Tesla',year='2013')

it throws an error saying: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.

I rewrote the query with allow_filtering(), but that is not an optimal solution.

Then, after reading some more, I edited my model as follows:

class Automobile(Model):
    manufacturer = columns.Text(primary_key=True)
    year = columns.Integer(primary_key=True)
    model = columns.Text(primary_key=True)
    price = columns.Decimal()

With this I am also able to filter on multiple columns, without any warning.

When I run DESCRIBE TABLE automobile, it shows that this creates the composite key PRIMARY KEY ((manufacturer), year, model).

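So my question is: what if I declare every attribute as a primary key? Is there any problem with that, given that it also lets me filter on multiple columns?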

This is just a small model. What if I had a model such as:

class UserProfile(Model):
    id = columns.UUID(primary_key=True, default=uuid.uuid4)
    model = columns.Text()
    msisdn = columns.Text(index=True)
    gender = columns.Text(index=True)
    imei1 = columns.Set(columns.Text)
    circle = columns.Text(index=True)
    epoch = columns.DateTime(index=True)
    cellid = columns.Text(index=True)
    lacid = columns.Text(index=True)
    mcc = columns.Text(index=True)
    mnc = columns.Text(index=True)
    installed_apps = columns.Set(columns.Text)
    otp = columns.Text(index=True)
    regtype = columns.Text(index=True)
    ctype = columns.Text(index=True)
    operator = columns.Text(index=True)
    dob = columns.DateTime(index=True)
    jsonver = columns.Text(index=True)

Is there any problem if I declare every attribute as part of the PK?

score 13 · Accepted Answer

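To understand this, you need to understand how Cassandra stores data. The first key in the primary key is called the partition key. It defines the partition the row belongs to. All rows in a partition are stored together and replicated together. Inside a partition, rows are stored in the order of the clustering keys. These are the columns in the PK that are not the partition key. So if your PK is (a, b, c, d), a defines the partition. Within a particular partition (say, a = a1), the rows are stored sorted by b. For each b, the rows are stored sorted by c, and so on. When querying, you hit one partition (or a few), and then need to specify each successive clustering key up to the one you are looking for. These must be exact equalities, except for the last clustering column specified in your query, which may be a range query.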

In the previous example, you could thus do:

where a = a1 and b > b1
where a = a1 and b = b1 and c > c1
where a = a1 and b = b1 and c = c1 and d > d1

but you cannot do this:

where a=a1 and c=c1

For that, you would need ALLOW FILTERING (realistically, you should look at changing your model or denormalizing at that point).
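As a rough illustration of the rule above, here is a minimal Python sketch that decides whether a WHERE clause can be served without filtering. The function name needs_filtering and the "eq"/"range" labels are made up for this example. It is a simplified model only, not Cassandra's actual query planner: it assumes a single-column partition key and ignores secondary indexes, IN restrictions and token() queries.

```python
def needs_filtering(partition_key, clustering_keys, restrictions):
    """Return True if the WHERE clause would need ALLOW FILTERING.

    restrictions maps a column name to "eq" or "range".
    Simplified model: single-column partition key, no secondary indexes.
    """
    # The partition key has to be pinned by an equality.
    if restrictions.get(partition_key) != "eq":
        return True

    remaining = set(restrictions) - {partition_key}
    for column in clustering_keys:
        if not remaining:
            break  # every restriction has been matched in key order
        kind = restrictions.get(column)
        if kind is None:
            # A clustering column was skipped while a later column is restricted.
            return True
        remaining.discard(column)
        if kind == "range" and remaining:
            # Only the last restricted clustering column may be a range.
            return True

    # Anything still left is a column outside the primary key.
    return bool(remaining)


clustering = ["b", "c", "d"]
print(needs_filtering("a", clustering, {"a": "eq", "b": "range"}))
print(needs_filtering("a", clustering, {"a": "eq", "c": "eq"}))
```

The first call corresponds to where a = a1 and b > b1, the second to where a = a1 and c = c1, which skips b.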

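Now, on to your question about making every column part of the PK. You could do that, but remember that all writes in Cassandra are upserts. Rows are identified by their primary key. If you make every column part of the PK, you will not be able to edit a row. You are not allowed to update the value of any column that is in the primary key.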
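To make the upsert point concrete, here is a toy in-memory sketch in Python. The Table class and the prices are invented for illustration and nothing here talks to Cassandra. It only mimics the idea that a row is identified by its primary key, so a write with an existing key overwrites, while a write with a new key adds a row.

```python
class Table:
    """Toy in-memory table: rows are keyed by their primary key tuple."""

    def __init__(self, primary_key):
        self.primary_key = primary_key
        self.rows = {}

    def write(self, **values):
        # Every write is an upsert: insert if the key is new, overwrite otherwise.
        key = tuple(values[name] for name in self.primary_key)
        self.rows.setdefault(key, {}).update(values)


# price is a regular column: the second write edits the existing row.
cars = Table(primary_key=("manufacturer", "year", "model"))
cars.write(manufacturer="Tesla", year=2013, model="S", price=70000)
cars.write(manufacturer="Tesla", year=2013, model="S", price=65000)

# price is part of the PK: the second write has a different key,
# so it creates a second row and the old one is still there.
all_pk = Table(primary_key=("manufacturer", "year", "model", "price"))
all_pk.write(manufacturer="Tesla", year=2013, model="S", price=70000)
all_pk.write(manufacturer="Tesla", year=2013, model="S", price=65000)

print(len(cars.rows), len(all_pk.rows))
```

In the second table, "changing the price" would really mean deleting the old row and inserting a new one.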

score 4 · Accepted Answer

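The right way to solve this is to take a query-based modeling approach. Instead of one table with three secondary indexes, you should solve this with four (maybe three) tables and zero secondary indexes.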

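Your original table Automobile is probably fine, although I am curious what your PRIMARY KEY definition is. To solve your query Automobile.objects.filter(year='something'), I would create an additional query table like this (note: defined in CQL):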

CREATE TABLE automobileByYear (
  manufacturer text,
  year bigint,
  model text,
  price decimal,
  PRIMARY KEY ((year),manufacturer,model));

Assuming that you also create a corresponding class for this model on the Python side (AutomobileByYear), you could then serve queries like this:

AutomobileByYear.objects.filter(year='2013')

Additionally, having manufacturer as your first clustering key would allow this query to work as well:

AutomobileByYear.objects.filter(manufacturer='Tesla',year='2013')

Likewise, to solve your query by model, I would create an additional query table (automobileByModel), with the PRIMARY KEY definition reordered like this:

PRIMARY KEY ((model),manufacturer,year));

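The order of your clustering keys (manufacturer and year) can vary depending on your query requirements, but the point is that model should be your partition key in this case.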
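As a rough picture of what query-based modeling means on the application side, here is a toy Python sketch that uses plain dicts in place of the query tables. The AutomobileStore class, its method names and the sample rows are all hypothetical. A real DAO would issue one write per Cassandra table instead.

```python
from collections import defaultdict


class AutomobileStore:
    """Toy stand-in for a DAO that keeps several query tables in sync."""

    def __init__(self):
        # One dict per query table, keyed by that table's partition key.
        self.by_manufacturer = defaultdict(list)
        self.by_year = defaultdict(list)
        self.by_model = defaultdict(list)

    def add(self, manufacturer, year, model, price):
        row = {"manufacturer": manufacturer, "year": year,
               "model": model, "price": price}
        # Denormalization: the same row is written to every query table.
        self.by_manufacturer[manufacturer].append(row)
        self.by_year[year].append(row)
        self.by_model[model].append(row)

    def find_by_year(self, year, manufacturer=None):
        # Look up one partition, then narrow by the first clustering key.
        # (Cassandra would do this with a sorted slice, not a linear scan.)
        rows = self.by_year.get(year, [])
        if manufacturer is not None:
            rows = [r for r in rows if r["manufacturer"] == manufacturer]
        return rows


store = AutomobileStore()
store.add("Tesla", 2013, "S", 70000)       # made-up sample data
store.add("Ford", 2013, "Focus", 20000)    # made-up sample data
print(store.find_by_year(2013, manufacturer="Tesla"))
```

Each read touches exactly one partition of one table. The cost is that every write has to go to all of them.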

EDIT

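...but this way I would have to design my tables around my queries, and so end up with a lot of data redundancy. Suppose I have the same automobile model with N fields, say N=10, and I want to filter by each of the N fields. Should I create a different model for each type of filter query?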

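In this day and age, disk is much cheaper than it used to be. That being said, I understand that it isn't always easy to just throw more disk at a problem. The bigger problem I see is adjusting the DAO layer of your application to keep 10 tables in sync.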

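In this case, I would suggest integrating with a search tool, such as Elastic or Solr. In fact, the enterprise version of Cassandra integrates with Solr out of the box. If you really need to run queries on 10 or more columns, a robust search tool would complement your Cassandra cluster nicely.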
