distributed-computing - Why can't direct routing be used for distributed data with a secondary index?

Question

I'm reading the following article: Elements of Scale: Composing and Scaling Data platforms

I'm stuck on understanding the following sentences:

A secondary index is an index that isn’t on the primary key. This means the data will not be partitioned by the values in the index. Directed routing via a hash function is no longer an option. We have to broadcast requests to all machines.

Can anyone explain why this is the case? I am a beginner with data platforms, but I have followed and understood the article up to this point.

Specifically, why can't we look up values in the secondary index for their primary key, then look up their location via a hash function on that primary key? Why broadcast requests to all machines?

Thank you for your time

score 1 · Accepted Answer

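In the example they give, the data is already distributed across 4 nodes. Each node has a secondary index, but only for the values stored on that node. No secondary index holds all the records from all the nodes, so a caller that wants to search has to send the request to every node.

For example, with just 2 nodes:

Node 1 has (1,a) (2,a) (3,b)

Node 2 has (100,a) (105,c)

Node 1 has a primary index of 1,2,3 and a secondary index of a,a,b.

Node 2 has a primary index of 100,105 and a secondary index of a,c.

A caller that wants to search for "c" needs to broadcast to both nodes in order to search both secondary indexes.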
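The two-node example above can be sketched in a few lines of Python. This is only an illustration under assumed names (`Node`, `find_by_value`, and `search` are hypothetical, not from the article): each node indexes only its own rows, so the caller has no way to know which node holds "c" and has to ask all of them.

```python
from collections import defaultdict


class Node:
    """One partition: rows keyed by primary key, plus a local secondary index."""

    def __init__(self, rows):
        self.rows = dict(rows)
        # Local secondary index: value -> primary keys stored on THIS node only.
        self.local_index = defaultdict(list)
        for pk, value in self.rows.items():
            self.local_index[value].append(pk)

    def find_by_value(self, value):
        return [(pk, value) for pk in self.local_index.get(value, [])]


nodes = [
    Node([(1, "a"), (2, "a"), (3, "b")]),   # node 1
    Node([(100, "a"), (105, "c")]),         # node 2
]


def search(value):
    """Scatter-gather: broadcast the lookup to every node and merge the results."""
    results = []
    for node in nodes:
        results.extend(node.find_by_value(value))
    return results


print(search("c"))  # [(105, 'c')]
print(search("a"))  # [(1, 'a'), (2, 'a'), (100, 'a')]
```

Note that the search for "a" returns rows from both nodes, which is why skipping any node could miss results.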

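However, if you maintained a complete copy of the secondary index a,a,a,b,c somewhere, you could query that index and then go directly to the target node. But this is much more complicated in practice than you might think.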
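As a rough sketch of that alternative, assume a single complete index that maps each value to the nodes and primary keys that hold it (the names `global_index` and `search` are made up for illustration). A lookup then contacts only the nodes that actually hold matching rows:

```python
# Data nodes from the example above: node name -> {primary key: value}.
nodes = {
    "node1": {1: "a", 2: "a", 3: "b"},
    "node2": {100: "a", 105: "c"},
}

# Hypothetical complete index kept "somewhere": value -> [(node name, primary key)].
global_index = {}
for name, rows in nodes.items():
    for pk, value in rows.items():
        global_index.setdefault(value, []).append((name, pk))


def search(value):
    """Consult the complete index first, then read only from the nodes it names."""
    contacted = set()
    results = []
    for name, pk in global_index.get(value, []):
        contacted.add(name)
        results.append((pk, nodes[name][pk]))
    return results, contacted


print(search("c"))  # ([(105, 'c')], {'node2'})
```

The sketch ignores the hard part: keeping `global_index` correct while the data nodes are being written to.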

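Edit, June 22. When you try to maintain the secondary index on a third node, you need to consider the following complications.

Insert/edit operations now involve 2 or even 3 nodes, so you need to implement some kind of two-phase commit protocol or an alternative.
As more nodes become involved, you may find that overall reliability goes down because the MTBF decreases.
You need to consider what happens during a network partition.
Maintenance operations can be harder. For example, how do you efficiently verify that the index is correct without generating excessive network traffic?
How will updates edit the index node? Is the client responsible for this, or does the primary storage node update the index node?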
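To make the first complication concrete, here is a deliberately naive sketch in which an update writes to the data node and then to a separate index node with no commit protocol. The `update` function and its `crash_between_steps` flag are hypothetical and exist only to simulate a failure between the two writes:

```python
data_node = {105: "c"}      # primary storage, keyed by primary key
index_node = {"c": {105}}   # separate index node: value -> set of primary keys


def update(pk, new_value, crash_between_steps=False):
    """Naive two-step update with no two-phase commit or other coordination."""
    old_value = data_node[pk]
    data_node[pk] = new_value                        # step 1: write the data node
    if crash_between_steps:
        return                                       # simulated failure before step 2
    index_node[old_value].discard(pk)                # step 2: fix the index node
    index_node.setdefault(new_value, set()).add(pk)


update(105, "d", crash_between_steps=True)

# The index still claims that row 105 has value "c", but the data node disagrees.
stale = [pk for pk in index_node.get("c", ()) if data_node[pk] != "c"]
print(stale)  # [105]
```

A two-phase commit or a similar scheme is meant to prevent exactly this kind of divergence, at the cost of extra coordination between nodes.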

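A good place to learn more is to review the CAP theorem, study two-phase commit schemes, and perhaps look at some of the IEEE papers published in distributed systems journals.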

score 0 · Accepted Answer

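Taking Cassandra as an example, data is written to the replicas on the nodes determined by a hash of the partition key (defined in the table schema, usually the first part of the primary key).

A secondary index is on data that is not in that partition key. Assuming the index is written to the same node that holds the original data, when you query the secondary index you cannot determine which node holds the rows for a particular indexed value by hashing that new "key", because the index entry lives on the node chosen by the original partition key (the primary data).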
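The following sketch illustrates the point with a simplified hash-modulo placement. It is an assumption for illustration only: Cassandra itself uses a token ring with its own partitioner rather than MD5 modulo the node count, and `NODES` and `node_for` are hypothetical names.

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]


def node_for(partition_key):
    """Simplified directed routing: hash the partition key and pick a node."""
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


rows = [(1, "a"), (2, "a"), (3, "b"), (100, "a"), (105, "c")]

# Each row, together with its local index entry, lives where its PRIMARY key hashes.
placement = {pk: node_for(pk) for pk, _ in rows}

# Hashing the indexed value yields a single node, but that node has no special
# relationship to the rows containing "a"; those are wherever their keys landed.
guess = node_for("a")
holders_of_a = {placement[pk] for pk, value in rows if value == "a"}

print("hash of the value 'a' points at:", guess)
print("rows with value 'a' actually live on:", sorted(holders_of_a))
```

A lookup by primary key can call `node_for(pk)` and go straight to one node. A lookup by value has nothing equivalent to hash, so it has to be broadcast.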
