FYI: we are running this test with Cassandra 2.1.12.1047 | DSE 4.8.4.
We have a simple table in Cassandra that contains 5,000 rows of data. A while back, as a precaution, we added monitoring on every Cassandra instance to ensure that it holds 5,000 rows, because our replication factor enforces this: we have 2 replicas in each region and 6 servers in total in our dev cluster.
CREATE KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2'} AND durable_writes = true;
We recently force-terminated a server to simulate a failure and brought a new server online to see what would happen. We also removed the old node with nodetool removenode so that, within each region, we expect all of the data to exist on every server.
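For reference, the removal amounted to something like the following, run from any live node (the host ID placeholder stands for the dead node's ID as reported by nodetool status):
$ nodetool status my_keyspace                 # note the Host ID of the dead (DN) node
$ nodetool removenode <host-id-of-dead-node>  # remove the dead node; its token ranges are reassigned to the remaining nodes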
Once the new server came online it joined the cluster and appeared to start replicating data. We assumed that, because it was in joining mode, it would take responsibility for making sure it obtained the data it needed from the cluster. After about an hour the CPU finally dropped, so we assumed the replication was complete.
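In hindsight, a more direct way to confirm that streaming had actually finished (rather than inferring it from CPU) would presumably have been something like:
$ nodetool netstats          # lists any active bootstrap/streaming sessions
$ nodetool compactionstats   # shows compactions still pending after streaming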
However, our monitors, which deliberately query each server with LOCAL_ONE, reported that all of the servers had 5,000 rows, while the new server that had just come online was stuck at around 2,600 rows. We assumed it might still be replicating, so we left it for a while, but it stayed at that number.
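The monitor's check is essentially a count(*) at LOCAL_ONE against each node in turn, something like this (keyspace and table names as used elsewhere in this post; the address is a placeholder):
$ cqlsh <node-address>
cqlsh> CONSISTENCY LOCAL_ONE;
cqlsh> select count(*) from my_keyspace.health_check_data_consistency;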
So we ran nodetool status to check, and got the following:
$ nodetool status my_keyspace
Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.255.17.28 7.9 GB 256 100.0% a0c45f3f-8479-4046-b3c0-b2dd19f07b87 ap-southeast-1a
UN 54.255.64.1 8.2 GB 256 100.0% b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 176.34.207.151 8.51 GB 256 100.0% 30ff8d00-1ab6-4538-9c67-a49e9ad34672 eu-west-1b
UN 54.195.174.72 8.4 GB 256 100.0% f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7 eu-west-1c
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.225.11.249 8.17 GB 256 100.0% 0e0adf3d-4666-4aa4-ada7-4716e7c49ace us-east-1e
UN 54.224.182.94 3.66 GB 256 100.0% 1f9c6bef-e479-49e8-a1ea-b1d0d68257c7 us-east-1d
So if a server reports that it owns 100% of the data, why is a LOCAL_ONE query only giving us roughly half of it?
When I then ran a LOCAL_QUORUM query it returned 5,000 rows, and from that point on even LOCAL_ONE queries returned 5,000.
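Concretely, what I ran on the new node was roughly the following, and both counts came back as 5,000 once the first LOCAL_QUORUM read had been made:
cqlsh> CONSISTENCY LOCAL_QUORUM;
cqlsh> select count(*) from health_check_data_consistency;
cqlsh> CONSISTENCY LOCAL_ONE;
cqlsh> select count(*) from health_check_data_consistency;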
While LOCAL_QUORUM solved the problem in this instance, we may in future need to run other types of query on the assumption that every server a) has the data it is supposed to have, and b) knows how to satisfy a query when it does not have the data, i.e. it knows the data lives elsewhere on the ring.
Further update 24 hours later - the problem is much worse
So, in the absence of any feedback on this issue, I experimented on the cluster by adding more nodes. Following https://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html, I went through all of the recommended steps for adding nodes to a cluster, in effect adding capacity. My understanding of the premise of Cassandra is that when you add nodes it is the cluster's responsibility to rebalance the data and, while that is happening, to fetch data from wherever it currently sits on the ring if it is not yet where it should be.
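For the record, per that document the procedure was roughly the following (service name and paths depend on the install; we are on DSE 4.8.4):
# On each new node, before starting it, set cluster_name, listen_address,
# the snitch and the seed list in cassandra.yaml to match the existing
# cluster (auto_bootstrap defaults to true, so it is left alone).
$ sudo service dse start          # node joins as UJ and streams its data
$ nodetool status my_keyspace     # wait for every new node to show as UN
# Then, on each node that existed before the expansion:
$ nodetool cleanup                # drop data for ranges the node no longer owns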
Unfortunately, that is not at all what happened. Here is my new ring:
Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.255.xxx.xxx 8.06 GB 256 50.8% a0c45f3f-8479-4046-b3c0-b2dd19f07b87 ap-southeast-1a
UN 54.254.xxx.xxx 2.04 MB 256 49.2% e2e2fa97-80a0-4768-a2aa-2b63e2ab1577 ap-southeast-1a
UN 54.169.xxx.xxx 1.88 MB 256 47.4% bcfc2ff0-67ab-4e6e-9b18-77b87f6b3df3 ap-southeast-1b
UN 54.255.xxx.xxx 8.29 GB 256 52.6% b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.78.xxx.xxx 8.3 GB 256 49.9% 30ff8d00-1ab6-4538-9c67-a49e9ad34672 eu-west-1b
UN 54.195.xxx.xxx 8.54 GB 256 50.7% f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7 eu-west-1c
UN 54.194.xxx.xxx 5.3 MB 256 49.3% 3789e2cc-032d-4b26-bff9-b2ee71ee41a0 eu-west-1c
UN 54.229.xxx.xxx 5.2 MB 256 50.1% 34811c15-de8f-4b12-98e7-0b4721e7ddfa eu-west-1b
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.152.xxx.xxx 5.27 MB 256 47.4% a562226a-c9f2-474f-9b86-46c3d2d3b212 us-east-1d
UN 54.225.xxx.xxx 8.32 GB 256 50.3% 0e0adf3d-4666-4aa4-ada7-4716e7c49ace us-east-1e
UN 52.91.xxx.xxx 5.28 MB 256 49.7% 524320ba-b8be-494a-a9ce-c44c90555c51 us-east-1e
UN 54.224.xxx.xxx 3.85 GB 256 52.6% 1f9c6bef-e479-49e8-a1ea-b1d0d68257c7 us-east-1d
As you can see, I have doubled the size of the ring and effective ownership is roughly 50% per server (my replication factor is 2 replicas per region). Regrettably, though, you can see that some servers carry no load at all (they are the new ones), while others carry an excessive load (they are the old ones, and clearly no redistribution of data has taken place).
Now that in itself is not what worries me, as I believe in the power of Cassandra and its ability to eventually get the data into the right place. What worries me enormously is that my table, which has exactly 5,000 rows, no longer has 5,000 rows in any of my three regions.
# From ap-southeast-1
cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
3891
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4633
# From eu-west-1
cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
1975
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4209
# From us-east-1
cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4435
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4870
So seriously, what is going on here? Let's recap:
- My replication factor is 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2', so every region should be able to satisfy a query in full.
- Introducing new instances should not cause me to lose data, yet evidently we have, even when using LOCAL_QUORUM.
- Each region has a different view of the data, yet I have not introduced any new data, only new servers that bootstrapped automatically.
So then I thought, why not run a QUORUM query across the entire multi-region cluster (with a total replication factor of 6, that needs 4 replicas to respond and therefore has to span regions). Unfortunately, that failed completely:
cqlsh> CONSISTENCY QUORUM;
Consistency level set to QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
OperationTimedOut: errors={}, last_host=172.17.0.2
I then turned TRACING ON; and that failed too. I can see the following in the logs:
INFO [SlabPoolCleaner] 2016-03-03 19:16:16,616 ColumnFamilyStore.java:1197 - Flushing largest CFS(Keyspace='system_traces', ColumnFamily='events') to free up room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.02/0.02
INFO [SlabPoolCleaner] 2016-03-03 19:16:16,617 ColumnFamilyStore.java:905 - Enqueuing flush of events: 5624218 (2%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1126] 2016-03-03 19:16:16,617 Memtable.java:347 - Writing Memtable-events@732346653(1.102MiB serialized bytes, 25630 ops, 2%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1126] 2016-03-03 19:16:16,821 Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system_traces/events/system_traces-events-tmp-ka-3-Data.db (298.327KiB) for commitlog position ReplayPosition(segmentId=1456854950580, position=28100666)
INFO [ScheduledTasks:1] 2016-03-03 19:16:21,210 MessagingService.java:929 - _TRACE messages were dropped in last 5000 ms: 212 for internal timeout and 0 for cross node timeout
This happens on every server I run the query from.
Checking the cluster, everything appears to be in sync:
$ nodetool describecluster;
Cluster Information:
Name: Ably
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
51e57d47-8870-31ca-a2cd-3d854e449687: [54.78.xxx.xxx, 54.152.xxx.xxx, 54.254.xxx.xxx, 54.255.xxx.xxx, 54.195.xxx.xxx, 54.194.xxx.xxx, 54.225.xxx.xxx, 52.91.xxx.xxx, 54.229.xxx.xxx, 54.169.xxx.xxx, 54.224.xxx.xxx, 54.255.xxx.xxx]
Further update 1 hour later
It has been suggested that perhaps this is simply down to range queries not behaving as expected. So I wrote a simple script that queries each of the 5k rows individually (their IDs run from 1 to 5,000). Unfortunately the results are what I feared: I am missing data. I have tried this with LOCAL_ONE, LOCAL_QUORUM and even QUORUM.
ruby> (1..5000).each { |id| puts "#{id} missing" if session.execute("select id from health_check_data_consistency where id = #{id}", consistency: :local_quorum).length == 0 }
19 missing, 61 missing, 84 missing, 153 missing, 157 missing, 178 missing, 248 missing, 258 missing, 323 missing, 354 missing, 385 missing, 516 missing, 538 missing, 676 missing, 708 missing, 727 missing, 731 missing, 761 missing, 863 missing, 956 missing, 1006 missing, 1102 missing, 1121 missing, 1161 missing, 1369 missing, 1407 missing, 1412 missing, 1500 missing, 1529 missing, 1597 missing, 1861 missing, 1907 missing, 2005 missing, 2168 missing, 2207 missing, 2210 missing, 2275 missing, 2281 missing, 2379 missing, 2410 missing, 2469 missing, 2672 missing, 2726 missing, 2757 missing, 2815 missing, 2877 missing, 2967 missing, 3049 missing, 3070 missing, 3123 missing, 3161 missing, 3235 missing, 3343 missing, 3529 missing, 3533 missing, 3830 missing, 4016 missing, 4030 missing, 4084 missing, 4118 missing, 4217 missing, 4225 missing, 4260 missing, 4292 missing, 4313 missing, 4337 missing, 4399 missing, 4596 missing, 4632 missing, 4709 missing, 4786 missing, 4886 missing, 4934 missing, 4938 missing, 4942 missing, 5000 missing
As you can see from the above, roughly 1.5% of my data is simply no longer available.
So I am stumped. I really could use some advice here, because I was certainly under the impression that Cassandra was specifically designed for scaling out horizontally on demand. Any help greatly appreciated.