hadoop - DSE 4 Analytics nodes ~ Do they have data? Should they have data?

Question

We have been wondering why one of our clusters shows the Analytics nodes as owning data. For readability, I have edited the IPs, tokens, and host IDs.

% nodetool status

Datacenter: Cassandra
=====================
Status=Up/Down|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns   Host ID      Token         Rack
UN  172.32.x.x  46.83 GB   18.5%  someguid     0             rack1
UN  172.32.x.x  60.26 GB   33.3%  anotherguid  ranbignumber  rack1
UN  172.32.x.x  63.51 GB   14.8%  anothergui   ranbignumber  rack1
Datacenter: Analytics
=====================
Status=Up/Down|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns   Host ID   Token          Rack
UN  172.32.x.x  28.91 GB   0.0%   someguid  100            rack1
UN  172.32.x.a  30.41 GB   33.3%  someguid  ranbignumber   rack1
UN  172.32.x.x  17.46 GB   0.0%   someguid  ranbignumber   rack1

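So does the Analytics node with IP 172.32.x.a actually own data? If so, do we need to back it up? And would decommissioning the node move the data back to the appropriate nodes?

This is the node I am referring to in the nodetool status output above, in the Analytics datacenter: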

UN  172.32.x.a  30.41 GB   33.3%  someguid  ranbignumber   rack1

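The questions again (updated with the answers provided below).

Do we need to back up this node? Answer: Yes.
Should this node have data? Answer: Yes, otherwise analytics performance will suffer.
If it should not have data, would nodetool decommission move the data back to the other nodes? Answer: No, the replication strategy drives this.

Here is the update: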

% nodetool status our_important_keyspace

Datacenter: Cassandra
=====================
Status Address     Load       Owns (effective)  
UN     2           63.16 GB   81.5%             
UN     1           47.21 GB   33.3%             
UN     3           59.87 GB   85.2%
Datacenter: Analytics
=====================
Status Address     Load       Owns (effective)
UN     3           17.74 GB   33.3%  
UN     2           30.62 GB   33.3%
UN     1           29.21 GB   33.3%

We are backing up Analytics today. Great answer, it probably saved us a lot of pain.

score 2 · Accepted Answer

The first thing you need to do is run nodetool status or dsetool ring with the keyspace that your data is stored in. This will show you the ownership as dictated by the replication strategy of that keyspace. What you are looking at now is most likely the ownership as set by the raw token values. If your keyspace were named "important_data", you would run "nodetool status important_data".
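As a rough illustration of reading the keyspace-level output, the sketch below sums the "Owns (effective)" column per datacenter. It assumes that effective ownership within a datacenter adds up to roughly 100% per replica kept there; the sample rows are copied from the question's update, and `ownership_by_datacenter` is a hypothetical helper, not part of nodetool.

```python
from collections import defaultdict

# Rows copied from the question's `nodetool status our_important_keyspace` output.
SAMPLE = """\
Datacenter: Cassandra
=====================
Status Address     Load       Owns (effective)
UN     2           63.16 GB   81.5%
UN     1           47.21 GB   33.3%
UN     3           59.87 GB   85.2%
Datacenter: Analytics
=====================
Status Address     Load       Owns (effective)
UN     3           17.74 GB   33.3%
UN     2           30.62 GB   33.3%
UN     1           29.21 GB   33.3%
"""


def ownership_by_datacenter(status_text):
    """Sum the ownership percentage per datacenter (hypothetical helper)."""
    totals = defaultdict(float)
    datacenter = None
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
            continue
        fields = line.split()
        # Node rows start with a two-letter status/state code such as UN or DN.
        is_node_row = bool(fields) and len(fields[0]) == 2 and fields[0][0] in "UD"
        if datacenter and is_node_row:
            # Take the first field that looks like a percentage, if any.
            percent = next((f for f in fields if f.endswith("%")), None)
            if percent is not None:
                totals[datacenter] += float(percent.rstrip("%"))
    return dict(totals)


for name, total in ownership_by_datacenter(SAMPLE).items():
    # Assumption: each full replica in a datacenter contributes about 100%.
    print(f"{name}: {total:.1f}% total, about {round(total / 100)} replica(s)")
```

For the output in the question, the Cassandra datacenter sums to about 200% and Analytics to about 100%, which would be consistent with two replicas in Cassandra and one in Analytics.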

The replication strategy on your keyspace is key to determining which nodes are responsible for data in your cluster. In any case, a multi-DC cluster should be using NetworkTopologyStrategy, which lets you specify how many replicas of your data should live in each datacenter. For example, if you wanted to make sure the data was replicated twice in the Cassandra datacenter but only once in the Analytics datacenter, you would use NetworkTopologyStrategy options like {'Cassandra': 2, 'Analytics': 1}. This would mean that every piece of data is stored 3 times cluster-wide. If you really wanted the data not to be copied to the analytics nodes (this would be detrimental to analytics performance), you could set 'Analytics': 0 or omit that entry altogether.
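To make the replication options concrete, here is a minimal sketch that builds the corresponding CQL statement and counts replicas cluster-wide. The keyspace name "important_data" and the datacenter names come from this thread; `replication_cql` and `total_replicas` are hypothetical helpers, and the generated statement would still need to be run through cqlsh or a driver.

```python
def cql_literal(value):
    """Quote strings for CQL; leave numbers bare."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def replication_cql(keyspace, replicas_per_dc):
    """Build an ALTER KEYSPACE statement using NetworkTopologyStrategy."""
    options = {"class": "NetworkTopologyStrategy"}
    # A datacenter with zero replicas can simply be left out of the map.
    options.update({dc: n for dc, n in replicas_per_dc.items() if n > 0})
    body = ", ".join(
        f"{cql_literal(key)}: {cql_literal(value)}" for key, value in options.items()
    )
    return f"ALTER KEYSPACE {keyspace} WITH replication = {{{body}}};"


def total_replicas(replicas_per_dc):
    """Number of copies of each piece of data across all datacenters."""
    return sum(replicas_per_dc.values())


layout = {"Cassandra": 2, "Analytics": 1}
print(replication_cql("important_data", layout))
print("copies cluster-wide:", total_replicas(layout))
```

The datacenter names in the map are assumed to match the names shown by nodetool status exactly, including case.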

Your backup strategy should always back up at least one full replica of the data, but it is most likely easiest to just back up every node, or at least every node in one datacenter (as you could bootstrap the others from it).

The node will only have data if you want it to, via the replication strategy. In that case you will need to decommission it when removing it, as you would with any other node in the cluster. Most users do find it useful to have replicas in their analytics datacenters because this allows for faster access when using various analytics tools.
