
I have a two-machine cluster running Cassandra 1.2.6. I am using a keyspace with a replication factor of 2, but my application requires me to write to both replicas in parallel while also letting Cassandra perform its own replication, in the hope that Cassandra does not duplicate the key/value on the replica nodes.

For example:

  • I have nodes Node1 and Node2. I have a keyspace configured with replication factor 2, and a column family to which I push key/value pairs.
  • I use a python client (pycassa) to write to the cluster.
  • A key, "KeyX", hashes to Node1 and Node2. (I find out which servers a key hashes to through the nodetool command: `$nodetool getendpoints KeyspaceName ColumnFamilyName KeyHexString`.)
  • I use a client to write (KeyX, Value) concurrently to the nodes Node1 and Node2. (In the connection pool I give only the specific server name)
  • When writing, I wait for one write to succeed (to the master). (Consistency level ONE)
  • Now, I monitor through the `$nodetool status` command the amount of disk space that the cluster uses.
  • I write around 100 keys each having 2MB values.
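The outcome the setup above is hoping for can be sketched with plain Python dicts standing in for the two replicas (node names and sizes are illustrative, not Cassandra internals): because Cassandra upserts by key, a direct client write and the internal replication of the same (key, value) should land as a single copy per replica, not two.

```python
# Toy model: each replica is a key -> value store, as in Cassandra's data model.
# Writing the same key twice (once by the client, once via replication)
# overwrites in place rather than duplicating.
node1, node2 = {}, {}
replicas = [node1, node2]

def client_write(key, value):
    """The client writes directly to every replica the key hashes to."""
    for node in replicas:
        node[key] = value

def internal_replication(key, value):
    """Cassandra's own replication delivers the same column again."""
    for node in replicas:
        node[key] = value  # upsert: same key, so no extra copy

value = b"x" * (2 * 1024 * 1024)       # one 2MB value
for i in range(100):                   # 100 keys, as in the test
    key = f"Key{i}"
    client_write(key, value)
    internal_replication(key, value)

total_mb = sum(len(v) for node in replicas for v in node.values()) // 2**20
print(total_mb)                        # 400: one copy per replica
```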

Ideally this should store around 400MB on disk, with some overhead for storing keys that should be marginal compared to the value sizes I am using.

Observations:

  • If I do not write to all the nodes that the key hashes to, Cassandra internally handles replication and the data size is around 400MB. (200MB on each node for 100 keys with 2MB value)
  • If I do write to all the nodes the key hashes to, Cassandra writes more than the expected amount of data to disk — as much as 15% more. In our tests Cassandra wrote ~460MB instead of 400MB.
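As a sanity check on the numbers above (figures taken from the test described in the question):

```python
rf = 2           # replication factor
keys = 100
value_mb = 2     # MB per value

expected_mb = keys * value_mb * rf   # one copy per replica
observed_mb = 460                    # measured via `nodetool status`

overhead = observed_mb / expected_mb - 1
print(expected_mb, f"{overhead:.0%}")   # 400 15%
```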

My question is: is this behavior (15% overhead) expected? Is there any configuration we need to tweak so that Cassandra properly handles concurrent writes to all the replicas?

Thanks!


1 Answer


I can think of two possible causes for the 15% of extra space.

One is that sometimes a replica will temporarily store two copies of a column. If you write a column twice in Cassandra at slightly different times, the two copies may go into different memtables and therefore end up in different SSTables on disk. Later, when the SSTables are merged by the compaction process, the older value is discarded, freeing up the space. In your test you can run `nodetool compact` to force a compaction and see if the space usage goes down.
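A toy illustration of that effect (the sizes and two-SSTable layout are assumed for the example, not taken from Cassandra's actual storage code): the same column written twice ends up in two SSTables, and a merge that keeps only the newest timestamp per key reclaims the space.

```python
# Each "SSTable" maps key -> (timestamp, value); the same key written at
# slightly different times lands in two different SSTables after flushing.
value = b"x" * (2 * 1024 * 1024)
sstable_a = {"KeyX": (1, value)}   # flushed from the first memtable
sstable_b = {"KeyX": (2, value)}   # flushed from the second memtable

def disk_usage(*sstables):
    return sum(len(v) for t in sstables for _, v in t.values())

def compact(*sstables):
    """Merge SSTables, keeping only the newest timestamp for each key."""
    merged = {}
    for table in sstables:
        for key, (ts, val) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, val)
    return merged

before_mb = disk_usage(sstable_a, sstable_b) // 2**20       # both copies
after_mb = disk_usage(compact(sstable_a, sstable_b)) // 2**20
print(before_mb, after_mb)   # 4 2
```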

The other possible cause depends on how you did the test when you weren't writing to both nodes. If you did it at consistency level ONE, it's possible some of the writes were dropped by the other replica, so it doesn't yet have all the keys. You can make sure it does by running `nodetool repair`. So the space used in your first observation may not have been for all of the keys.

You should be aware that writing to all replicas at consistency level ONE does not guarantee that each replica holds a copy. The node receiving the data does not have to store it in order to return success for the write, even if it is a replica. It may be overloaded (in your workload, most likely due to not having enough I/O to write out the data) and drop the write, while the write still succeeds on the other replica. This would cause less space to be used in your second observation, but it probably doesn't happen in your test since it is a relatively small amount of data.

If you need to guarantee that you have two copies, you should write at consistency level ALL, and only write once.
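The difference between the two consistency levels can be sketched as an ack-counting rule (a deliberate simplification of the coordinator's logic, not Cassandra's actual code):

```python
def write_succeeds(acks, replication_factor, consistency_level):
    """The coordinator reports success once enough replicas acknowledge."""
    required = {"ONE": 1, "ALL": replication_factor}[consistency_level]
    return acks >= required

# One overloaded replica drops the write, so only 1 of 2 replicas acks:
print(write_succeeds(1, 2, "ONE"))   # True  - success, but only one copy exists
print(write_succeeds(1, 2, "ALL"))   # False - the client sees the failure
```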

Answered 2013-08-13T07:08:59.537