
What pitfalls should be avoided? Do you have any deal-breakers? For example, I've heard that exporting/importing Cassandra data is very difficult, which makes me wonder whether that would hinder syncing production data to a development environment.

By the way, it's hard to find good tutorials on Cassandra; the only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still quite basic.

Thanks.


5 Answers


For me, the main thing is deciding whether to use the OrderedPartitioner or the RandomPartitioner.

If you use the RandomPartitioner, range scans are not possible. This means you must know the exact key for any activity, including cleaning up old data.

So if you have a lot of churn, then unless you have some magical way of knowing exactly which keys you've inserted things under, using the random partitioner makes it very easy to "lose" things, which causes a disk-space leak and will eventually consume all your storage.

On the other hand, you can ask the ordered partitioner "which keys do I have in column family X between A and B?" - and it will tell you. You can then clean them up.

However, there is a downside too. Since Cassandra doesn't do automatic load balancing, if you use the ordered partitioner, in all likelihood all your data will end up on just one or two nodes and none on the others, which means you'll be wasting resources.

I don't have any easy answer for this, except that in some cases you can get the "best of both worlds" by putting a short hash value (something you can easily enumerate from other data sources) at the beginning of the key - for example a 16-bit hex hash of the user ID - which gives you 4 hex digits, followed by whatever key you actually wanted to use.

Then if you have a list of recently deleted users, you can simply hash their IDs and do range scans to clean up anything related to them.
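The prefix trick above can be sketched in a few lines of Python. This is illustrative only: the `bucketed_key` helper and the in-memory dict standing in for a column family are my own assumptions, not any Cassandra client API.

```python
import hashlib

def bucketed_key(user_id: str, suffix: str) -> str:
    # 4 hex digits (16 bits) of md5 spread rows across the ring, yet the
    # prefix can be recomputed from the user ID alone at cleanup time.
    prefix = hashlib.md5(user_id.encode()).hexdigest()[:4]
    return f"{prefix}:{user_id}:{suffix}"

def cleanup_range(store: dict, user_id: str) -> list:
    # Simulate the ordered-partitioner range scan: all of a user's keys
    # share one computable prefix, so they form one contiguous key range.
    start = bucketed_key(user_id, "")   # lowest possible key for this user
    end = start + "\xff"                # just past the highest
    doomed = [k for k in sorted(store) if start <= k < end]
    for k in doomed:
        del store[k]
    return doomed

# Demo: two rows for alice, one for bob.
store = {bucketed_key("alice", "profile"): 1,
         bucketed_key("alice", "settings"): 2,
         bucketed_key("bob", "profile"): 3}
removed = cleanup_range(store, "alice")
```

The point is that the prefix is deterministic, so "enumerate deleted users, hash each, range-scan" recovers exactly the rows you could otherwise no longer find.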

The next tricky thing is secondary indexes - Cassandra doesn't have any - so if you need to look up X by Y, you need to insert the data under both keys, or have a pointer. Likewise, those pointers may need cleaning up when the thing they point at no longer exists, but there's no easy way to query things on that basis, so your app needs to Just Remember.
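A minimal sketch of such a hand-rolled index, using plain dicts in place of two column families (all names here are hypothetical):

```python
# Two in-memory "column families": the data rows and a hand-rolled index.
users_by_id = {}
ids_by_email = {}   # the manual "secondary index" (one pointer per email)

def insert_user(user_id, email, data):
    # Write the row AND its index entry; the application must keep both in sync.
    users_by_id[user_id] = {"email": email, **data}
    ids_by_email[email] = user_id

def lookup_by_email(email):
    # Follow the pointer; it dangles if the row was deleted without cleanup.
    user_id = ids_by_email.get(email)
    return users_by_id.get(user_id) if user_id is not None else None

def delete_user(user_id):
    # Deleting only the row would orphan the index pointer; the app has to
    # "Just Remember" to remove both.
    row = users_by_id.pop(user_id, None)
    if row:
        ids_by_email.pop(row["email"], None)

insert_user("u1", "a@example.com", {"name": "A"})
found = lookup_by_email("a@example.com")
delete_user("u1")
gone = lookup_by_email("a@example.com")
```

Every write path in the application has to carry this double-bookkeeping; forget it once and you get the orphaned keys the next paragraph describes.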

And application bugs can leave orphaned keys you've forgotten about, and you have no easy way of detecting them, unless you write some garbage collector that periodically scans every key in the database (that would take a while - but you can do it in chunks) to check for ones that are no longer needed.

None of this is based on actual usage, just what I've figured out during research. We don't use Cassandra in production.

EDIT: Cassandra now has secondary indexes in trunk.

Answered 2009-10-03T06:26:31.557

This is too long to add as a comment, so to clear up some misconceptions from the list-of-problems reply:

  1. Any client can connect to any node; if the first node you pick (or the one you reach through a load balancer) is down, simply connect to another. Additionally, a "fat client" API is available with which the client can route writes itself; an example is at http://wiki.apache.org/cassandra/ClientExamples

  2. Timing out when a server is unresponsive, rather than hanging indefinitely, is a feature most people who have dealt with overloaded RDBMS systems wish they had. The Cassandra RPC timeout is configurable; if you prefer, you can set it to several days and get the indefinite hang instead. :)

  3. It's true that there is no multidelete or truncate support yet, but patches for both are under review.

  4. Obviously there is a tradeoff in keeping load balanced across cluster nodes: the more perfectly you try to keep things balanced, the more data movement you will do, and that isn't free. By default, a new node in a Cassandra cluster will move to the optimal position in the token ring to minimize unevenness. In practice this has been shown to work well, and the larger the cluster, the less true the claim that doubling is optimal. This is covered more at http://wiki.apache.org/cassandra/Operations

Answered 2009-12-17T15:16:51.967

Do you have any deal-breakers? Not necessarily deal-breakers, but things to be aware of:

  1. A client connects to the nearest node, whose address it has to know in advance; all communication with all the other Cassandra nodes is proxied through it. a. Read/write traffic is not distributed evenly between nodes - some nodes proxy more data than they themselves host. b. If that node goes down, the client is helpless: it cannot read from or write to anywhere in the cluster.

  2. Although Cassandra claims that "writes never fail", they do fail, at least they did at the time of writing. If the target data node becomes sluggish, the request times out and the write fails. There are many reasons for a node to become unresponsive: the garbage collector kicks in, the compaction process, and so on... In all these cases, all write/read requests fail. In a conventional database these requests would just become proportionally slow, but in Cassandra they simply fail.

  3. There is multiget, but no multidelete, and you cannot truncate a ColumnFamily.

  4. When a new, empty data node joins the cluster, it takes over part of the data from only one neighboring node on the key ring. This leads to uneven data distribution and uneven load. You can work around it by always doubling the number of nodes. You should also track tokens manually and choose them wisely.

Answered 2009-11-05T22:36:52.260

Another tutorial is here: http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/.

Answered 2009-10-04T02:47:41.340

I think this deserves an update since Cassandra 1.2 came out recently.

I have been using Cassandra in production for the past 18 months, for social games.

My thought is that you have to use Cassandra for its strengths. So a good understanding of what it does and how it does it is necessary to see which data model to use, or even to identify whether another DB solution is more useful for you.

OrderedPartitioner is useful only if your application relies on key range queries, BUT you give up one of the most powerful features of Cassandra for that: automatic sharding and load balancing. Instead of row-key range queries, try to implement the same functionality using ranges of column names within the same row. TL;DR: reads/writes WILL NOT be balanced between nodes if you use this.

RandomPartitioner (md5 hashing) and Murmur3Partitioner (Murmur hashing, better and faster) are the way YOU MUST go if you want to support big data and high access frequencies. The only thing you give up is key range queries. Everything that is in the same row is still on the same node in the cluster, and you can use the comparator and column-name range queries on those. TL;DR: USE THIS for PROPER BALANCING; you give up nothing major.
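To see why a hashing partitioner balances load but forbids key range scans, here is a toy model of token derivation. It is simplified (real Cassandra maps the hash onto its token ring) and the helper name is my own:

```python
import hashlib

def random_partitioner_token(row_key: str) -> int:
    # The RandomPartitioner derives the placement token from the md5 of the
    # row key (simplified sketch; not a client API).
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

# Lexicographically adjacent keys...
keys = [f"user:{i:04d}" for i in range(8)]
tokens = [random_partitioner_token(k) for k in keys]
# ...scatter to unrelated points on the ring: no key-range scan can walk
# them in order, but no single node can accumulate a hot key prefix either.
```

The column-name range queries mentioned above still work because placement is decided per row key, not per column.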


Things you should know about Cassandra:

Cassandra is EVENTUALLY consistent. Cassandra has chosen to trade Consistency for high Availability and excellent Partitioning (http://en.wikipedia.org/wiki/CAP_theorem). BUT you can get consistency from Cassandra; it is all about your consistency policy when you read and write to it. This is quite an important and complex topic when talking about using Cassandra, but you can read about it in detail at http://www.datastax.com/docs/1.2/dml/data_consistency.

As a rule of thumb (and to keep it simple) I read and write at QUORUM ConsistencyLevel (since in my apps reads tend to be of the same order of frequency as writes). If your app is hugely write heavy and reads happen a lot less often then use write at ONE and read at ALL. Or if your use case is the opposite (writes are a lot less frequent than reads) then you can try read on ONE and write on ALL. Using ANY as a consistency level for writes is not a great idea if consistency is what you are trying to solve, since it guarantees that the mutation has reached the cluster but not that it has been written anywhere. This is the only case in which I got writes to silently fail on cassandra.
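The rules of thumb above all follow from the overlap condition R + W > N: a read and a write must touch at least one common replica to guarantee the read sees the write. A tiny sketch under that assumption, modeling only the standard level definitions for replication factor n:

```python
def replicas_contacted(level: str, n: int) -> int:
    # Replicas that must acknowledge at each level, for replication factor n
    # (a simplified model of Cassandra's consistency levels).
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

def read_sees_latest_write(write_level: str, read_level: str, n: int) -> bool:
    # Strong consistency holds when any read replica set must overlap any
    # write replica set: R + W > N.
    return replicas_contacted(read_level, n) + replicas_contacted(write_level, n) > n
```

With n = 3, QUORUM/QUORUM (2 + 2 > 3), ONE/ALL and ALL/ONE all satisfy the condition, while ONE/ONE does not; ANY isn't modeled here because, as noted above, it doesn't even guarantee the write reached a replica.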

Those are simple rules to make it easy to get started with cassandra development. To get as much consistency and performance as possible from a production cluster you should study this topic hard and really understand it yourself.

If you need a human readable datamodel with complex relations between Entities (tables) then I do not think Cassandra is for you. MySQL and maybe NewSQL might be more helpful for your use case.

A good thing to know is, roughly, how Cassandra saves and reads data. Whenever you write (deletes are actually writes of a "tombstone" value in Cassandra), the system puts the new value and its timestamp in a new physical location.

When you read, Cassandra tries to pull all the writes for a certain key/column_name location and returns the most recent one it can find (the one with the highest timestamp, which is supplied by the client). So the memory needed by a node depends directly on the frequency of writes. There is a compaction process in Cassandra that takes care of cleaning up old mutations. Cassandra has an internal cache that is updated on reads with the latest value of the location.
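The timestamp resolution above can be modeled in a line; a toy sketch of last-write-wins reconciliation (not driver code):

```python
def reconcile(versions):
    # Return the mutation with the highest client-supplied timestamp, the way
    # a read merges the copies it finds (toy last-write-wins resolution).
    return max(versions, key=lambda v: v[0])   # versions: (timestamp, value)

# Three mutations of one location, written out of order:
history = [(100, "draft"), (250, "published"), (180, "in review")]
latest = reconcile(history)
```

This also makes clear why client clocks matter: the winner is whoever stamped the highest number, not whoever physically wrote last.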

The merging/compaction on disk of the SSTables (the data structures that persist the data) can be provoked by reads, but it's better not to count on it. The cleaning of tombstones and expired columns (using the time-to-live functionality) is a different mechanism managed by the garbage collector (see the GC grace time setting for more details).


This brings me to the last point I want to make: Be sure that your writes and read will be balanced across your cluster!

Let's assume that all your users need to update a single location very frequently.
DO NOT map that theoretical single location to only one row key! This would make all your writes fall on only one node in your cluster. If it doesn't bring everything down (because you have rockstar sysops) it will at least heavily cripple the cluster's performance.
My advice is to bucket your writes in enough different row keys that you will distribute your writes across all nodes in the cluster. To retrieve all data for that single theoretical location use a multi_get on all the "sub row keys".

Example :
I want to have a list of all active http sessions (each of which has a UUID assigned to it). Do not save them all into one "session" row key. What I use as a row key for my cassandra cluster of 6 nodes is : _sessions. Then I have a small 16-key multi_get to retrieve all active sessions, or I can still tell whether a session is active with just a simple get (if I know its UUID, of course). If your cluster is a lot bigger, you might want to use a hash function for generating bucket keys.
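A sketch of this bucketing scheme, with in-memory dicts standing in for the cluster and a hypothetical `bucket_key` helper (the answer elides its exact row-key format, so the naming here is assumed):

```python
import hashlib

N_BUCKETS = 16   # matches the answer's small 16-key multi_get

# In-memory stand-in for the sessions column family: one row per bucket.
rows = {f"{i}_sessions": {} for i in range(N_BUCKETS)}

def bucket_key(session_uuid: str) -> str:
    # Which of the 16 session rows this session lives in.
    bucket = int(hashlib.md5(session_uuid.encode()).hexdigest(), 16) % N_BUCKETS
    return f"{bucket}_sessions"

def add_session(session_uuid: str, data: str) -> None:
    # Writes spread across 16 row keys, hence across the whole cluster.
    rows[bucket_key(session_uuid)][session_uuid] = data

def all_active_sessions() -> dict:
    # The multi_get across all 16 bucket rows, merged client-side.
    merged = {}
    for i in range(N_BUCKETS):
        merged.update(rows[f"{i}_sessions"])
    return merged

for i, sid in enumerate(["s-1", "s-2", "s-3"]):
    add_session(sid, f"payload-{i}")
```

Because the bucket is derived from the session UUID, a point lookup still needs only a single get against one row.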

Answered 2013-04-04T14:33:39.460