
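For some background - this question concerns a project running on a single small EC2 instance, which is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.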

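The data model looks like the following - a large amount of real-time data comes in from various networked sensors, and ideally I'd like to set up a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of calculating the stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store the statistics in a few other tables. All of this is rendered using Django.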

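Relational features I would need -

  • [SliceRange in Cassandra's API seems to satisfy this]
  • Group by
  • Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
  • Sphinx gives me a nice full-text engine on this, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]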

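My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily over time. Vertically scaling MySQL is not trivial in that sense (or cheap).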

So essentially, after having read a lot about NoSQL and having tried things like MongoDB, Cassandra and Voldemort, my questions are:

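  • On a medium EC2 instance, would I gain any benefit in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest so. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app's performance currently takes a hit when MySQL does joins on large tables, even with indexes created - something like 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O too.) The tables are around 4-5 million rows each, and there are about 5 such tables.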

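  • Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But for a project that is just beginning to grow, does it make sense to deploy a single-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]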

  • If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia", since I'd have to do multiple lookups to fetch rows.

  • Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could use the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
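
To make that last option concrete, here is a minimal sketch of the "relational engine as key-value store" idea. It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, and the table names, `put`, `get` and `find_by_sensor` are all hypothetical names made up for illustration, not taken from the FriendFeed post. Each entity is stored as one opaque JSON blob, and the single attribute that gets queried is kept in a separate, hand-maintained index table.

```python
import json
import sqlite3
import uuid


def open_store():
    """Create an in-memory database (sqlite3 here; assumed stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    # The engine only sees a key and an opaque serialized body.
    conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    # Hand-maintained index for the one attribute we want to look up by.
    conn.execute(
        "CREATE TABLE index_sensor (sensor_id TEXT NOT NULL, entity_id TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_sensor ON index_sensor (sensor_id)")
    return conn


def put(conn, entity):
    """Insert or overwrite an entity and refresh its index row."""
    entity_id = entity.get("id") or uuid.uuid4().hex
    entity = dict(entity, id=entity_id)
    with conn:  # one transaction for the blob and its index entry
        # INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO.
        conn.execute(
            "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        conn.execute("DELETE FROM index_sensor WHERE entity_id = ?", (entity_id,))
        conn.execute(
            "INSERT INTO index_sensor (sensor_id, entity_id) VALUES (?, ?)",
            (entity["sensor_id"], entity_id),
        )
    return entity_id


def get(conn, entity_id):
    """Fetch one entity by key, or None if it does not exist."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find_by_sensor(conn, sensor_id):
    """Look up entities through the index table."""
    rows = conn.execute(
        "SELECT e.body FROM index_sensor i "
        "JOIN entities e ON e.id = i.entity_id "
        "WHERE i.sensor_id = ?",
        (sensor_id,),
    ).fetchall()
    return [json.loads(r[0]) for r in rows]
```

The trade-off in this sketch is that adding a new field needs no schema change, but every attribute you want to filter on needs its own index table that the application keeps in sync.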

Any insights from people who've done a shift would be greatly appreciated!

Thanks.


3 Answers


Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
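
As a rough illustration of what denormalizing for a key/column layout can look like, here is a toy in-memory sketch in Python. It is not the Cassandra API; `WideRowStore`, `insert` and `slice` are hypothetical names. Each row key holds columns kept sorted by name (here a timestamp), and data that would otherwise be joined in, such as a sensor's location, is copied into every reading.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Toy model of a column family: row key -> columns sorted by column name."""

    def __init__(self):
        self._names = defaultdict(list)   # row key -> sorted column names
        self._values = defaultdict(dict)  # row key -> {column name: value}

    def insert(self, row_key, column, value):
        # Keep column names sorted so range reads need no ORDER BY step.
        if column not in self._values[row_key]:
            bisect.insort(self._names[row_key], column)
        self._values[row_key][column] = value

    def slice(self, row_key, start, finish, count=100):
        """Return up to `count` (column, value) pairs with start <= column <= finish."""
        names = self._names.get(row_key, [])
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, finish)
        return [(n, self._values[row_key][n]) for n in names[lo:hi][:count]]


store = WideRowStore()
# Denormalized: "location" is repeated in each reading instead of joined in.
store.insert("sensor:42", 1267100600, {"temp": 22.1, "location": "roof"})
store.insert("sensor:42", 1267100000, {"temp": 21.5, "location": "roof"})
store.insert("sensor:42", 1267101200, {"temp": 22.8, "location": "roof"})

# One lookup by row key, already ordered by timestamp.
recent = store.slice("sensor:42", 1267100000, 1267100600)
```

The point of the sketch is the access pattern: reads become a single keyed range scan, at the cost of duplicating data on write.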

However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
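
For readers unfamiliar with the map/reduce style, here is a plain-Python sketch of the shape such an analytics job takes. It does not use Hadoop or Cassandra at all; `map_reading`, `reduce_values` and `run_job` are hypothetical names, and the reading format is an assumption. It computes per-sensor statistics, which is roughly what a GROUP BY would do in SQL.

```python
from collections import defaultdict


def map_reading(reading):
    """Map phase: emit (key, value) pairs for one raw reading."""
    yield reading["sensor_id"], reading["value"]


def reduce_values(sensor_id, values):
    """Reduce phase: collapse all values for one key into summary stats."""
    values = list(values)
    return sensor_id, {
        "count": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }


def run_job(readings):
    """Run map, group by key (the 'shuffle'), then reduce, all in one process."""
    groups = defaultdict(list)
    for reading in readings:
        for key, value in map_reading(reading):
            groups[key].append(value)
    return dict(reduce_values(key, values) for key, values in groups.items())


stats = run_job([
    {"sensor_id": "a", "value": 1.0},
    {"sensor_id": "a", "value": 3.0},
    {"sensor_id": "b", "value": 10.0},
])
```

In a real cluster the map and reduce functions run on many nodes and the grouping step happens over the network; here everything runs locally just to show the structure.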

Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.

That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)

There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.

answered 2010-02-25T14:37:30.350

If you're a relational database developer (as I am), I'd suggest/point out:

  • Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
  • It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
  • Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.

These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.

Some good resources I've found include:

answered 2011-05-06T01:25:44.367

django-cassandra is still in early beta. Also, Django wasn't made for NoSQL databases. The Django ORM is built around SQL (Django recommends using PostgreSQL). If you need to use ONLY NoSQL (you can mix SQL and NoSQL in the same app), you'll have to take the risk of using a NoSQL ORM (which is significantly slower than a traditional SQL ORM or direct use of the NoSQL store), or completely rewrite the Django ORM. But in that case I can't see why you'd need Django. Maybe you can use something else, like Tornado?

answered 2013-01-11T13:36:58.203