I have the following schema for posts. Each post has an embedded author and attachments (array of links / videos / photos etc).

{
    "content": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure http:\/\/t.co\/tbsSrVYneK by @psawers",
    "author": {
        "username": "TheNextWeb",
        "id": "10876852",
        "name": "The Next Web",
        "photo": "https:\/\/pbs.twimg.com\/profile_images\/378800000147133877\/895fa7d3daeed8d32b7c089d9b3e976e_bigger.png",
        "url": "https:\/\/twitter.com\/account\/redirect_by_id?id=10876852",
        "description": "",
        "serviceName": "twitter"
    },
    "attachments": [
      {
        "title": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure",
        "description": "Pixable, the SingTel-owned company that organizes your social photos in smart ways, has announced a quick-import tool for Everpix users following the company's decision to close ...",
        "url": "http:\/\/t.co\/tbsSrVYneK",
        "type": "link",
        "photo": "http:\/\/cdn1.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2013\/09\/camera1-.jpg"
      }
    ]
}

Posts are read often: we have a view with 4 tabs, and each tab shows 24 posts. Currently we index these lists in Redis, so querying 4 x 24 posts is as simple as fetching the lists from Redis (each returns a list of MongoDB ids) and then querying the posts by those ids.
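One detail of this read path worth sketching: a lookup by a set of ids (for example a MongoDB `$in` query) does not return documents in the order of the id list, so the Redis ordering has to be reapplied afterwards. The helper below is a minimal sketch of that step; the function name and the `_id` key are assumptions for illustration, and the inputs are plain Python lists and dicts rather than real driver results.

```python
def order_posts_by_ids(ids, posts):
    """Return posts in the order given by ids (hypothetical helper).

    ids   -- list of post ids, e.g. as read from a Redis list
    posts -- iterable of post dicts, in arbitrary order, each with an "_id" key
    """
    by_id = {post["_id"]: post for post in posts}
    # Skip ids whose post was not returned (for example, a deleted post).
    return [by_id[post_id] for post_id in ids if post_id in by_id]
```

With 24 ids per tab this reordering is cheap compared to the network round trips.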

Updates on the embedded author happen rarely (for example when the author changes his picture). The updates do not have to be instantaneous or even fast.
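As a rough sketch of what such a rare update looks like with the embedded layout, the function below builds a filter and a `$set` document that would touch every post by one author. The function name is hypothetical; the field names (`author.id`, `author.photo`) follow the schema above.

```python
def author_update_spec(author_id, changed_fields):
    """Build (filter, update) documents for a bulk update of an embedded author.

    author_id      -- value of author.id in the post documents
    changed_fields -- dict of author fields to overwrite, e.g. {"photo": "..."}
    """
    filter_doc = {"author.id": author_id}
    # Dotted paths change only the named fields and leave the rest of the author intact.
    update_doc = {"$set": {f"author.{key}": value for key, value in changed_fields.items()}}
    return filter_doc, update_doc
```

With PyMongo, the two documents could be passed to `update_many(filter_doc, update_doc)` on the posts collection. An index on `author.id` would likely be needed to keep this from scanning every post.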

We're wondering whether we should split the author and the post into two different collections, so that a post would hold a reference to its author instead of an embedded, duplicated copy. Is a normalized design preferred here, given that the current de-normalized design duplicates the author in every post, resulting in a lot of duplicated data and extra bytes? Or should we continue with the de-normalized state?
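To make the two options concrete, here is a small sketch of the split using plain dicts. Both function names and the `author_id` field are assumptions for illustration, not part of the current schema. The point is that the normalized layout adds a second lookup and a merge step to every read.

```python
def split_post(post):
    """Split an embedded post into (normalized_post, author_doc)."""
    author = dict(post["author"])
    normalized = {key: value for key, value in post.items() if key != "author"}
    normalized["author_id"] = author["id"]  # reference instead of embedded copy
    return normalized, author


def attach_authors(posts, authors_by_id):
    """Re-attach authors to normalized posts.

    authors_by_id would come from a second query against an authors collection.
    """
    return [{**post, "author": authors_by_id[post["author_id"]]} for post in posts]
```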

1 Answer

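Since your reads seem to outnumber your writes by several orders of magnitude, it probably makes little sense to split this data into two collections. Especially since updates are rare and you need almost all of the author information when displaying a post, one query will be faster than two. You also get data locality, so you may need less data in memory as well, which should provide a further benefit.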

However, you can only really find out by benchmarking it with the amount of data that you will be using in production.
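A minimal timing harness for such a comparison might look like the sketch below. It is only an illustration: `median_seconds` is a made-up helper, and the callables passed to it would have to run the real embedded and referenced read paths against production-sized data.

```python
import time


def median_seconds(fn, repeats=50):
    """Call fn() `repeats` times and return the median wall-clock duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]
```

You would call it once with a function that reads 24 posts with embedded authors and once with a function that reads 24 posts plus their referenced authors, then compare the two medians. The median is used rather than the mean so that a few slow outliers (cold caches, GC pauses) do not dominate the result.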

Answered 2013-11-26T11:42:26.763