nosql - Column family concept and data model

Question

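I'm researching the different types of NoSQL databases and I'm trying to wrap my head around the data model of column-family stores, such as Bigtable, HBase and Cassandra.

First model

Some people describe a column family as a collection of rows, where each row contains columns [1], [2]. An example of this model (column families are uppercased):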

{
  "USER":
  {
    "codinghorror": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "jonskeet": { "name": "Jon Skeet", "email": "jskeet@site.com" }
  },
  "BOOKMARK":
  {
    "codinghorror":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    },
    "jonskeet":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}

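Second model

Other sites describe a column family as a group of related columns within a row [3], [4]. The data from the previous example, modeled in this fashion: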

{
  "codinghorror":
  {
    "USER": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "BOOKMARK":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    }
  },
  "jonskeet":
  {
    "USER": { "name": "Jon Skeet", "email": "jskeet@site.com" },
    "BOOKMARK":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}

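A possible rationale behind the first model is that not all column families have a relation like USER and BOOKMARK do. This implies that not all column families contain identical keys. From this point of view, placing the column families at the outer level feels more natural.

The name "column family" implies a group of columns. This is exactly how column families are presented in the second model.

Both models are valid representations of the data. I realize that these representations are merely for communicating the data to humans; an application doesn't "think" of data in this way.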
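To make the "both are valid representations" point concrete, here is a minimal Python sketch showing that the two models differ only in which key sits on the outside. The helper name `swap_outer_keys` is made up for illustration, and the bookmark data is abbreviated from the examples above.

```python
from collections import defaultdict

# Model 1: column family -> row key -> columns
by_family = {
    "USER": {
        "codinghorror": {"name": "Jeff", "blog": "http://codinghorror.com/"},
        "jonskeet": {"name": "Jon Skeet", "email": "jskeet@site.com"},
    },
    "BOOKMARK": {
        "codinghorror": {"http://codinghorror.com/": "My awesome blog"},
        "jonskeet": {"http://manning.com/skeet2/": "C# in Depth, Second Edition"},
    },
}

def swap_outer_keys(nested):
    """Turn {a: {b: cols}} into {b: {a: cols}} without touching cols."""
    swapped = defaultdict(dict)
    for outer, inner in nested.items():
        for key, columns in inner.items():
            swapped[key][outer] = columns
    return dict(swapped)

# Model 2: row key -> column family -> columns
by_row = swap_outer_keys(by_family)
```

Applying `swap_outer_keys` to `by_row` gives back the first model, so no information is gained or lost by choosing one nesting over the other.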

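Question

What is the "standard" definition of a column family? Is it a collection of rows, or a group of related columns within a row?

I have to write a paper on the subject, so I'm also interested in how people usually explain the "column family" concept to others. The two models seem to contradict each other. I'd like to use the "correct" or generally accepted model to describe column-family stores.

Update

I've settled on the second model to explain the data model in my paper. I'm still interested in how you explain the data model of column-family stores to other people.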

score 16 · Accepted Answer

The Cassandra database follows your first model, I think. A ColumnFamily is a collection of rows, each of which can contain any columns, in a sparse fashion (so each row can have a different collection of column names, if desired). The number of columns allowed in a row is almost unlimited (2 billion in Cassandra v0.7).

A key point is that row keys must be unique within a column family, by definition - but can be re-used in other column families. So you can store unrelated data about the same key in different ColumnFamilies.

In Cassandra this matters because the data in a particular column family is stored in the same files on disk - so it is more efficient to place data items that are likely to be retrieved together, in the same ColumnFamily. This is partly a practical speed concern, but also a matter of organising your data into a clear schema. This touches upon your second definition - one might consider all the data about a particular key to be a "row", but partitioned by Column Family. However, in Cassandra it is not really a single row, because the data in one ColumnFamily can be changed independently of the data in other ColumnFamilies for the same row key.
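As a rough illustration of the points above (not Cassandra's API), here is a small in-memory Python sketch. The `keyspace` dict and the `put` helper are hypothetical names; they only model "column family -> row key -> sparse columns".

```python
# Hypothetical in-memory stand-in: column family -> row key -> columns.
keyspace = {"USER": {}, "BOOKMARK": {}}

def put(family, row_key, column, value):
    """Set one column of one row inside one column family."""
    keyspace[family].setdefault(row_key, {})[column] = value

put("USER", "jonskeet", "name", "Jon Skeet")
# Rows are sparse: this row uses a column the previous row does not have.
put("USER", "codinghorror", "blog", "http://codinghorror.com/")
put("BOOKMARK", "jonskeet", "http://manning.com/skeet2/", "C# in Depth, Second Edition")

# The row key "jonskeet" is reused in both families, but this write
# touches only the USER family; BOOKMARK is left exactly as it was.
put("USER", "jonskeet", "email", "jskeet@site.com")
```

Each family keeps its own set of row keys, which is why the data for "jonskeet" in one family can change without the other family being involved.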

score 10 · Accepted Answer

Both of the models you describe are the same.

A column family is:

Key -> Key -> (Set of key/value pairs)

Conceptually, it becomes:

Table -> Row -> (Column1/Value1, Column2/Value2, ...)

Think of it as a map of maps of key/value pairs.

UserProfile = {
    Cassandra = [emailAddress:"cassandra@apache.org", age:20],
    TerryCho = [emailAddress:"terry.cho@apache.org", gender:"male"],
    Cath = [emailAddress:"cath@apache.org", age:20, gender:"female", address:"Seoul"],
}

The above is an example of a column family. If you were to tabulate it, you'd get a table called UserProfile that looks like:

UserName | Email | Age | Gender | Address
Cassandra | cassandra@apache.org | 20 | null | null
TerryCho | terry.cho@apache.org | null | male | null
Cath | cath@apache.org | 20 | female | Seoul
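The tabulation step can be sketched in Python as follows. The `tabulate` function and the `rowKey` header label are illustrative choices, not part of any Cassandra tooling; `None` plays the role of `null` in the table above.

```python
user_profile = {
    "Cassandra": {"emailAddress": "cassandra@apache.org", "age": 20},
    "TerryCho": {"emailAddress": "terry.cho@apache.org", "gender": "male"},
    "Cath": {"emailAddress": "cath@apache.org", "age": 20,
             "gender": "female", "address": "Seoul"},
}

def tabulate(column_family):
    """Return (header, rows), padding columns a row does not have with None."""
    columns = []
    for row in column_family.values():
        for name in row:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    rows = [[key] + [row.get(name) for name in columns]
            for key, row in column_family.items()]
    return ["rowKey"] + columns, rows

header, rows = tabulate(user_profile)
for line in [header] + rows:
    print(" | ".join(str(cell) for cell in line))
```

The nulls only appear at this point: the map itself stores nothing for a column a row lacks, and the padding is purely an artifact of forcing the data into a rectangular shape.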

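The confusing part is that there isn't really a column or a row in the sense we're used to. What there is, is a bunch of "column families" that are queried by name (the key). Each family contains a set of key/value pairs that is also queried by name (the row key), and finally, each value in the set can also be looked up by name (the column key).

If you need a tabular reference point, the "column family" would be your "table". Each "set of k/v pairs" inside it would be your "rows". Each "pair of the set" would be a "column name and its value".

Internally, the data inside each column family is stored together, laid out so that rows are stored one after the other, and within each row, columns are stored one after the other. So you get row1 -> col1/val1, col2/val2, ... , row2 -> col1/val1 ... , ... -> ... . In that sense, the data is stored much more like a row store than a column store.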
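A toy Python sketch of that row-after-row layout, for one column family. Sorting row keys and column names alphabetically is a simplifying assumption made here to get a deterministic order; a real store decides ordering with its own partitioner and comparator.

```python
def storage_order(column_family):
    """Yield (row key, column, value) in row-major order: all columns of
    one row, then all columns of the next row."""
    for row_key in sorted(column_family):            # assumption: rows sorted by key
        for column in sorted(column_family[row_key]):  # assumption: columns sorted by name
            yield row_key, column, column_family[row_key][column]

user_profile = {
    "TerryCho": {"gender": "male", "emailAddress": "terry.cho@apache.org"},
    "Cath": {"age": 20, "address": "Seoul"},
}

cells = list(storage_order(user_profile))
for cell in cells:
    print(cell)
```

All cells of one row come out next to each other, which is the sense in which the layout resembles a row store.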

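Finally, the choice of words here is unfortunate and misleading. Columns in a column family should have been called attributes. Rows should have been called attribute sets. Column families should have been called attribute families. The relation to classic tabular vocabulary is weak and misleading, because the model is actually very different.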

score 2 · Accepted Answer

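As I understand it, a Cassandra ColumnFamily is not a collection of rows but rather a cluster of columns. Columns are clustered together based on the clustering key. For example, consider the column family below: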

CREATE TABLE store (
  enrollmentId int,
  roleId int,
  name text,
  age int,
  occupation text,
  resume blob,
  PRIMARY KEY ((enrollmentId, roleId), name)
) ;


INSERT INTO store (enrollmentid, roleid, name, age, occupation, resume)
values (10293483, 01, 'John Smith', 26, 'Teacher', 0x7b22494d4549);

Fetching the row inserted above with cassandra-cli shows that it is neatly clustered by the clustering key; in this example "name = John Smith" is the clustering key.

RowKey: 10293483:1
=> (name=John Smith:, value=, timestamp=1415104618399000)
=> (name=John Smith:age, value=0000001a, timestamp=1415104618399000)
=> (name=John Smith:occupation, value=54656163686572, timestamp=1415104618399000)
=> (name=John Smith:resume, value=7b22494d4549, timestamp=1415104618399000)
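To show how the CQL row maps onto the cells in that listing, here is a Python sketch that rebuilds the cell names and hex values. It only imitates the output shown and is not Cassandra's serialization code. The function name `to_cells` and the encoding rules in its docstring are assumptions chosen to match the listing.

```python
import struct

def to_cells(partition_key, clustering_value, regular_columns):
    """Build (row key, [(cell name, hex value)]) in the style of the listing.

    Encoding assumptions: int -> 4-byte big-endian, str -> UTF-8, bytes as-is.
    """
    def encode(value):
        if isinstance(value, int):
            return struct.pack(">i", value).hex()
        if isinstance(value, str):
            return value.encode("utf-8").hex()
        return bytes(value).hex()

    # The composite partition key parts are joined with ':' to form the row key.
    row_key = ":".join(str(part) for part in partition_key)
    # First cell: an empty marker carrying only the clustering value.
    cells = [(f"{clustering_value}:", "")]
    # Every other column becomes a cell named "<clustering value>:<column>".
    for column in sorted(regular_columns):
        cells.append((f"{clustering_value}:{column}", encode(regular_columns[column])))
    return row_key, cells

row_key, cells = to_cells(
    (10293483, 1),
    "John Smith",
    {"age": 26, "occupation": "Teacher", "resume": bytes.fromhex("7b22494d4549")},
)
```

Every cell name starts with the clustering value, so all columns belonging to one CQL row sit next to each other inside the same storage row.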
