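We are using DataStax Cassandra for our social network, and we are currently designing (data modeling) the tables we need. This has been confusing for us: we don't know how to design some of the tables, and we have run into a few small problems.

As we understand it, every query needs its own table. For example, user A follows users C and B.

Currently, in Cassandra, we have a table called posts_by_user: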

user_id  |  post_id  |  text  |  created_on  |  deleted  |  view_count  |  likes_count  |  comments_count  |  user_full_name

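We also have a table organized by each user's followers, called user_timeline, into which we insert the post information. When a follower visits the home page, we fetch the posts from the user_timeline table.

This is the user_timeline table: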

follower_id  |  post_id  |  user_id (who posted)  |  likes_count  |  comments_count  |  location_name  |  user_full_name

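First, is this data model correct for a follow-based (followers/following) social network?

Now we want to count the likes of a post. As you can see, we store the like count in both tables (user_timeline and posts_by_user). Suppose a user has 1000 followers: on every like we then have to update all 1000 rows in user_timeline plus 1 row in posts_by_user. That is not reasonable!

So my second question is: how should this be done? I mean, what should the likes table look like?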

1 Answer

Think of posts_by_user as the metadata for a post. It would hold user_id, post_id, message_text, etc., while view_count, likes_count, and comments_count are moved out into a separate counter table. As long as you have the post_id, you can fetch either a post's metadata or its counters, and a like only has to update a single counter record instead of one row per follower.

DSE Counter Documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
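As a rough sketch of how that counter table would be used, the statements below assume the post_counts table from the example schema at the end of this answer. The UUID is a made-up placeholder, not a value from your system.

```cql
-- Record one like: a single write, no matter how many followers the author has.
-- The counter row is created implicitly on the first increment.
UPDATE social_media.post_counts
   SET likes_count = likes_count + 1
 WHERE post_id = 123e4567-e89b-12d3-a456-426614174000;

-- Undo a like.
UPDATE social_media.post_counts
   SET likes_count = likes_count - 1
 WHERE post_id = 123e4567-e89b-12d3-a456-426614174000;

-- Read all counters for the post when rendering it.
-- Counters that were never incremented come back as null.
SELECT likes_count, view_count, comments_count
  FROM social_media.post_counts
 WHERE post_id = 123e4567-e89b-12d3-a456-426614174000;
```

Keep in mind that counter updates are not idempotent, so a retried write after a timeout can over-count. If you need to know who liked a post, you would probably want a separate table for that.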

However, there are a few things to take into consideration when answering this question, many of which depend on the internals of your system and how your queries are structured. The article linked below is a really good starting point for data modeling in Cassandra. Its first two rules are stated as:

Rule 1: Spread Data Evenly Around the Cluster

Rule 2: Minimize the Number of Partitions Read

Take a moment to consider the key options for the "user_timeline" table:

  1. user_id and created_on as a COMPOUND KEY* - This would be ideal if
    • you want to query for posts by a certain user, on the assumption that you have a decent number of users. Records would be distributed evenly, and each query would hit only one partition.
  2. user_id and a hash_prefix as a COMPOUND KEY* - This would be ideal if
    • you have a small number of users with a large number of posts each. The prefix lets your data spread evenly across the cluster, but you run the risk of having to query across multiple partitions.
  3. follower_id and created_on as a COMPOUND KEY* - This would be ideal if
    • you want to query for the posts that a certain follower should see. Records would be distributed, and you would minimize queries across partitions.
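The three options could be sketched in CQL roughly as follows. The table names ending in _sketch are hypothetical, the social_media keyspace is assumed to exist, and treating the first column as the partition key with created_on as a clustering column is my reading of "compound key" here. Adding post_id as a last clustering column is an extra assumption, made so that two posts with the same timestamp do not overwrite each other.

```cql
-- Option 1: one partition per author, newest posts first.
CREATE TABLE IF NOT EXISTS social_media.posts_by_user_sketch (
    user_id uuid,
    created_on timestamp,
    post_id uuid,
    message_text text,
    PRIMARY KEY ((user_id), created_on, post_id)
) WITH CLUSTERING ORDER BY (created_on DESC, post_id ASC);

-- Option 2: split a heavy author across several partitions.
-- hash_prefix is computed by the application (for example a hash of post_id
-- modulo a fixed bucket count); reading all posts means querying every bucket.
CREATE TABLE IF NOT EXISTS social_media.posts_by_user_bucketed_sketch (
    user_id uuid,
    hash_prefix int,
    created_on timestamp,
    post_id uuid,
    message_text text,
    PRIMARY KEY ((user_id, hash_prefix), created_on, post_id)
) WITH CLUSTERING ORDER BY (created_on DESC, post_id ASC);

-- Option 3: one partition per follower, i.e. that follower's timeline.
CREATE TABLE IF NOT EXISTS social_media.user_timeline_sketch (
    follower_id uuid,
    created_on timestamp,
    post_id uuid,
    user_id uuid,          -- author of the post
    user_full_name text,
    PRIMARY KEY ((follower_id), created_on, post_id)
) WITH CLUSTERING ORDER BY (created_on DESC, post_id ASC);

-- Home page query for option 3: a single-partition read of the latest 20 posts.
SELECT post_id, user_id, user_full_name, created_on
  FROM social_media.user_timeline_sketch
 WHERE follower_id = 123e4567-e89b-12d3-a456-426614174000
 LIMIT 20;
```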

These were 3 examples for 1 table, and the point I want to convey is that you should design your tables around the queries you want to execute. Also, don't be afraid to duplicate your data across multiple tables that are set up to handle different queries; this is how Cassandra was meant to be modeled. Take some time to read the article below and watch the DataStax Academy Data Modeling Course to familiarize yourself with the nuances. I also included an example schema below to cover the basic counter schema I pointed out earlier.

* The reason for the compound key is that your PRIMARY KEY has to be unique; otherwise an INSERT with an existing PRIMARY KEY becomes an UPDATE.
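A small demonstration of that upsert behaviour, using a hypothetical upsert_demo table in the social_media keyspace (assumed to exist) and placeholder values:

```cql
CREATE TABLE IF NOT EXISTS social_media.upsert_demo (
    user_id uuid,
    created_on timestamp,
    message_text text,
    PRIMARY KEY ((user_id), created_on)
);

-- First write for this primary key.
INSERT INTO social_media.upsert_demo (user_id, created_on, message_text)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2016-06-14 17:00:00+0000', 'first');

-- Same primary key again: no error is raised, the existing row is overwritten.
INSERT INTO social_media.upsert_demo (user_id, created_on, message_text)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2016-06-14 17:00:00+0000', 'second');

-- The partition now holds a single row whose message_text is 'second'.
SELECT created_on, message_text
  FROM social_media.upsert_demo
 WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- A lightweight transaction refuses to overwrite, at the cost of extra latency.
INSERT INTO social_media.upsert_demo (user_id, created_on, message_text)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2016-06-14 17:00:00+0000', 'third')
IF NOT EXISTS;
```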

http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

https://academy.datastax.com/courses

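-- Post metadata, one partition per author.
-- user_id is the partition key and created_on the clustering column, so
-- "all posts by a user" is a single-partition query (option 1 above).
CREATE TABLE IF NOT EXISTS social_media.posts_by_user (
    user_id uuid,
    post_id uuid,
    message_text text,
    created_on timestamp,
    deleted boolean,
    user_full_name text,
    PRIMARY KEY ((user_id), created_on)
);

-- Timeline, one partition per follower (option 3 above).
-- Keyed by follower_id so rows for different followers of the same post
-- do not overwrite each other.
CREATE TABLE IF NOT EXISTS social_media.user_timeline (
    follower_id uuid,
    post_id uuid,
    user_id uuid,          -- author of the post
    location_name text,
    user_full_name text,
    created_on timestamp,
    PRIMARY KEY ((follower_id), created_on)
);

-- Counters live in their own table: in a counter table every
-- non-key column must be a counter.
CREATE TABLE IF NOT EXISTS social_media.post_counts (
    post_id uuid,
    likes_count counter,
    view_count counter,
    comments_count counter,
    PRIMARY KEY (post_id)
);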
Answered 2016-06-14T17:57:52.347