I've been browsing the net trying to find a solution that will allow us to generate unique IDs in a regionally distributed environment.

I looked at the following options (among others):

SNOWFLAKE (by Twitter)

  • It seems like a great solution, but I don't like the added complexity of having to manage another piece of software just to create IDs;
  • It lacks documentation at this stage, so I don't think it will be a good investment;
  • The nodes need to be able to communicate with one another using ZooKeeper (what about latency / communication failure?)

UUID

  • Just look at it: 550e8400-e29b-41d4-a716-446655440000;
  • It's a 128-bit ID;
  • There have been some known collisions (depending on the version, I guess); see this post.

AUTOINCREMENT IN RELATIONAL DATABASE LIKE MYSQL

  • This seems safe, but unfortunately, we are not using relational databases (scalability preferences);
  • We could deploy a MySQL server for this, as Flickr does, but again, this introduces another point of failure / bottleneck, as well as added complexity.

AUTOINCREMENT IN A NON-RELATIONAL DATABASE LIKE COUCHBASE

  • This could work since we are using Couchbase as our database server, but;
  • This will not work when we have more than one cluster in different regions: with latency issues and network failures, IDs will collide at some point, depending on the amount of traffic;

MY PROPOSED SOLUTION (this is what I need help with)

Let's say that we have clusters consisting of 10 Couchbase nodes and 10 application nodes in each of 5 different regions (Africa, Europe, Asia, America and Oceania). This is to ensure that content is served from the location closest to the user (to boost speed) and to ensure redundancy in case of disasters etc.

Now, the task is to generate IDs that won't collide when the replication (and balancing) occurs, and I think this can be achieved in 3 steps:

Step 1

All regions will be assigned integer IDs (unique identifiers):

  • 1 - Africa;
  • 2 - America;
  • 3 - Asia;
  • 4 - Europe;
  • 5 - Oceania.

Step 2

Assign an ID to every application node that is added to the cluster, keeping in mind that there may be up to 99 999 servers in one cluster (I doubt it, but just as a safety precaution). This will look something like this (fake IPs):

  • 00001 - 192.187.22.14
  • 00002 - 164.254.58.22
  • 00003 - 142.77.22.45
  • and so forth.

Please note that these node IDs are scoped to a single cluster, which means that every region can have its own node 00001.

Step 3

For every record inserted into the database, an incremented ID will be used to identify it, and this is how it will work:

Couchbase offers an increment feature that we can use to create IDs internally within the cluster. To ensure redundancy, 3 replicas will be created within the cluster. Since these are in the same place, I think it should be safe to assume that, unless the whole cluster is down, one of the nodes responsible for this will be available; otherwise the number of replicas can be increased.
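A minimal sketch of Step 3, assuming a hypothetical CounterStore interface that stands in for the cluster's atomic increment. The interface, the class names and the key user_id_counter are illustrative only and are not part of the Couchbase SDK; the in-memory implementation exists purely so the sketch runs without a cluster:

```php
<?php
// Hypothetical abstraction over an atomic counter (NOT a Couchbase SDK type).
interface CounterStore
{
    // Atomically adds $delta to the counter stored at $key and returns the new value.
    public function increment(string $key, int $delta = 1): int;
}

// In-memory stand-in so the sketch is runnable on its own.
// A real implementation would delegate to the cluster's atomic increment.
final class InMemoryCounterStore implements CounterStore
{
    /** @var array<string,int> */
    private array $counters = [];

    public function increment(string $key, int $delta = 1): int
    {
        $this->counters[$key] = ($this->counters[$key] ?? 0) + $delta;
        return $this->counters[$key];
    }
}

$store  = new InMemoryCounterStore();
$first  = $store->increment('user_id_counter'); // first ID issued in this region
$second = $store->increment('user_id_counter'); // next ID issued in this region
```

Keeping the counter behind an interface like this would let the application nodes stay unaware of where the increment actually happens.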

Bringing it all together

Say a user is signing up from Europe: The application node serving the request will grab the region code (4 in this case), get its own ID (say 00005) and then get an incremented ID (1) from Couchbase (from the same cluster).

We end up with 3 components: 4, 00005 and 1. Now, to create an ID from this, we can just join these components into 4.00005.1. To make it even better (I'm not too sure about this), we can concatenate the components (not add them up) to end up with: 4000051.

In code, this will look something like this:

$id = '4'.'00005'.'1';

NB: Not $id = 4+00005+1; (that would add the numbers up instead of joining them).
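The concatenation above can be sketched as a small function. composeId is a hypothetical helper name, and the range checks simply mirror the limits described in Steps 1 and 2 (a one-digit region code, node IDs up to 99 999):

```php
<?php
// Hypothetical helper: joins region code, zero-padded node ID and counter into one ID string.
function composeId(int $region, int $node, int $counter): string
{
    if ($region < 1 || $region > 9) {
        throw new InvalidArgumentException('region must be a single digit (1-9)');
    }
    if ($node < 1 || $node > 99999) {
        throw new InvalidArgumentException('node must be between 1 and 99999');
    }
    if ($counter < 1) {
        throw new InvalidArgumentException('counter must be positive');
    }
    // Region (1 digit) and node (5 digits) have fixed widths, so the prefix is
    // always 6 characters and the rest of the string is the counter.
    return sprintf('%d%05d%d', $region, $node, $counter);
}

$id = composeId(4, 5, 1); // "4000051"
```

Passing the node ID as an integer and padding it only when formatting avoids writing literals such as 00005 in code, which PHP would read as octal.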

Pros

  • IDs look better than UUIDs;
  • They seem unique enough. Even if a node in another region generated the same incremented ID and has the same node ID as the one above, we always have the region code to set them apart;
  • They can still be stored as integers (probably unsigned big integers);
  • It's all part of the architecture, no added complexities.
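On the point about storing the IDs as integers, here is a rough check of how many counter digits fit. It assumes the worst case of region digit 9 and node 99999, and compares against the signed 64-bit maximum (2^63 - 1); the function names are made up for this sketch:

```php
<?php
// Largest ID the scheme could produce for a given counter width (every digit 9),
// built as a string so the check itself cannot overflow.
function maxIdForCounterDigits(int $counterDigits): string
{
    return '9' . '99999' . str_repeat('9', $counterDigits);
}

// True if the decimal string (no leading zeros) fits in a signed 64-bit integer.
function fitsSigned64(string $decimal): bool
{
    $max = '9223372036854775807'; // 2^63 - 1
    if (strlen($decimal) !== strlen($max)) {
        return strlen($decimal) < strlen($max);
    }
    // Same length: lexicographic order equals numeric order.
    return strcmp($decimal, $max) <= 0;
}

$twelveDigitsFit   = fitsSigned64(maxIdForCounterDigits(12)); // 18-digit ID
$thirteenDigitsFit = fitsSigned64(maxIdForCounterDigits(13)); // 19-digit ID
```

Under these assumptions a counter of up to 12 digits always fits in a signed 64-bit integer, while a 13-digit counter can overflow it in the worst case; an unsigned 64-bit column (maximum 18446744073709551615) would still hold any 19-digit ID.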

Cons

  • No sorting (or is there)?
  • This is where I need your input (most)

I know that every solution has flaws, and possibly more than what we see on the surface. Can you spot any issues with this whole approach?

Thank you in advance for your help :-)

EDIT

As @DaveRandom suggested, we can add a 4th step:

Step 4

We can just generate a random number and append it to the ID to prevent predictability. Effectively, you end up with something like this:

4000051357 instead of just 4000051.
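One thing to watch with Step 4, shown as a hedged sketch: if both the counter and the random part have variable width, two different combinations can produce the same string. Padding the random part to a fixed width is one possible way around that; appendRandomSuffix and the 3-digit width are assumptions made for illustration:

```php
<?php
// Variable-width suffix: two different (counter, random) pairs give the same ID.
$a = '4' . '00005' . '1'  . '357'; // counter 1,  random 357
$b = '4' . '00005' . '13' . '57';  // counter 13, random 57
$collides = ($a === $b);           // both strings are "4000051357"

// Hypothetical helper: appends a random part of fixed width, so the last
// $digits characters are always the random part and can be split off again.
function appendRandomSuffix(string $id, int $digits = 3): string
{
    $random = random_int(0, (10 ** $digits) - 1);
    return $id . str_pad((string) $random, $digits, '0', STR_PAD_LEFT);
}

$withSuffix = appendRandomSuffix('4000051'); // e.g. "4000051" followed by 3 random digits
```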

2 Answers

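I think this looks pretty solid. Each region maintains consistency, and if you use XDCR there are no collisions. INCR is atomic within a cluster, so you will have no issues there. You don't actually need the machine code part. If all the app servers within a region are connected to the same cluster, infixing the 00001 part is irrelevant. If it is useful to you for other reasons (some sort of analytics), then by all means keep it, but it isn't necessary.

So it can simply be '4' . '1' (using your example).

Can you give me an example of what kind of "sorting" you need?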

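First: one downside of adding entropy (and I am not sure why you would need it) is that you cannot iterate over the ID collection as easily.

For example: if your IDs run from 1 to 100, which you will know from a simple GET query on the counter key, you can assign tasks by group: one task takes 1-10, the next 11-20 and so on, and workers can execute in parallel. If you add entropy, you will need to use a Map/Reduce view to pull the collection down, so you lose the benefit of the key-value pattern.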
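The grouping described above could be sketched like this. batchRanges is a hypothetical helper, and the upper bound (100 here) stands in for the value read from the counter key:

```php
<?php
// Hypothetical helper: splits the ID range 1..$max into consecutive [start, end] batches.
function batchRanges(int $max, int $batchSize): array
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batchSize must be at least 1');
    }
    $batches = [];
    for ($start = 1; $start <= $max; $start += $batchSize) {
        // The last batch is clipped to $max when the range does not divide evenly.
        $batches[] = [$start, min($start + $batchSize - 1, $max)];
    }
    return $batches;
}

$batches = batchRanges(100, 10); // [1, 10], [11, 20], and so on up to [91, 100]
```

Each [start, end] pair can then be handed to a separate worker, which fetches its documents directly by key.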

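Second: since you are concerned with readability, it can also be valuable to add a document/object type identifier, which can be used in Map/Reduce views (or you can use a JSON key to identify the type).

Ex: 'u:' . '4' . '1'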
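As a small illustration of the type prefix, assuming 'u' stands for a user document and that ':' never appears in the type itself (documentKey and documentType are hypothetical names):

```php
<?php
// Hypothetical helper: builds a key such as "u:41" from a type prefix, region and counter.
function documentKey(string $type, int $region, int $counter): string
{
    return $type . ':' . $region . $counter;
}

// Hypothetical helper: recovers the type prefix from a key built by documentKey().
function documentType(string $key): string
{
    return explode(':', $key, 2)[0];
}

$key  = documentKey('u', 4, 1); // "u:41"
$type = documentType($key);     // "u"
```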

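If you are referring to the IDs externally, you might want to obscure them in other ways. If you need an example, let me know and I can append something you could do to my answer.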

@scalabl3

answered 2013-08-15T15:25:20.500

You are concerned about IDs for two reasons:

  1. Potential collisions in a complex network infrastructure
  2. Appearance

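Starting with the second issue, appearance. While a UUID certainly isn't a great beauty as identifiers go, there are diminishing returns when you introduce a truly unique number across a complex data center (or data centers), as you mention. I'm not convinced that the perception of an application changes dramatically when a long number is used instead of a UUID, for example in the URL of a web application. Ideally, neither would be shown, and the ID would only ever be sent via Ajax requests, etc. While a nice, clean, memorable URL is preferable, the lack of one has never stopped me from shopping at Amazon (where they have absolutely hideous URLs). :)

Even with your proposal, the identifiers, while shorter in number of characters than a UUID, are no more memorable than a UUID. So the appearance would likely remain debatable.

Talking about the first point: yes, there are a few cases where UUIDs have been known to generate conflicts. While that shouldn't happen in a properly configured and consistently implemented architecture, I can see how it might happen (but I'm personally a lot less concerned about it).

So, if you're talking about alternatives, I've become a fan of the simplicity of the MongoDB ObjectId and its techniques for avoiding duplication when generating an ID. The full documentation is here. The quick relevant pieces are similar to your potential design in several ways: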

ObjectId is a 12-byte BSON type, constructed using:

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 3-byte machine identifier,
  • a 2-byte process ID, and
  • a 3-byte counter, starting with a random value.

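The timestamp can often be useful for sorting. The machine identifier is similar to your application server having a unique ID. The process ID is just additional entropy, and finally, to prevent conflicts, there is a counter that is auto-incremented whenever the timestamp is the same as the last time an ObjectId was generated (so that ObjectIds can be created rapidly). ObjectIds can be generated on the client or on the database. Further, ObjectIds do take up fewer bytes than a UUID (but only 4 fewer). Of course, you could leave out the timestamp and drop another 4 bytes.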
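To make the layout concrete, here is a sketch that packs the four fields into 12 bytes and returns them as 24 hex characters. It only illustrates the byte layout listed above and is not MongoDB's own implementation; makeObjectIdLike and the sample field values are made up:

```php
<?php
// Hypothetical helper: builds a 12-byte ObjectId-style identifier as a hex string.
function makeObjectIdLike(int $timestamp, int $machineId, int $processId, int $counter): string
{
    // 4 bytes: seconds since the Unix epoch, big-endian.
    $bytes  = pack('N', $timestamp & 0xFFFFFFFF);
    // 3 bytes: machine identifier (low 3 bytes of a 4-byte big-endian value).
    $bytes .= substr(pack('N', $machineId & 0xFFFFFF), 1);
    // 2 bytes: process ID, big-endian.
    $bytes .= pack('n', $processId & 0xFFFF);
    // 3 bytes: counter (low 3 bytes of a 4-byte big-endian value).
    $bytes .= substr(pack('N', $counter & 0xFFFFFF), 1);

    return bin2hex($bytes);
}

$earlier = makeObjectIdLike(0x520CE5F0, 0xABCDEF, 0x1234, 1);
$later   = makeObjectIdLike(0x520CE5F1, 0xABCDEF, 0x1234, 1);
```

Because the timestamp occupies the leading bytes in big-endian order, comparing two such hex strings orders them by creation second, which is where the sorting property comes from.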

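For clarification, I'm not suggesting that you use MongoDB, only that you take inspiration from the technique they use for ID generation.

So, I think your solution is decent (and maybe you want to take inspiration from MongoDB's implementation of a unique ID) and doable. As to whether you need to do it, I think that's a question only you can answer.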

answered 2013-08-15T14:58:44.953