node.js - URL representation

Question

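I'd like to know how to efficiently store website URLs in a database (MongoDB in my case)...

The problem: the field should be indexed for fast query performance, but mongo only allows indexing fields smaller than 1024 bytes.

I've thought about hashing or base64 to shrink the URL... but since I'm using a single-threaded web server (node.js), I don't want to do heavy work on it...

Are there any good ideas for other ways to achieve this (the alternative representation should be unique...)?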

score 4 · Accepted Answer

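This question came up during 10gen's MongoDB training, and hashing was proposed as a viable solution. Generating an MD5 hash of a URL should not be computationally intensive. I definitely would not recommend base64 encoding, since that only expands the URL string.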
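As a rough illustration of how cheap this is in node.js, here is a minimal sketch using the built-in `crypto` module. The helper name `urlHash` and the sample URL are made up for this example; the point is only that the digest has a fixed length no matter how long the URL is.

```javascript
const crypto = require('crypto');

// Hypothetical helper (not from the question): returns the MD5 digest
// of a URL as a 32-character hex string.
function urlHash(url) {
  return crypto.createHash('md5').update(url, 'utf8').digest('hex');
}

// A made-up URL well over 1024 bytes; its hash is still 32 characters.
const longUrl = 'http://example.com/' + 'a'.repeat(2000);
console.log(urlHash(longUrl).length); // 32
```

The hex digest is what you would store in the indexed `hash` field next to the full `url`.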

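The goal is to create an index with high cardinality, but that does not mean the hash values have to be unique. If you include both the hash and the URL in your query, you take advantage of the highly selective hash index, and MongoDB then matches the URL among the index hits. In the following example, we assume the two URLs have a hash collision: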

$ mongo --quiet
> db.urls.insert({_id: 1, url: "http://google.com", hash: "c7b920f"});
> db.urls.insert({_id: 2, url: "http://yahoo.com", hash: "c7b920f"});
> db.urls.find({hash: "c7b920f"})
{ "_id" : 1, "url" : "http://google.com", "hash" : "c7b920f" }
{ "_id" : 2, "url" : "http://yahoo.com", "hash" : "c7b920f" }

> db.urls.find({hash: "c7b920f", url: "http://google.com"})
{ "_id" : 1, "url" : "http://google.com", "hash" : "c7b920f" }

> db.urls.ensureIndex({hash: 1})
> db.urls.find({hash: "c7b920f", url: "http://google.com"}).explain()
{
    "cursor" : "BtreeCursor hash_1",
    "nscanned" : 2,
    "nscannedObjects" : 2,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "hash" : [
            [
                "c7b920f",
                "c7b920f"
            ]
        ]
    },
    "server" : "localhost:27017"
}

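I'm not sure whether you have an additional business requirement to guarantee URL uniqueness across the whole collection, but the example above just shows that it is not necessary from a query perspective. Of course, any hashing algorithm has some chance of collisions, but you have better options than MD5 that still fit within the 1024-byte limit.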
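If you want a lower collision probability, one possible choice (my assumption, the training did not name a specific algorithm) is SHA-256, whose 64-character hex digest is still far below the limit. The sketch below uses hypothetical helpers `sha256Hex`, `toDocument` and `toQuery` to build the stored document and the matching filter, so that the hash and the URL always travel together:

```javascript
const crypto = require('crypto');

// Hypothetical helper: SHA-256 digest of a URL as a 64-character hex string.
function sha256Hex(url) {
  return crypto.createHash('sha256').update(url, 'utf8').digest('hex');
}

// Shape of the document to insert: full URL plus its short, indexable hash.
function toDocument(url) {
  return { url: url, hash: sha256Hex(url) };
}

// Filter for find(): the indexed hash narrows the candidates, and the
// url field rules out any colliding entries.
function toQuery(url) {
  return { hash: sha256Hex(url), url: url };
}

console.log(toQuery('http://google.com'));
```

The objects returned here are plain filters and documents; you would pass them to whatever MongoDB driver call you already use for inserts and finds.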
