ruby-on-rails - MongoDB: removing redundant data, how to delete duplicate values of a unique index from a collection

Question

I have a collection that contains redundant data.

Sample data:

{
    unique_index : "1",
    other_field : "whatever1"
},
{
    unique_index : "2",
    other_field : "whatever2"
},
{
    unique_index : "1",
    other_field : "whatever1"
}

I ran this query (I had to use allowDiskUse: true because there is a lot of data):

db.collection.aggregate([
    {
        $group: { 
            _id: "$unique_index", 
            count: { $sum: 1 }
        } 
    }, 
    { $match: { count: { $gte: 2 } } }
], { allowDiskUse: true })

I get output like this (for example):

{ "_id" : "1", "count" : 2 }
.
.

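The problem now is that I want to keep only one document for each unique_index value and delete all the redundant ones. Note that there is a lot of data, more than 100,000 records. I am looking for a quick and easy solution, either in MongoDB or in RoR (since I am using Ruby on Rails). Any help would be appreciated.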

score 1 · Accepted Answer

If you don't care about _id, the simplest way is to select the distinct documents into a new collection and then rename it:

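db.collection.aggregate([
    // Collapse all documents sharing the same unique_index into one group,
    // keeping other_field from the first document seen in each group.
    {$group: {
        _id: "$unique_index",
        other_field: {$first: "$other_field"}
    }},
    // Restore the original field name and drop the group key from _id,
    // so the new documents get freshly generated _id values.
    {$project: {
        _id: 0,
        unique_index: "$_id",
        other_field: 1
    }},
    // Write the deduplicated documents to a separate collection.
    {$out: "new_collection"}
], {allowDiskUse: true});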

db.new_collection.renameCollection("collection", true);

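Keep in mind that you will need to recreate all indexes, because the second argument (dropTarget) set to true drops the existing collection together with its indexes. Also, renameCollection does not work on sharded collections.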
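
One possible way to handle the index step, as a rough sketch rather than a tested recipe: save the output of getIndexes() before the collection is replaced, then feed each saved definition back into createIndex() afterwards. The helper name toCreateIndexArgs is made up for this illustration, and the sketch assumes the saved definitions contain no options that your server version would reject on recreation.

```javascript
// Hypothetical helper: convert one entry of db.collection.getIndexes()
// into the [keys, options] arguments expected by createIndex().
// Returns null for the default _id index, which MongoDB creates on its own.
function toCreateIndexArgs(spec) {
    if (spec.name === "_id_") {
        return null;
    }
    const options = {};
    for (const field of Object.keys(spec)) {
        // "key" becomes the first argument; "v" and "ns" are internal metadata.
        if (field !== "key" && field !== "v" && field !== "ns") {
            options[field] = spec[field];
        }
    }
    return [spec.key, options];
}

// Runs only inside the mongo shell, where the global `db` exists.
if (typeof db !== "undefined") {
    // 1. Save the index definitions before the original collection is replaced.
    const saved = db.collection.getIndexes();

    // 2. Run the $group / $out pipeline and renameCollection here.

    // 3. Recreate the saved indexes on the renamed collection.
    saved.forEach(function (spec) {
        const args = toCreateIndexArgs(spec);
        if (args !== null) {
            db.collection.createIndex(args[0], args[1]);
        }
    });
}
```

Once the duplicates are gone, an index on unique_index can also be created with the unique: true option so that new duplicates are rejected. If a non-unique index with the same name was restored in step 3, it would have to be dropped first.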
