
We have a MongoDB cluster using GridFS. The fs.chunks collection of GridFS is sharded across two replica sets. Disk usage is very high: for 90 GB of data we need more than 130 GB of disk space.

It seems the fs.chunks collection is what takes the space. Summing the "length" field of fs.files gives the 90 GB; the sum of the "size" field across both shards is 130 GB. This is the real size of the payload data contained in the collection, right?

Does that mean there is 40 GB of overhead? Is this correct? Where does it come from - is it the BSON encoding? Is there a way to reduce it?

mongos> db.fs.chunks.stats()
{
    "sharded" : true,
    "ns" : "ub_datastore_preview.fs.chunks",
    "count" : 1012180,
    "numExtents" : 106,
    "size" : 140515231376,
    "storageSize" : 144448592944,
    "totalIndexSize" : 99869840,
    "indexSizes" : {
            "_id_" : 43103872,
            "files_id_1_n_1" : 56765968
    },
    "avgObjSize" : 138824.35078345748,
    "nindexes" : 2,
    "nchunks" : 2400,
    "shards" : {
            "ub_datastore_qa_group1" : {
                    "ns" : "ub_datastore_preview.fs.chunks",
                    "count" : 554087,
                    "size" : 69448405120,
                    "avgObjSize" : 125338.44887174758,
                    "storageSize" : 71364832800,
                    "numExtents" : 52,
                    "nindexes" : 2,
                    "lastExtentSize" : 2146426864,
                    "paddingFactor" : 1,
                    "systemFlags" : 1,
                    "userFlags" : 0,
                    "totalIndexSize" : 55269760,
                    "indexSizes" : {
                            "_id_" : 23808512,
                            "files_id_1_n_1" : 31461248
                    },
                    "ok" : 1
            },
            "ub_datastore_qa_group2" : {
                    "ns" : "ub_datastore_preview.fs.chunks",
                    "count" : 458093,
                    "size" : 71066826256,
                    "avgObjSize" : 155136.2414531547,
                    "storageSize" : 73083760144,
                    "numExtents" : 54,
                    "nindexes" : 2,
                    "lastExtentSize" : 2146426864,
                    "paddingFactor" : 1,
                    "systemFlags" : 1,
                    "userFlags" : 0,
                    "totalIndexSize" : 44600080,
                    "indexSizes" : {
                            "_id_" : 19295360,
                            "files_id_1_n_1" : 25304720
                    },
                    "ok" : 1
            }
    },
    "ok" : 1
}

3 Answers


This is the real size of the payload data contained in the collection, right?

Yes.

So it has 40 GB of overhead? Is this correct?

Sort of. But it seems abnormally large.

Where is it coming from? Is it the BSON encoding?

No, the BSON encoding of the data does not carry that much overhead. But metadata is sometimes added.

In MongoDB the main source of overhead is usually metadata, but if you follow the reference GridFS spec it should not be this large.

For example, in our storage we have:

db.fs.files.aggregate([{$group: {_id: null, total: { $sum: "$length"}}}])
{
    "result" : [
        {
            "_id" : null,
            "total" : NumberLong("4631125908060")
        }
    ],
    "ok" : 1
}

db.fs.chunks.stats()
{
    "ns" : "grid_fs.fs.chunks",
    "count" : 26538434,
    "size" : NumberLong("4980751887148"),
    "avgObjSize" : 187680.70064526037,
    "storageSize" : NumberLong("4981961457440"),
    "numExtents" : 2342,
    "nindexes" : 2,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 2405207504,
    "indexSizes" : {
        "_id_" : 1024109408,
        "files_id_1_n_1" : 1381098096
    },
    "ok" : 1
}

So the overhead is about 300 GB for 4.8 TB of data.
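As a quick sanity check, the two overhead ratios can be computed in plain JavaScript. The numbers below are copied from the question (90 GB payload vs ~130 GB in fs.chunks) and from the aggregate and stats output above:

```javascript
// Overhead ratio = (chunks size - payload length) / payload length.

// Asker's cluster (values in GB, from the question).
const askerPayload = 90;            // sum of fs.files "length"
const askerChunks = 130;            // sum of fs.chunks "size" on both shards
const askerOverhead = (askerChunks - askerPayload) / askerPayload;

// This answer's cluster (values in bytes, from the output above).
const ourPayload = 4631125908060;   // aggregate $sum of fs.files "length"
const ourChunks = 4980751887148;    // fs.chunks.stats() "size"
const ourOverhead = (ourChunks - ourPayload) / ourPayload;

console.log((askerOverhead * 100).toFixed(1) + "%");  // ≈ 44.4%
console.log((ourOverhead * 100).toFixed(1) + "%");    // ≈ 7.5%
```

So a healthy GridFS deployment here sits under 10% overhead, while the asker's cluster is at roughly 44% - which is why the extra space needs another explanation.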

Answered 2013-10-23T08:13:21.733

The problem was "orphaned chunks" left behind by GridFS. GridFS writes the chunks first and the metadata afterwards; if something goes wrong in between, the chunks that were already written remain as orphans and have to be cleaned up manually.
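The check for orphans is simple: a chunk whose files_id has no matching fs.files document is an orphan. A minimal sketch of that logic, using plain in-memory arrays as stand-ins for the two collections (the documents shown are hypothetical; on a real cluster you would iterate over the actual collections instead):

```javascript
// Stand-ins for fs.files and fs.chunks; on a live deployment these would
// be cursors over the real collections.
const files = [{ _id: "f1" }, { _id: "f2" }];
const chunks = [
  { files_id: "f1", n: 0 },
  { files_id: "f2", n: 0 },
  { files_id: "f3", n: 0 },  // no matching fs.files document -> orphan
];

// Build the set of known file ids, then flag chunks that point elsewhere.
const knownIds = new Set(files.map(f => f._id));
const orphans = chunks.filter(c => !knownIds.has(c.files_id));

console.log(orphans.length);  // 1
```

On a large fs.chunks collection you would want to do this per files_id rather than per chunk, and drive the deletes in batches, but the membership test above is the core of it.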

Answered 2013-11-22T11:41:50.583

You store 90 GB of data, but it consumes 130 GB of disk space.

That works out to roughly 44% overhead.

As described in this article, GridFS has a storage overhead of about 45%, which is almost exactly what you are seeing.

Answered 2013-11-27T23:49:50.693