4

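I'm looking for a tool that can help me analyze the disk space requirements of different files in a repository.

My repository contains some larger binary files, each with several revisions.

So, for example, I'm interested in how much space all the revisions of a single binary file take up in the repository. AFAIK this information can't easily be obtained with the "list" command, because I don't know how efficiently svn's deltification works.

Or which files/folders use the most disk space (not only in the HEAD revision, but across all revisions)?

Any ideas?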

2 Answers

5

How much storage a node uses in Subversion is not as straightforward as it may seem. I'm going to talk about FSFS (and provide a hack of an answer for FSFS only) since that's almost certainly the filesystem implementation you're using. If you're using BDB things are a little different.

A node can use storage in four ways: the actual text (body) of the node; its properties; the entry in the parent directory node that records its existence (directory nodes have a body consisting of a dictionary of their children and each child's representation); and the overhead of the file system (when you commit a change to a file, new representations of its parent directories bubble up all the way to the root, so in my opinion that storage should be attributed to the file that caused it).

The space taken by the file text and properties is relatively easy to work out; the directory storage and the overhead are much harder. Yet even for the relatively easy question of the file text, representation sharing makes things slightly complicated. Representation sharing happens when two files are identical (they may or may not have the same name; all that matters is that their text is the same): we avoid storing the text again.

The following one-liner should answer the file text question for a single file.

REPO=~/my-repo; FILE=/somebigfile; grep --recursive --no-filename --text --before-context 3 "cpath: $FILE" "$REPO/db/revs/"* | grep 'text:' | cut -d' ' -f 1-7 | sort -u | awk '{ DISK+=$4; if ($5 == 0) { FULL += $4 } else { FULL += $5 } } END { print DISK, FULL, FULL-DISK}'

You'll need to change REPO to be set to the path to your repository and FILE to be the absolute path inside the repository to the file you want. This may not work perfectly since I may have forgotten some detail or another. But let me walk through how this works.

It greps every revision file for the file you're looking for, asking for the preceding 3 lines as well as the matching line. Then it removes everything except the lines with text: on them (the lines detailing the text representation). We then drop the last field (the uniquifier, which is used to distinguish between shared representations). This lets us limit the list to the unique representations we actually stored. We then sum the 4th and 5th fields, counting the leading text: label as field 1 (these are the representation size and the full text size, respectively). The full text size can be zero, which means it's the same as the representation size (we stored the full text, not a delta). Finally we print the following values: the size we actually stored, the size of all versions of the file in full text, and the difference (a negative number means we were less efficient than storing plain text; a positive number means we saved that much space).
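To make the summation step concrete, here is a minimal sketch that runs only the last three stages of the pipeline (cut, sort -u, awk) on made-up text: lines. The revision numbers, sizes, hash placeholders, and uniquifiers are invented for illustration, and sum_text_reps is a hypothetical helper name, not something Subversion provides.

```shell
# sum_text_reps (hypothetical helper): read "text:" lines on stdin,
# drop the uniquifier (field 8), de-duplicate shared representations,
# and print three numbers: DISK FULL SAVED.
sum_text_reps() {
  cut -d' ' -f 1-7 | sort -u | awk '
    { disk += $4; full += ($5 == 0 ? $4 : $5) }
    END { print disk + 0, full + 0, full - disk }'
}

# Invented sample data: rev 10 stores the full text (full-text size 0),
# rev 12 stores a delta, and the third line shares rev 12's representation.
sample='text: 10 0 1000 0 md5a sha1a 10-a/_3
text: 12 50 200 1100 md5b sha1b 12-c/_4
text: 12 50 200 1100 md5b sha1b 15-f/_2'

printf '%s\n' "$sample" | sum_text_reps   # prints: 1200 2100 900
```

Here the shared representation is counted once, so 1200 bytes are on disk for 2100 bytes of full text, a saving of 900.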

The fields of the text data are as follows:

revision offset_in_rev_file size_of_rep size_of_full_text md5 sha1 uniquifier

Older repositories may not have all of these fields; that's fine.

Because I'm depending on the text field being within 3 lines of the cpath field in the rev file (hey, this is a quick hack), it may not work perfectly. You may want to run the first two grep commands without the rest and then look at the revisions reported (they'll be the first set of numbers from the left). Compare that with the output of svn log for the file. If all the revs are there, then the result should be accurate.
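The comparison against svn log can be scripted as well. The following is only a sketch: the three helper names are hypothetical, the sample input is invented, and it assumes the usual svn log -q layout in which each entry line starts with "r<number> |". A revision it reports as missing is a hint to look closer, not proof that the total is wrong.

```shell
# Hypothetical helpers; none of these names come from Subversion itself.

# Revision numbers (2nd field) of the "text:" lines found in the rev files.
revs_from_text_lines() {
  awk '$1 == "text:" { print $2 }' | sort -un
}

# Revision numbers from `svn log -q`, whose entries start with "r<number> |".
revs_from_svn_log() {
  sed -n 's/^r\([0-9][0-9]*\) |.*/\1/p' | sort -un
}

# missing_revs FOUND LOGGED: print logged revisions absent from FOUND.
# Both arguments are newline-separated lists of revision numbers.
missing_revs() {
  printf '%s\n' "$2" | grep -vxF -e "$1" || :
}

# Invented sample input standing in for the real command output.
grep_sample='text: 10 0 1000 0 md5a sha1a 10-a/_3
text: 12 50 200 1100 md5b sha1b 12-c/_4'
log_sample='------------------------------------------------------------------------
r14 | alice | 2013-02-20 10:00:00 +0000 (Wed, 20 Feb 2013)
------------------------------------------------------------------------
r12 | alice | 2013-02-19 10:00:00 +0000 (Tue, 19 Feb 2013)
------------------------------------------------------------------------
r10 | alice | 2013-02-18 10:00:00 +0000 (Mon, 18 Feb 2013)
------------------------------------------------------------------------'

found=$(printf '%s\n' "$grep_sample" | revs_from_text_lines)
logged=$(printf '%s\n' "$log_sample" | revs_from_svn_log)
missing_revs "$found" "$logged"   # prints: 14
```

Against a real repository you would feed revs_from_text_lines from the first two grep commands and revs_from_svn_log from svn log -q on the file.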

If I find the time, I'll try to write up a utility that does this the right way (using the SVN libraries) and that is more useful. It will probably include the storage used by properties and maybe some of the other storage I mentioned above.

TL;DR It's not an easy question to answer. Use the shell script above to answer the storage of a file text. It'll give you output that is the space we used on disk, the space of the full text of all revisions, and then how much we saved (negative means we lost space due to delta overhead).

answered 2013-02-21T01:04:09.170
1

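You could dump the repository, filter out the old revisions of binaries you don't need, and then load the dump back into a repository of the same name.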
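A rough sketch of that dump, filter, and load cycle, under some assumptions: filter_repo is a hypothetical wrapper name, and it uses svndumpfilter exclude, which removes whole paths from history rather than selected revisions of a file, so it only approximates "filter out old versions". It can also fail when a kept path was copied from an excluded one. Try it on a copy of the repository first.

```shell
# filter_repo SRC_REPO NEW_REPO PATH_PREFIX... (hypothetical wrapper)
# Dumps SRC_REPO, drops every node under the given path prefixes,
# and loads the result into a freshly created NEW_REPO.
filter_repo() {
  if [ "$#" -lt 3 ]; then
    echo "usage: filter_repo SRC_REPO NEW_REPO PATH_PREFIX..." >&2
    return 2
  fi
  src=$1
  dest=$2
  shift 2
  svnadmin create "$dest" || return 1
  svnadmin dump --quiet "$src" |
    svndumpfilter exclude "$@" |
    svnadmin load --quiet "$dest"
}

# Example call (paths are placeholders, not taken from the question):
#   filter_repo ~/my-repo ~/my-repo-filtered /somebigfile
# Once the new repository checks out correctly, move it into the old
# repository's place so that it keeps the same name.
```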

What does your tooling/build look like?

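Another thing to keep in mind: if you ever migrate to git or hg, every clone will pull down the entire history of those binaries, so disk space becomes a problem on the client side too.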

answered 2013-02-20T14:40:13.003