git - Git Repository Only Gets Bigger After Using BFG

Question

We are currently migrating our SVN repo to Git (hosted at Bitbucket). I used SubGit to import all our branches and history into a bare repo I have locally on my (Windows) PC.

The repo is quite big (7.42 GB after the import). This is because it also contains SVN information such as revision numbers, which allows a two-way sync between Git and SVN (I'm only interested in a one-way sync from SVN to Git).

I created a local clone of the imported bare repo and pushed all the branches to Bitbucket. After a couple of hours (!) the repo was fully uploaded. Bitbucket then gave me warnings about the repo size. I checked the size and it was 1.1 GB. That's not as big as the imported bare repo, but still too big for a fast repository.

After playing around with BFG I managed to remove some large DLL/SQL export files using these commands on the bare repo (I only use the clone for pushing without all the SVN-related refs):

java -jar bfg.jar --delete-files '{''specialized 2015''','''specialized,''insert-pcreeks''}.sql' --no-blob-protection

java -jar bfg.jar --delete-files 'Incara.*.dll' --no-blob-protection Incara.git

git reflog expire --expire=now --all && git gc --prune=now --aggressive

This took a while, and afterwards the git_find_big.sh script no longer showed these large SQL files. But after pushing things back to Bitbucket (as a new repo, not as a force push) it only got bigger (1.8 GB).

Can you provide a possible explanation for this behavior?

I don't know if it matters, but we used a non-standard branch/tag model in SVN. This resulted in branches like: /refs/heads/archive/some/path/to/branch. These branches seemed to work just fine, and removing them did not affect the size either.

Besides these problems, I noticed some XML files showing up in the git_find_big.sh output:

size,pack,SHA,location
12180,1011,56731c772febd7db11de5a66674fe6a1a9ec00a7 repository/frontend.xml
12074,1002,0cefaee608c06621adfa4a9120ed7ef651076c33 repository/frontend.xml
12073,1002,a1c36cf49ec736a7fc069dcc834b784ada4b6a06 repository/frontend.xml
12073,1002,1ba5bd92817347739d3fba375fc42641016a5c1d repository/frontend.xml
12073,1002,e9182762bfc5849bc6645fdd6358265c3930779f repository/frontend.xml
12073,1002,dff5733d67cb0306534ac41a4c55b3bbaa436a2e repository/frontend.xml
12072,1002,8ee628f645ce53d970c3cf9fdae8d2697224e64c repository/frontend.xml
12072,1002,1266dee72b33f7a05ca67488c485ea8afc323615 repository/frontend.xml

These files contain the frontend logic of the web platform we are using and are indeed quite big. But they should be treated as text, right? So I don't understand why they show up as separate objects in the above output. Am I right that this should not be happening?

The SVN import also resulted in some empty commits (for example when SVN creates or moves a branch it needs a new commit). I guess these can only be removed using filter-branch?

Sorry, I have a lot of questions! Could someone help me with this?

Thanks,

Piet

score 2 · Accepted Answer

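I've asked for more diagnostic information in a comment on your question, which is needed to give a reasonable answer to the main part. As for your secondary questions (incidentally, Stack Overflow encourages you to ask those separately!), here are some pointers: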

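Besides these problems, I noticed some XML files showing up in the git_find_big.sh output: [snip]

These files contain the frontend logic of the web platform we are using and are indeed quite big. But they should be treated as text, right? So I don't understand why they show up as separate objects in the above output. Am I right that this should not be happening?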

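Git assigns ids based on the content of a file (a SHA hash) and, in that respect, does not care whether your file is text or not. If two files differ even slightly, they get different ids and are stored as separate objects (Git may apply delta compression behind the scenes, but that does not stop the files from being logically separate objects). It is therefore no surprise that different versions of the same file show up multiple times in the git_find_big.sh output.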
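To make the point about content-based ids concrete, here is a minimal sketch that reproduces the id computation for a blob by hand. It assumes a repository using the default SHA-1 object format and that `sha1sum` (GNU coreutils) is available; `git_blob_id` is a hypothetical helper name and the two XML snippets are made-up stand-ins for two revisions of frontend.xml.

```shell
# Hypothetical helper: print the id Git would give a blob with this content.
# Git hashes the header "blob <size>\0" followed by the raw bytes (SHA-1 format assumed).
git_blob_id() {
    size=$(printf '%s' "$1" | wc -c | tr -d ' ')
    { printf 'blob %s\0' "$size"; printf '%s' "$1"; } | sha1sum | cut -d' ' -f1
}

# Two made-up revisions that differ by a single character.
id_a=$(git_blob_id '<frontend version="1"/>')
id_b=$(git_blob_id '<frontend version="2"/>')

echo "revision A: $id_a"
echo "revision B: $id_b"
```

The two printed ids differ even though the contents are nearly identical, which is why each revision of a large text file is listed as its own object. Inside a real repository, `git hash-object <file>` reports the same kind of id without the manual steps.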

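The SVN import also resulted in some empty commits (for example when SVN creates or moves a branch it needs a new commit). I guess these can only be removed using filter-branch?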

Yes, the BFG doesn't do that out of the box. However, it is a task that filter-branch can do reasonably quickly (even if it is a hassle to use).
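As a rough sketch of what that could look like, the following builds a throwaway repository containing one real commit and one empty commit, then drops the empty one with the `--prune-empty` option of `git filter-branch`. The repository, file name, and commit messages are made up for the demonstration; on a real import you would run only the `filter-branch` line, and only on a backup copy, since it rewrites every ref.

```shell
# Throwaway demo repository (hypothetical contents).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

echo one > file.txt
git add file.txt
git commit -q -m "real change"
# Stand-in for the empty commits an SVN branch creation or move leaves behind.
git commit -q --allow-empty -m "create branch (no file changes)"

before=$(git rev-list --count HEAD)

# Rewrite all refs, dropping commits that leave the tree unchanged.
# The environment variable only silences the warning newer Git versions print.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --prune-empty -- --all >/dev/null 2>&1

after=$(git rev-list --count HEAD)
echo "commits before: $before, after: $after"
```

In this demo the count drops from 2 to 1. After a rewrite like this, the same `git reflog expire` and `git gc` steps from the question are still needed before the old commits actually disappear from the object store.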

score 1 · Accepted Answer

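The pack size increase (after running BFG) resurfaced for me, and it ultimately turned out to be caused by a packing issue in Git versions from around the 2.18 era. A colleague using 2.19 did not run into the problem, and I was able to find a note about the bug fix in 2.19.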
