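TL;DR
Commits are always created as full file snapshots, but garbage collection creates packs, which use delta compression to store similar blobs efficiently, whether or not they come from the same file.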
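Introduction
My understanding that Git stores "diffs" rather than full files was all wrong. After some reading and some experimenting, I found that it doesn't matter whether you modify a file or create a copy of one: when you commit the change or the new file, Git creates a brand-new blob every time.
That is very inefficient, though, because you end up with many separate copies of nearly the same text, with only small differences between the blobs. The problem is solved when Git creates a pack. I don't fully understand how Git searches for things to pack together, but inside a pack some blobs are stored whole and others are stored as deltas against other blobs.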
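The snapshot behaviour can be checked directly by asking Git for the blob behind a file at each commit. This is only a minimal sketch: the scratch directory, the file name `numbers`, and the identity settings are assumptions made for illustration, not part of my original experiment.

```shell
# Scratch repository in a temporary directory (hypothetical names throughout)
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# First version of the file
seq 1 1000 > numbers
git add numbers
git commit -q -m 'numbers'
first=$(git rev-parse HEAD:numbers)    # blob id of the first version

# Second version: one extra line appended
echo change >> numbers
git commit -q -m 'modify numbers' numbers
second=$(git rev-parse HEAD:numbers)   # blob id of the second version

# Two different ids, and each one is a complete blob with its own full size
echo "$first  $(git cat-file -t "$first")  $(git cat-file -s "$first") bytes"
echo "$second  $(git cat-file -t "$second")  $(git cat-file -s "$second") bytes"
```

The two ids differ, and `git cat-file -s` reports the full content size for both, so the second commit did not record just the appended line.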
Experiment
# create a big file and commit it
seq 1 1000000 | shuf > bigfile
git add bigfile
git commit -m'bigfile'
At this point, find .git -ls shows me one big blob (3.5MB) storing this 6.9MB file.
# modify the big file and commit the change
echo change >> bigfile
git commit -m'modify bigfile' bigfile
At this point, find .git -ls shows me two big blobs, each about 3.5MB. Seems pretty inefficient to me, but read on...
# Add another big file, similar to the first one, and commit it
cp bigfile bigfile2
echo some trivial change >> bigfile2
git add bigfile2
git commit -m'bigfile2'
Things don't get better: find .git -ls shows me three big blobs, each about 3.5MB!
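One nuance worth spelling out: a blob is identified by its content, not by its file name, so the third blob appears only because bigfile2 got an extra line. A small sketch with `git hash-object` illustrates this; the files `a`, `b`, and `c` and the scratch directory are hypothetical stand-ins, not the files from the experiment.

```shell
# Scratch repository in a temporary directory (hypothetical names)
tmp=$(mktemp -d)
cd "$tmp"
git init -q .

seq 1 1000 > a
cp a b                            # exact copy under a different name
cp a c
echo some trivial change >> c     # near-copy with one extra line

# hash-object prints the blob id Git would use for each file's content
id_a=$(git hash-object a)
id_b=$(git hash-object b)
id_c=$(git hash-object c)

echo "a: $id_a"
echo "b: $id_b"   # same id as a: an unmodified copy would reuse the blob
echo "c: $id_c"   # different id: a tiny change means a whole new blob
```

So an untouched copy would cost nothing extra, while any change, however small, yields a new full blob until the objects are packed.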
Now, at some point (for example when you push) Git may pack your repository, but we can force that to happen right now: run git gc. That's not just garbage collection, as I incorrectly thought; it also creates packs. After running git gc, find .git -ls now reports a single pack of about 3.2MB. So my three big blobs were identified as similar, compressed better, and stored efficiently. This is called "delta compression".
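To see which objects ended up as deltas, the pack index can be listed with `git verify-pack -v`. The sketch below repeats the experiment at a smaller scale in a scratch repository; the temporary directory, the identity settings, and the reduced file size are assumptions for illustration. Whether a given blob is stored whole or as a delta is up to Git's packing heuristics, so the exact listing may vary.

```shell
# Scratch repository in a temporary directory (hypothetical, smaller than above)
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Two commits with nearly identical versions of the same file
seq 1 100000 | shuf > bigfile
git add bigfile
git commit -q -m 'bigfile'
echo change >> bigfile
git commit -q -m 'modify bigfile' bigfile

# Before packing: everything is a loose object
git count-objects -v

# Force packing
git gc -q

# After packing: one line per object in the pack. Entries stored as deltas
# have two extra columns at the end: the delta depth and the base object id.
git verify-pack -v .git/objects/pack/pack-*.idx
```

In the listing, a blob line with seven columns is stored as a delta against the object named in its last column, and its size-in-pack column should be tiny compared with the blob that is stored whole.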
References
Online posts I just read to answer this question: