git - Git: How are copies of a file with a shared history handled?

Question

I back up my CSS user styles to a git repository, like this:

❯ fd                                                                                            
stylus-2021-05-18.json
stylus-2021-05-20.json

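These backup files are obviously mostly identical; that is, stylus-2021-05-18.json is an earlier version of stylus-2021-05-20.json. How does git handle this?

Obviously, I could rename the file to stylus.json and let git handle the versioning entirely, but I would like to know whether git is smart enough to handle these files automatically.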

score 3 · Accepted Answer

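TL;DR

Commits are always created as full file snapshots, but garbage collection creates packfiles, which store similar blobs efficiently using delta compression, whether or not they come from the same file.

Introduction

My understanding that Git stores "diffs" rather than full files was entirely wrong. After some reading and some experimenting, I found that it doesn't matter whether you modify a file or create a modified copy of it: when you commit the change or the new file, Git writes a complete new blob for the new content each time.

However, this is very inefficient, since you end up with many copies of nearly the same text, with only small differences between the blobs. That problem is solved when Git creates packs. I don't fully understand how Git searches for things to pack together, but inside a pack it stores some blobs whole and others as differences (deltas) against other blobs.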

Experiment

# create a big file and commit it
seq 1 1000000 | shuf > bigfile
git add bigfile
git commit -m'bigfile'

At this point, find .git -ls shows me one big blob (3.5MB) storing this 6.9MB file.

# modify the big file and commit the change
echo change >> bigfile
git commit -m'modify bigfile' bigfile

At this point, find .git -ls shows me two big blobs, each about 3.5MB. Seems pretty inefficient to me, but read on...

# Add another big file, similar to the first one, and commit it
cp bigfile bigfile2
echo some trivial change >> bigfile2
git add bigfile2
git commit -m'bigfile2'

Things don't get better: find .git -ls shows me three big blobs, each about 3.5MB!

Now, at some point when you push, Git might pack your sandbox, but we can force that to happen right now: run git gc. That's not just garbage collection, as I incorrectly thought; it also creates packfiles. After running git gc, find .git -ls now reports a single pack of about 3.2MB. So my three big blobs were identified as similar, compressed better, and stored efficiently. Git's documentation calls this "delta compression".
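If you want to check the packing yourself rather than eyeballing find output, here is a minimal sketch of the same experiment in a throwaway repository. The temporary directory, the smaller file size, and the placeholder identity are my own assumptions for illustration; git count-objects and git verify-pack are standard Git commands.

```shell
# Throwaway repository (location and identity are placeholders)
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Two commits of a large, slightly different file, as in the experiment above
seq 1 200000 | shuf > bigfile
git add bigfile
git commit -q -m 'bigfile'
echo change >> bigfile
git commit -q -m 'modify bigfile' bigfile

# Before packing: "count" is the number of loose objects
git count-objects -v

# Force packing now
git gc -q

# After packing: loose objects have moved into the pack ("in-pack")
git count-objects -v

# List the pack entries; an entry stored as a delta has two extra
# columns at the end of its line: the delta depth and the base object
git verify-pack -v .git/objects/pack/pack-*.idx
```

In the verify-pack listing, I would expect one of the two big blobs to appear with a small packed size and a base object, but which one Git chooses as the base is up to its heuristics.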

References

Online posts I just read to answer this question:

Commits and snapshots, not diffs, by Derrick Stolee (link found in @Joachim Sauer's answer)
Git Internals - How Git works, by Kaushik Rangadurai

score 0 · Accepted Answer

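From a purely technical point of view it is easy: if two files in the git history ever had exactly the same (byte-for-byte) content, they reference the same blob object^* and the actual content is stored only once. So if your current version of fileA is identical to the version of fileB from 2 commits ago, that content is still stored only once in the .git subdirectory. This works whether the files have different names, are in the same commit or in different ones, or live at different paths: as long as the content is identical, the blob is reused.

On the other hand: if this happens too often, it suggests you are using version control in a way it isn't really meant to be used. A given commit should not contain any "historical data" or "archive"; that is what other commits/tags/branches are for. The HEAD of any given branch should contain exactly (and only) what is currently relevant to that branch. But this part is not technically required: it is just a convention about how git is normally used.

^* Note that this reuse goes up to the directory level: if two directories contain identical subdirectories and files, they reference the same tree object. This makes storing "very similar" commits very efficient: effectively only the differences have to be stored in addition. Note that commits are still snapshots, not diffs.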
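A quick way to see this for yourself is git hash-object, which prints the object ID Git would assign to a file's content. This is only a sketch: the file names and CSS content are made up for illustration.

```shell
# Work in a scratch directory (no repository needed for hash-object)
tmp=$(mktemp -d)
cd "$tmp"

# Two files with different names but byte-for-byte identical content
printf 'body { color: red; }\n' > fileA
cp fileA fileB

id_a=$(git hash-object fileA)
id_b=$(git hash-object fileB)
echo "fileA: $id_a"
echo "fileB: $id_b"   # same ID as fileA: the name plays no role

# Any change to the content yields a different blob ID
printf 'a { color: blue; }\n' >> fileB
id_c=$(git hash-object fileB)
echo "fileB after edit: $id_c"
```

The ID depends only on the content, so two paths with the same bytes end up pointing at one stored blob.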
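As a rough sketch of that convention applied to the question: keep one stylus.json and let commits and tags carry the dates. The JSON content and the tag names here are invented for illustration, not taken from the asker's repository.

```shell
# Throwaway repository (location and identity are placeholders)
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# First backup: commit under a fixed name and tag it with the date
printf '{"version": 1}\n' > stylus.json
git add stylus.json
git commit -q -m 'stylus backup 2021-05-18'
git tag backup-2021-05-18

# Second backup: overwrite the same file and commit again
printf '{"version": 2}\n' > stylus.json
git commit -q -m 'stylus backup 2021-05-20' stylus.json
git tag backup-2021-05-20

# Any earlier backup can still be read back by tag
git show backup-2021-05-18:stylus.json

# The file's history replaces the dated file names
git log --oneline -- stylus.json
```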
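The directory-level reuse can be checked the same way. In this sketch, the directory and file names are hypothetical; git rev-parse with a commit:path argument prints the ID of the tree object for that directory.

```shell
# Throwaway repository (location and identity are placeholders)
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# One directory with a file and a subdirectory, then an exact copy of it
mkdir -p themes-a/sub
printf 'body { color: red; }\n' > themes-a/site.css
printf 'a { color: blue; }\n' > themes-a/sub/links.css
cp -R themes-a themes-b

git add .
git commit -q -m 'two identical directories'

# Both directories resolve to the same tree object
tree_a=$(git rev-parse HEAD:themes-a)
tree_b=$(git rev-parse HEAD:themes-b)
echo "themes-a: $tree_a"
echo "themes-b: $tree_b"
```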
