Thanks to the other answerers for their insights.
From other questions around the web, it sounds like NTFS can handle the sizes, but Windows Explorer and network operations might choke at much lower thresholds. I ran a simulation of a very even random distribution, similar to what SHA-1 would produce for a random set of 1,000,000 "files".
Windows Explorer definitely did not like a directory width of 4 hex digits, because that quickly approaches the maximum for that level (65,536 folders). I adjusted the first two directory lengths to 3 hex digits each (4,096 maximum) and put the remaining 34 digits at the third level, to try to balance depth against the probability of too many directories per level. This seems to let Windows Explorer handle browsing the structure.
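In other words, each 40-character hex hash is split 3 + 3 + 34. A minimal sketch of that mapping (GetBucketPath is just an illustrative name, not something from the simulation below):

static string GetBucketPath(string root, string hexSha1)
{
    // split a 40-character hex SHA-1 as 3 + 3 + 34
    return Path.Combine(
        root,
        hexSha1.Substring(0, 3),   // first level:  up to 4096 folders
        hexSha1.Substring(3, 3),   // second level: up to 4096 folders
        hexSha1.Substring(6));     // third level:  remaining 34 hex digits
}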
Here is my simulation:
const string Root = @"C:\_Sha1Buckets";
using (TextWriter writer = File.CreateText(@"C:\_Sha1Buckets.txt"))
{
    // simulate a very even distribution like SHA-1 would produce
    RandomNumberGenerator rand = RandomNumberGenerator.Create();
    byte[] sha1 = new byte[20];
    Stopwatch watch = Stopwatch.StartNew();

    for (int i = 0; i < 1000000; i++)
    {
        // populate bytes with a fake SHA-1
        rand.GetBytes(sha1);

        // format bytes into hex string
        string hash = FormatBytes(sha1);

        // C:\_Sha1Buckets
        StringBuilder builder = new StringBuilder(Root, 60);

        // \012\345\6789abcdef0123456789abcdef01234567\
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 0, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 3, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 6, 34);
        builder.Append(Path.DirectorySeparatorChar);

        Directory.CreateDirectory(builder.ToString());

        if (i % 5000 == 0)
        {
            // write out timings every five thousand directories to see if anything changes
            writer.WriteLine("{0}: {1}", i, watch.Elapsed);
            Console.WriteLine("{0}: {1}", i, watch.Elapsed);
            watch.Reset();
            watch.Start();
        }
    }

    watch.Reset();
    Console.WriteLine("Press Enter to delete the directory structure...");
    Console.ReadLine();
    watch.Start();
    Directory.Delete(Root, true);
    writer.WriteLine("Delete took {0}", watch.Elapsed);
    Console.WriteLine("Delete took {0}", watch.Elapsed);
}
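FormatBytes isn't shown above; it's just a lower-case hex formatter, something along these lines (not necessarily the exact code I ran):

static string FormatBytes(byte[] bytes)
{
    // format each byte as two lower-case hex digits
    StringBuilder builder = new StringBuilder(bytes.Length * 2);
    foreach (byte b in bytes)
    {
        builder.Append(b.ToString("x2"));
    }
    return builder.ToString();
}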
After the first 50 thousand or so, the simulation appears to slow down (15-20 seconds per 5,000), but it stays at that rate. The final delete took over 30 minutes on my machine!
For 1 million hashes, the distribution works out to roughly:
- 4,096 folders at the first level
- 250 folders at the second level, on average
- 1 folder at the third level, on average
That is very manageable within Windows Explorer and doesn't seem to get too deep or wide. Obviously if the distribution weren't this even, then we could run into problems, but only at the third level. The first two levels are bounded at 4096. I suppose if the target set were larger, we could add an additional level and gain a lot of growth potential. For my application 1 million is a very reasonable upper bound.
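As a rough sanity check on those averages (back-of-envelope math, not part of the simulation above): with B possible names at a level and n hashes landing in it, the expected number of distinct folders is B * (1 - (1 - 1/B)^n).

// Back-of-envelope estimate of the folder counts for 1,000,000 hashes.
// ExpectedDistinct is just an illustrative helper, not part of the simulation.
using System;

class BucketEstimate
{
    static double ExpectedDistinct(double buckets, double items)
    {
        // balls-in-bins: expected number of non-empty buckets
        return buckets * (1 - Math.Pow(1 - 1 / buckets, items));
    }

    static void Main()
    {
        double total = 1000000;
        double level1 = ExpectedDistinct(4096, total);      // ~4096: essentially every first-level folder is used
        double perLevel1 = total / level1;                   // ~244 hashes per first-level folder
        double level2 = ExpectedDistinct(4096, perLevel1);   // ~237 second-level folders per first-level folder
        double perLevel2 = perLevel1 / level2;               // ~1 third-level folder per second-level folder

        Console.WriteLine("Level 1 folders: {0:F0}", level1);
        Console.WriteLine("Avg level 2 folders: {0:F0}", level2);
        Console.WriteLine("Avg level 3 folders: {0:F1}", perLevel2);
    }
}

That comes out to about 4,096 / 237 / 1, which lines up with the breakdown above.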
Anyone have any thoughts on the validity of such a test for determining directory structure heuristics?