2

In our site, users can have many private files. We are thinking what could be the best distribution so as to avoid destroying the server's performance, These files are served through Apache and should be listed each time the user needs to manage them.

Our first approach right now is:

var first_level = (int) $user_id/100;
var files_folder = /uf/$first_level/$user_id

This gives us a first level of 100 folders and many second level folders. Since not all users have files and right now we are at about ~80k users, this means about 800 folders per second level folder.

What do you think about this approach?

4

3 回答 3

1

If your user ID values are fairly uniformly distributed and the number will continue to increase, then you should probably balance up the tree a bit more. What's best depends in part on where you think you'll end up in terms of numbers. Big directories are slower to search than small ones. While 800 files is not awful, it isn't great either. If you want to stick with 2 tiers and you have N users (as your target population), then you should aim for sqrt(N) folders in the first tier, with sqrt(N) folders in each second tier directory. For N = 80,000, that means about 300 folders per level. If you want to consider a 3 tier arrangement, replace the square root with the cube root. You might also find that using modulo arithmetic gives you a smoother distribution. That is, the first level might be better calculated as:

var first_level = (int) ($user_id % 300);

Assuming your unidentified language uses % for its modulo operator.

CPAN uses a system based on 3 tiers: first tier is the first letter of the user's login ID; the second tier is the first two letters, and the third tier is the full login ID.

I read somewhere that some site (university-based, IIRC) found that first and last letter of name gave a good system.

于 2009-01-03T02:03:50.723 回答
1

A popular scalable folder naming scheme if you don't care about readability is something like that squid uses: <4-bit>/<8-bit>/<remaining-116-bit-of-md5-of-whatever-lookup-key> or <whatever-unique-key-you-have>, so for user-id 1 the folder path can be /c4/ca42/1.

In this case, the first level is up to 16 directory and second level is up to 256 directories.

The big advantage of this approach is that the distribution of the folders are statistically uniform, regardless whether you have holes or clusters in your userids/usernames (smaller user ids tend to get unused due to attrition.)

于 2009-01-03T02:12:49.940 回答
1

您没有说使用什么文件系统来存储文件。您应该很容易创建具有您期望的实际负载特征的随机目录树。然后,您可以进行实验,告诉您正在考虑的各种策略的性能。

我无法轻松找到有关哪些文件系统使用高效数据结构(如大型目录的 B 树)的信息。我确实找到了 MacOS HFS 的说法。我会研究 XFS 或其他高性能的日志文件系统。

于 2009-01-03T02:39:26.063 回答