
I have a project that scans one large file (2.5 GB) picking out strings, which it then writes to some subset of several hundred files.

Using plain buffered writes is fastest, but

  1. I worry about running out of file handles.
  2. I want to be able to watch the files' progress while they are being written.
  3. If the run is interrupted, I'd like to lose as little as possible. Incomplete files are still partially useful.

So instead I open each file in read/write mode, append the new line, and close it again.

This was fast enough much of the time, but I've found that on some operating systems this behaviour is a severe pessimization. The last time I ran it on a Windows 7 netbook I interrupted it after several days!

I could implement some kind of MRU file-handle manager that keeps a limited number of files open and flushes after so many writes to each. But is that overkill?

This must be a common situation; is there a "best practice", a "pattern"?

The current implementation is in Perl and has run on Linux, Solaris, and Windows, on everything from netbooks to phat servers. But I'm interested in the general problem: language-independent and cross-platform. I've considered writing the next version in C or node.js.


1 Answer


On Linux, you can open a lot of files (thousands) at once. You can limit the number of open handles in a single process with the setrlimit syscall or the ulimit shell builtin, and you can query them with the getrlimit syscall or via /proc/self/limits (or /proc/1234/limits for the process with pid 1234). The system-wide maximum number of open files is in /proc/sys/fs/file-max (on my system it is 1623114).
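As a minimal sketch (my own, not part of the original answer), this is how you might query the per-process limit with getrlimit and raise the soft limit up to the hard limit with setrlimit:

    /* query and raise RLIMIT_NOFILE, the per-process open-file limit */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* raising the soft limit up to the hard limit needs no privileges */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            perror("setrlimit");
        return 0;
    }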

So on Linux you could simply not bother and keep many files open at once.

Still, I would suggest maintaining a memoized cache of open file handles and reusing them where possible (with an MRU policy). Don't open and close each file too often; only close one when some limit has been reached (e.g. when an open fails).

In other words, you could have your own file abstraction (or just a struct) which knows the file name, may hold an open FILE* (or a null pointer), and keeps the current offset and perhaps the time of the last open or write; then manage a collection of such things in a FIFO discipline (for those that have an open FILE*). You certainly want to avoid closing (and later re-opening) a file descriptor too often; a rough sketch follows.
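Purely as an illustration (not code from this answer), here is one way such a struct and its eviction logic might look in C. The names (MAX_OPEN, cache_append) and the choice of evicting the handle with the oldest write are my own; any similar policy would do:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define MAX_OPEN 256   /* keep well under the soft RLIMIT_NOFILE */

    struct out_file {
        char   name[256];     /* target file name */
        FILE  *fp;            /* NULL when currently closed */
        long   offset;        /* where the next write will land */
        time_t last_write;    /* used to pick an eviction victim */
    };

    /* Append one line, opening the file lazily and evicting the handle
     * with the oldest write when too many are open. */
    static void cache_append(struct out_file *files, size_t nfiles,
                             size_t *nopen, struct out_file *f,
                             const char *line)
    {
        if (!f->fp) {
            if (*nopen >= MAX_OPEN) {
                struct out_file *victim = NULL;
                for (size_t i = 0; i < nfiles; i++)
                    if (files[i].fp &&
                        (!victim || files[i].last_write < victim->last_write))
                        victim = &files[i];
                if (victim) {
                    fclose(victim->fp);
                    victim->fp = NULL;
                    (*nopen)--;
                }
            }
            f->fp = fopen(f->name, "a");   /* append mode keeps partial output usable */
            if (!f->fp) return;            /* real code should report the error */
            (*nopen)++;
        }
        fputs(line, f->fp);
        f->offset += (long)strlen(line);
        f->last_write = time(NULL);
    }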

You might occasionally (i.e. once every few minutes) call sync(2), but don't call it too often (certainly not more than once every 10 seconds). If you are using buffered FILEs, don't forget to fflush them from time to time. Again, don't do that very often.
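A small illustration of rate-limiting those calls (my own sketch; FLUSH_INTERVAL and maybe_flush are invented names): flush the open handles and call sync at most once per interval.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define FLUSH_INTERVAL 30   /* seconds; for sync, minutes is plenty */

    static time_t last_flush = 0;

    static void maybe_flush(FILE *open_files[], size_t nopen)
    {
        time_t now = time(NULL);
        if (now - last_flush < FLUSH_INTERVAL)
            return;
        for (size_t i = 0; i < nopen; i++)
            if (open_files[i])
                fflush(open_files[i]);   /* push stdio buffers to the kernel */
        sync();                          /* ask the kernel to write dirty pages */
        last_flush = now;
    }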

answered 2012-09-06T21:29:18.867