
I'm processing data from one large file on a hard disk (the processing itself is fast, with little overhead) and then have to write the results back out as hundreds of thousands of files.

I started by writing the results straight to files, one at a time, which was the slowest option. I figured it gets a lot faster if I buffer a certain amount of output in a vector and then write it all at once, then go back to processing while the hard disk is occupied writing all the stuff I poured into it (that at least seems to be what happens).
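Here is a minimal sketch of the batching scheme I mean, with a hypothetical process_next() standing in for the actual processing step (names and the byte limit are only illustrative):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

struct Result {
    std::string filename;  // where this result should end up
    std::string data;      // the bytes to write
};

// Hypothetical stand-in for the actual processing step; returns false
// when the input file is exhausted.
bool process_next(Result& r);

// Write the whole batch back to back, then clear it.
void flush_batch(std::vector<Result>& batch) {
    for (const Result& r : batch) {
        std::ofstream out(r.filename, std::ios::binary);
        out.write(r.data.data(), static_cast<std::streamsize>(r.data.size()));
    }  // each file is closed as its ofstream goes out of scope
    batch.clear();
}

void run(std::size_t batch_limit_bytes) {
    std::vector<Result> batch;
    std::size_t batch_bytes = 0;
    Result r;
    while (process_next(r)) {
        batch_bytes += r.data.size();
        batch.push_back(std::move(r));
        if (batch_bytes >= batch_limit_bytes) {  // e.g. the ~10 MB sweet spot below
            flush_batch(batch);
            batch_bytes = 0;
        }
    }
    flush_batch(batch);  // write whatever is left at end of input
}
```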

My question is: can I somehow estimate a convergence value for the amount of data I should write at once from the hardware constraints? To me it seems to be a hard disk buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):

Buffer size      time (minutes)
------------------------------
no buffer        ~ 8:30
 1 MB            ~ 6:15
10 MB            ~ 5:45
50 MB            ~ 7:00

Or is this just a coincidence?

I would also be interested in experience/rules of thumb about how write performance can be optimized in general, for example whether larger hard disk blocks help, etc.

Edit:

Hardware is a pretty standard consumer drive (I'm a student, not a data center): WD 3.5″, 1 TB, 7200 rpm, 16 MB cache, connected over USB 2.0; the filesystem is journaled HFS+ and the OS is Mac OS X 10.5. I'll soon give it a try on ext3/Linux and with an internal rather than an external disk.


4 Answers


Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?

Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:

  • Which filesystem you're using

  • What disk-scheduling algorithm the kernel is using

  • The hardware characteristics of your disk

  • The hardware interconnect you're using

For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.

If I were you I'd take these two steps:

  • Split my program into multiple threads (or even processes) and use one thread to deliver the open, write, and close system calls to the OS as quickly as possible (see the sketch after this list). Bonus points if you can make the number of threads a run-time parameter.

  • Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
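A minimal sketch of the first step, assuming C++11 threads; the Result type and the single-queue policy here are illustrative, not a prescribed design:

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

struct Result { std::string filename, data; };

// One dedicated thread that issues open/write/close as fast as it can,
// draining a queue that the processing thread fills.
class WriterThread {
public:
    WriterThread() : done_(false), worker_(&WriterThread::run, this) {}
    ~WriterThread() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();  // flush everything before destruction
    }
    void submit(Result r) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(r)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty() && done_) return;
            Result r = std::move(q_.front());
            q_.pop();
            lk.unlock();  // do the slow I/O outside the lock
            std::ofstream out(r.filename, std::ios::binary);
            out.write(r.data.data(), static_cast<std::streamsize>(r.data.size()));
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Result> q_;
    bool done_;
    std::thread worker_;  // declared last so everything else is ready first
};
```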

Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
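And a sketch of the second step: rather than predicting from specs, time a few candidate batch sizes and persist the winner for later runs. write_100k_files() is a hypothetical stand-in for one complete run of your output phase:

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>

// Hypothetical stand-in for one complete run of the output phase
// using the given batch size; plug the real write loop in here.
void write_100k_files(std::size_t /*batch_bytes*/) {}

double time_run(std::size_t batch_bytes) {
    auto t0 = std::chrono::steady_clock::now();
    write_100k_files(batch_bytes);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const std::size_t mib = 1024 * 1024;
    const std::size_t candidates[] = {1 * mib, 5 * mib, 10 * mib, 25 * mib, 50 * mib};
    std::size_t best = candidates[0];
    double best_time = time_run(best);
    for (int i = 1; i < 5; ++i) {
        double t = time_run(candidates[i]);
        if (t < best_time) { best_time = t; best = candidates[i]; }
    }
    std::ofstream("tuned_batch_size.txt") << best;  // reread on later runs
}
```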

answered 2010-01-02T00:30:56.343

The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
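For illustration, a minimal way to keep several writes outstanding with std::async; the pool size of 8 and the Result type are assumptions to tune, not recommendations:

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <future>
#include <string>
#include <vector>

struct Result { std::string filename, data; };

void write_file(const Result& r) {
    std::ofstream out(r.filename, std::ios::binary);
    out.write(r.data.data(), static_cast<std::streamsize>(r.data.size()));
}

void write_all(const std::vector<Result>& results) {
    const std::size_t max_outstanding = 8;  // assumption: tune per machine
    std::vector<std::future<void>> pending;
    for (const Result& r : results) {
        pending.push_back(std::async(std::launch::async, write_file, std::cref(r)));
        if (pending.size() >= max_outstanding) {
            pending.front().get();          // wait for the oldest write
            pending.erase(pending.begin());
        }
    }
    for (auto& f : pending) f.get();        // drain the remaining writes
}
```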

That being said, you should also look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering under the hood, but if you're reading serially there isn't much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
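As a sketch of the read side on POSIX: pread takes an explicit offset, so several tasks can read different regions of the one big file concurrently. The chunk size, task count, and file name here are arbitrary:

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstddef>
#include <future>
#include <vector>

// Read one region of the file; pread uses an explicit offset, so no
// shared file position is mutated and concurrent calls don't race.
std::vector<char> read_chunk(int fd, off_t offset, std::size_t len) {
    std::vector<char> buf(len);
    ssize_t n = pread(fd, buf.data(), len, offset);
    buf.resize(n > 0 ? static_cast<std::size_t>(n) : 0);
    return buf;
}

int main() {
    int fd = open("input.dat", O_RDONLY);       // hypothetical input file
    if (fd < 0) return 1;
    const std::size_t chunk = 4 * 1024 * 1024;  // 4 MB per read (arbitrary)
    std::vector<std::future<std::vector<char>>> parts;
    for (int i = 0; i < 4; ++i)                 // four reads in flight (arbitrary)
        parts.push_back(std::async(std::launch::async, read_chunk,
                                   fd, static_cast<off_t>(i) * chunk, chunk));
    for (auto& p : parts) {
        std::vector<char> data = p.get();
        // ... process this chunk while the others are still being read
    }
    close(fd);
}
```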

answered 2010-01-01T23:15:02.883

Parsing XML should be possible at nearly the disk's read speed, tens of MB/s. Your SAX implementation may not manage that.

You might want to use some dirty tricks: writing 100,000+ files with the normal APIs is not going to be efficient.

Test this by first writing sequentially to a single file rather than to 100,000. Compare the performance. If the difference is interesting, read on.

If you really understand the filesystem you are writing to, you can make sure you write one contiguous block that you later split into multiple files within the directory structure.

In that case you want smaller blocks, not larger ones, since your files are going to be small. All the free space in a block gets zeroed.

[Edit] Do you really have an external requirement for those 100K files? A single file with an index might be sufficient.
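A minimal sketch of that alternative: append every record to one data file and keep (offset, length) pairs in a small side index. The file names and index layout are only illustrative:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Append one record to the data file and its (offset, length) entry
// to the index file, so individual results stay addressable later.
void append_record(std::ofstream& data, std::ofstream& index,
                   const std::string& payload) {
    std::uint64_t offset = static_cast<std::uint64_t>(data.tellp());
    std::uint64_t length = payload.size();
    data.write(payload.data(), static_cast<std::streamsize>(length));
    index.write(reinterpret_cast<const char*>(&offset), sizeof offset);
    index.write(reinterpret_cast<const char*>(&length), sizeof length);
}

int main() {
    std::ofstream data("results.dat", std::ios::binary);
    std::ofstream index("results.idx", std::ios::binary);
    append_record(data, index, "first result");
    append_record(data, index, "second result");
}
```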

answered 2010-01-02T00:16:09.490

Expanding on Norman's answer: if your files are all going into one filesystem, use only a single helper thread.

Communication between the reading thread and the writing helper consists of a two-std::vector double buffer per helper. (One buffer is owned by the writing process and one by the reading process.) The reading thread fills the buffer up to a specified limit and then blocks. The writing thread times the write speed with gettimeofday or whatever and adjusts the limit: if the write went faster than last time, increase the limit by X%; if slower, decrease it by X%. X can be small.
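A sketch of just the adjustment step, using std::chrono in place of gettimeofday; the 5% step is an arbitrary choice for X:

```cpp
#include <chrono>
#include <cstddef>

// Given how many bytes the last flush wrote and how long it took,
// grow or shrink the buffer limit; last_speed carries state between calls.
std::size_t adjust_limit(std::size_t limit, std::size_t bytes_written,
                         double seconds, double& last_speed) {
    const double x = 0.05;                   // X = 5% (arbitrary small step)
    double speed = bytes_written / seconds;  // bytes per second
    limit = static_cast<std::size_t>(
        speed > last_speed ? limit * (1.0 + x)    // faster than last time: grow
                           : limit * (1.0 - x));  // slower: shrink
    last_speed = speed;
    return limit;
}

// Timing one flush with std::chrono (a portable stand-in for gettimeofday):
//   auto t0 = std::chrono::steady_clock::now();
//   /* write the buffer */
//   double secs = std::chrono::duration<double>(
//       std::chrono::steady_clock::now() - t0).count();
```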

answered 2010-01-14T00:42:17.027