
I have a user-level program which opens a file with the flags O_WRONLY|O_SYNC. The program creates 256 threads, each of which attempts to write 256 or more bytes of data to the file. I want a total of 1,280,000 requests, which comes to roughly 300 MB of data; the program ends once all 1,280,000 requests have completed.

I use pthread_spin_trylock() to increment a variable that keeps track of the number of requests completed so far. To ensure that each thread writes to a unique offset, I use pwrite() and compute the offset as a function of the number of requests already written; hence no mutex is held while actually writing to the file. (Does this approach ensure data integrity?)
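For concreteness, the scheme looks roughly like the following sketch (the file name, payload, and error handling here are simplified placeholders rather than the actual program):

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS   256
#define REQ_SIZE   256
#define TOTAL_REQS 1280000

static int fd;                        /* shared file descriptor           */
static long completed;                /* requests handed out so far       */
static pthread_spinlock_t lock;       /* protects 'completed' only        */

static void *writer(void *arg)
{
    char buf[REQ_SIZE] = { 'x' };     /* payload content is irrelevant    */
    (void)arg;

    for (;;) {
        /* Spin until the trylock succeeds, then claim the next request. */
        while (pthread_spin_trylock(&lock) != 0)
            ;
        long req = completed;
        if (req >= TOTAL_REQS) {
            pthread_spin_unlock(&lock);
            break;                    /* all requests have been claimed   */
        }
        completed++;
        pthread_spin_unlock(&lock);

        /* Each request gets a unique, non-overlapping offset, so the
         * write itself needs no lock.                                    */
        off_t off = (off_t)req * REQ_SIZE;
        if (pwrite(fd, buf, REQ_SIZE, off) != REQ_SIZE)
            perror("pwrite");
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    /* O_SYNC: each write should be synchronized to the storage device
     * before pwrite() returns.                                           */
    fd = open("testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, writer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    return 0;
}
```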

When I compare the average time for which a pwrite() call blocks against the corresponding average Q2C times reported by blktrace (Q2C being the measure of the complete life cycle of a BIO, from queueing to completion), I find a significant difference: the average completion time of a BIO is much greater than the average latency of a pwrite() call. What is the reason for this discrepancy? Shouldn't these numbers be similar, since O_SYNC ensures that the data is actually written to the physical medium before returning?
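The per-call latency is measured with something along these lines (a simplified sketch; the exact instrumentation may differ), while the Q2C figures come from blktrace run separately on the underlying block device:

```c
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Time a single pwrite() with a monotonic clock; returns microseconds. */
static double timed_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pwrite(fd, buf, len, off);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```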


1 Answer


pwrite() is supposed to be atomic, so you should be safe there...

Regarding the difference in latencies between the system call and the actual BIO, consider the following information from the open(2) man page on kernel.org:

POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and the metadata necessary to retrieve it to be on disk by the time the system call returns.

So this basically means that with the O_SYNC flag, the entirety of the data you're attempting to write does not need to be flushed to disk before the system call returns, but rather only enough information to be able to retrieve it from disk... depending on what you're writing, that could be quite a bit less than the whole buffer of data you intended to write to disk, and so the actual write-out of all the data will take place at a later time, after the system call has completed and the process has moved on to other things.
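If you need the stronger guarantee, one option is to pair each write with an explicit fsync(), which flushes the file's metadata as well as its data; here is a minimal sketch (write_fully_synced is just an illustrative helper name, not an existing API):

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one block and explicitly flush both data and metadata.
 * Using fdatasync(fd) instead of fsync(fd) would give roughly the
 * O_DSYNC semantics described in the man page excerpt above.        */
static int write_fully_synced(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len) {
        perror("pwrite");
        return -1;
    }
    if (fsync(fd) != 0) {
        perror("fsync");
        return -1;
    }
    return 0;
}
```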

answered 2011-08-24 at 22:37