c - read() system call page fault doesn't depend on file size

Question

I am reading files of different sizes (1KB - 1GB) using read() in C. But every time I check the page-faults using perf-stat, it gives me (almost) the same values.

My machine: (Fedora 18 on a virtual machine, RAM - 1GB, disk space - 20 GB)

uname -a
Linux localhost.localdomain 3.10.13-101.fc18.x86_64 #1 SMP Fri Sep 27 20:22:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

mount | grep "^/dev"
/dev/mapper/fedora-root on / type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)

My code:

 10 #define BLOCK_SIZE 1024
. . . 
 19         char text[BLOCK_SIZE];
 21         int total_bytes_read=0;
. . .

 81         while((bytes_read=read(d_ifp,text,BLOCK_SIZE))>0)
 82         {
 83                 write(d_ofp, text, bytes_read); // writing to /dev/null
 84                 total_bytes_read+=bytes_read;
 85                 sum+=(int)text[0];  // doing this just to make sure there's 
                                             // no lazy page loading by read()
                                             // I don't care what is in `text[0]`
 86         }
 87         printf("total bytes read=%d\n", total_bytes_read);
 88         if(sum>0)
 89                 printf("\n");

Perf-stat output: (shows file size, time to read the file and the # of page faults)

[read]:   f_size:    1K B, Time:  0.000313 seconds, Page-faults: 150, Total bytes read: 980 
[read]:   f_size:   10K B, Time:  0.000434 seconds, Page-faults: 151, Total bytes read: 11172
[read]:   f_size:  100K B, Time:  0.000442 seconds, Page-faults: 150, Total bytes read: 103992
[read]:   f_size:    1M B, Time:  0.00191  seconds, Page-faults: 151, Total bytes read: 1040256
[read]:   f_size:   10M B, Time:  0.050214 seconds, Page-faults: 151, Total bytes read: 10402840 
[read]:   f_size:  100M B, Time:  0.2382   seconds, Page-faults: 150, Total bytes read: 104028372 
[read]:   f_size:    1G B, Time:  5.7085   seconds, Page-faults: 148, Total bytes read: 1144312092

Questions:
1. How can the page-faults for a read() of a 1KB file and of a 1GB file be the same? Since I am also reading the data (code line #85), I am making sure the data is actually being read.
2. The only reason I can think of for not seeing that many page-faults is that the data is already present in main memory. If that is the case, how can I flush it so that running my code shows the true page-faults? Otherwise I can never measure the true performance of read().

Edit1:
echo 3 > /proc/sys/vm/drop_caches doesn't help; the output remains the same.

Edit2: For mmap, the output of perf-stat is:

[mmap]:   f_size:    1K B, Time:  0.000103 seconds, Page-faults: 14
[mmap]:   f_size:   10K B, Time:  0.001143 seconds, Page-faults: 151
[mmap]:   f_size:  100K B, Time:  0.002367 seconds, Page-faults: 174
[mmap]:   f_size:    1M B, Time:  0.007634 seconds, Page-faults: 401
[mmap]:   f_size:   10M B, Time:  0.06812  seconds, Page-faults: 2,688
[mmap]:   f_size:  100M B, Time:  0.60386  seconds, Page-faults: 25,545
[mmap]:   f_size:    1G B, Time:  4.9869   seconds, Page-faults: 279,519

score 5 · Accepted Answer

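I think you don't understand exactly what a page fault is. According to Wikipedia, a page fault is a "trap" (exception), a kind of interrupt, generated by the CPU itself when a program tries to access something that is not loaded into physical memory (but is usually already registered in virtual memory, with its page marked "not present"; P: Present bit = 0).

A page fault is bad because it forces the CPU to stop executing the user program and switch to the kernel. Page faults in kernel mode are uncommon, because the kernel can check whether a page is present before accessing it. If a kernel function wants to write something to a new page (in your case, the read syscall), it allocates the page by calling the page allocator explicitly, not by trying to access it and taking a page fault. Explicit memory management means fewer interrupts and less code to execute.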

--- read case ---

Your read is handled by sys_read in fs/read_write.c. Here is the call chain (possibly not exact):

472 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
479                 ret = vfs_read(f.file, buf, count, &pos);
  vvv
353 ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
368                         ret = file->f_op->read(file, buf, count, pos);
  vvv

fs/ext4/file.c

626 const struct file_operations ext4_file_operations = {
628         .read           = do_sync_read,

... do_sync_read -> generic_file_aio_read -> do_generic_file_read

mm/filemap.c

1100 static void do_generic_file_read(struct file *filp, loff_t *ppos,
1119         for (;;) {
1120                 struct page *page;
1127                 page = find_get_page(mapping, index);
1128                 if (!page) {
1134                                 goto no_cached_page;  
  // osgx - case when pagecache is empty  ^^vv
1287 no_cached_page:
1288                 /*
1289                  * Ok, it wasn't cached, so we need to create a new
1290                  * page..
1291                  */
1292                 page = page_cache_alloc_cold(mapping);

include/linux/pagemap.h

233 static inline struct page *page_cache_alloc_cold(struct address_space *x)
235         return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
  vvv
222 static inline struct page *__page_cache_alloc(gfp_t gfp)
224         return alloc_pages(gfp, 0);

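So I can trace that the read() syscall ends in page allocation (alloc_pages) via direct calls. After allocating the page, the read() syscall does a DMA transfer of the data from the HDD into the new page, copies the data into the user buffer, and then returns to the user (considering the case where the file was not cached in the page cache). If the data is already in the page cache, read() (do_generic_file_read) reuses the existing page from the page cache and copies from it, without actually reading the HDD.

After read() returns, all the data is in memory, and read accesses to it will not generate a page fault. Note also that the 1 KB user buffer is reused on every call, so it occupies the same one or two pages regardless of the file size.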
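To check this without perf, here is a minimal sketch, assuming Linux with glibc, that counts the page faults taken only while the read() loop runs, using getrusage() (ru_minflt and ru_majflt). The helper names faults_so_far and faults_during_read are hypothetical, made up for this illustration. The count should stay small and roughly constant for any file size, which suggests that the ~150 faults reported by perf stat come mostly from process startup and not from the loop.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

/* Page faults (minor + major) taken by this process so far, or -1 on error. */
static long faults_so_far(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Hypothetical helper: read the whole file through one small buffer and
 * return the page faults taken during the loop only, or -1 on error.
 */
long faults_during_read(const char *path)
{
    char buf[1024];
    unsigned long sum = 0;
    ssize_t n;
    long before, after;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    memset(buf, 0, sizeof buf);   /* fault the buffer's pages in up front */
    before = faults_so_far();
    while ((n = read(fd, buf, sizeof buf)) > 0)
        sum += (unsigned char)buf[0];   /* touch the data, as in the question */
    after = faults_so_far();
    close(fd);

    (void)sum;
    if (n < 0 || before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling faults_during_read() on the 1KB file and on the 1GB file and printing both results makes the comparison direct, because startup faults are excluded from the measurement.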

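--- mmap case ---

If you rewrite the test to mmap() your file and then access (text[offset]) a non-present page of the file (one that is not in the page cache), a real page fault will occur.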
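A minimal sketch of such a test, assuming Linux with glibc: it maps the file read-only, touches one byte per page, and reports how many page faults getrusage() recorded during the scan. The names faults_so_far and faults_during_mmap_scan are hypothetical. The exact count depends on the kernel (readahead and similar optimizations can map several pages per fault), so no particular number is implied here, only that it should grow with the file size.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

/* Page faults (minor + major) taken by this process so far, or -1 on error. */
static long faults_so_far(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Hypothetical helper: mmap() the file, touch one byte in every page, and
 * return the page faults taken during the scan, or -1 on error.
 */
long faults_during_mmap_scan(const char *path)
{
    struct stat st;
    const unsigned char *map;
    volatile unsigned long sum = 0;   /* volatile: keep the loads alive */
    long before, after;
    long page = sysconf(_SC_PAGESIZE);
    size_t len, off;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (page <= 0 || fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    len = (size_t)st.st_size;

    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    before = faults_so_far();
    for (off = 0; off < len; off += (size_t)page)
        sum += map[off];              /* first touch of a page may fault */
    after = faults_so_far();
    munmap((void *)map, len);

    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Unlike the read() loop, the faults here happen in user code: each first access to an unmapped page traps into the kernel, which is what the counters record.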

All page fault counters (perf stat and /proc/$pid/stat) are updated ONLY when the CPU generates a real page fault trap. Here is the x86 page fault handler in arch/x86/mm/fault.c, which does the work:

1224 dotraplinkage void __kprobes
1225 do_page_fault(struct pt_regs *regs, unsigned long error_code)
1230         __do_page_fault(regs, error_code);
  vvv
1001 /*
1002  * This routine handles page faults.  It determines the address,
1003  * and the problem, and then passes it off to one of the appropriate
1004  * routines.
1005  */
1007 __do_page_fault(struct pt_regs *regs, unsigned long error_code)
 /// HERE is the perf stat pagefault event generator VVV 
1101         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

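Somewhere later, the page fault handler calls handle_mm_fault -> handle_pte_fault -> __do_fault, ending in vma->vm_ops->fault(vma, &vmf);.

This fault virtual function was registered by mmap, and I think it is filemap_fault. This function does the actual page allocation (__alloc_page) and the disk read when the page cache is empty (counted as a "major" page fault), or remaps the page from the page cache if the data is already there (counted as a "minor" page fault, because it is done without external I/O and is generally faster).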

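PS: Doing experiments on a virtual platform may change things; for example, even after clearing the disk cache (page cache) in the guest Fedora with echo 3 > /proc/sys/vm/drop_caches, data from the virtual hard drive can still be cached by the host OS.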
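As a per-file alternative to drop_caches, a program can ask the kernel to evict one file's cached pages with posix_fadvise() and POSIX_FADV_DONTNEED. The sketch below assumes Linux with glibc; drop_file_cache is a hypothetical name. The call is only advice, so the kernel may keep some pages, and it has no effect on whatever the host OS caches underneath a virtual machine.

```c
#define _POSIX_C_SOURCE 200809L   /* expose posix_fadvise() under strict -std= */
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to drop the cached pages of one file.
 * Returns 0 on success, -1 if the file cannot be opened, or the positive
 * error number returned by posix_fadvise().
 */
int drop_file_cache(const char *path)
{
    int rc;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    fsync(fd);   /* dirty pages cannot be dropped, so write them back first */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);   /* len 0 = to EOF */
    close(fd);
    return rc;
}
```

Calling this on the test file before each timed run gives a cold guest page cache for that file only, without needing root or disturbing the rest of the system.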
