10

I am writing a routine to compare two files using memory-mapped file. In case files are too big to be mapped at one go. I split the files and map them part by part. For example, to map a 1049MB file, I split it into 512MB + 512MB + 25MB.

Every thing works fine except one thing: it always take much, much longer to compare the remainder (25MB in this example), though the code logic is exactly the same. 3 observations:

  1. it does not matter which is compared first, whether the main part (512MB * N) or the remainder (25MB in this example) comes first, the result remains the same
  2. the extra time in the remainder seems to be spent in the user mode
  3. Profiling in VS2010 beta 1 shows, the time is spent inside t std::_Equal(), but this function is mostly (profiler says 100%) waiting for I/O and other threads.

I tried

  • changing the VIEW_SIZE_FACTOR to another value
  • replacing the lambda functor with a member function
  • changing the file size under test
  • changing the order of execution of the remainder to before/after the loop

The result was quite consistent: it takes a lot more time in the remainder part and in the User Mode.

I suspect it has something to do with the fact that the mapped size is not a multiple of mapping alignment (64K on my system), but not sure how.

Below is the complete code for the routine and a timing measured for a 3G file.

Can anyone please explain it, Thanks?

// using memory-mapped file
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
    bool operator()(const path_type& p1, const path_type& p2)
    {
        using boost::filesystem::exists;
        using boost::filesystem::file_size;

        try
        {
            if(!(exists(p1) && exists(p2))) return false;

            const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;  

            // lanmbda 
            boost::function<bool(size_t, size_t)> segment_compare = 
            [&](size_t seg_size, size_t offset)->bool 
            {
                using boost::iostreams::mapped_file_source;
                boost::chrono::run_timer t;     

                mapped_file_source mf1, mf2;  

                mf1.open(p1, seg_size, offset);
                mf2.open(p2, seg_size, offset);

                if(! (mf1.is_open() && mf2.is_open())) return false;

                if(!equal (mf1.begin(), mf1.end(), mf2.begin())) return false;  

                return true;
            };

            boost::uintmax_t size = file_size(p1);
            size_t round     = size / segment_size;
            size_t remainder = size & ( segment_size - 1 );

            // compare the remainder
            if(remainder > 0)
            {
                cout << "segment size = " 
                     << remainder 
                     << " bytes for the remaining round";
                if(!segment_compare(remainder, segment_size * round)) return false;    
            }   

            //compare the main part.  take much less time, even 
            for(size_t i = 0; i < round; ++i)
            {
                cout << "segment size = " 
                     << segment_size 
                     << " bytes, round #" << i;
                if(!segment_compare(segment_size, segment_size * i))  return false;
            }
        }
        catch(std::exception& e)
        {
            cout << e.what();
            return false;
        }

        return true;                                      
    }
};

typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap;  // 512MB  

output:

segment size = 354410496 bytes for the remaining round

real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s

segment size = 536870912 bytes, round #0

real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s

segment size = 536870912 bytes, round #1

real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s

segment size = 536870912 bytes, round #2

real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s

segment size = 536870912 bytes, round #3

real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s


More observations after the suggestions by responders

Further split the remainder into body and tail(remainder = body + tail), where

  • body = N * alignment(), and tail < 1 * alignment()
  • body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even.
  • body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is exponents of 2.
  • body = N * alignment(), and tail = remainder - body. N is random.

the total time remains unchanged, but I can see that the time does not necessary relate to tail, but to size of body and tail. the bigger part takes more time. The time is USER TIME, which is most incomprehensible to me.

I also look at the pages faults through Procexp.exe. the remainder does NOT take more faults than the main loop.


Updates 2

I've performed some test on other workstations, and it seem the issue is related to the hardware configuration.

Test Code

// compare the remainder, alternative way
if(remainder > 0)
{
    //boost::chrono::run_timer t;       
    cout << "Remainder size = " 
         << remainder 
         << " bytes \n";

    size_t tail = (alignment_size - 1) & remainder;
    size_t body = remainder - tail;

{
    boost::chrono::run_timer t;                               
    cout << "Remainder_tail size = " << tail << " bytes";
    if(!segment_compare(tail, segment_size * round + body)) return false;
}                        
{
    boost::chrono::run_timer t;                               
    cout << "Remainder_body size = " << body << " bytes";
    if(!segment_compare(body, segment_size * round)) return false; 
}                        

}

Observation:

On another 2 PCs with the same h/w configurations with mine, the result is consistent as following:

------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------

Remainder size = 44840960 bytes

Remainder_tail size = 14336 bytes

real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s

Remainder_body size = 44826624 bytes

real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s

segment size = 67108864 bytes, total round# = 19

real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s

However, running the same code on a PC with a different h/w configuration yielded:

------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------ Remainder size = 44840960 bytes

Remainder_tail size = 14336 bytes

real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s

Remainder_body size = 44826624 bytes

real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s

segment size = 67108864 bytes, total round# = 19

real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s

System Info

My workstation yielding imcomprehensible timing:

OS Name: Microsoft Windows XP Professional

OS Version: 5.1.2600 Service Pack 3 Build 2600

OS Manufacturer: Microsoft Corporation

OS Configuration: Member Workstation

OS Build Type: Uniprocessor Free

Original Install Date: 2004-01-27, 23:08

System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds

System Manufacturer: Dell Inc.

System Model: OptiPlex GX520

System type: X86-based PC

Processor(s): 1 Processor(s) Installed.

                       [01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz

BIOS Version: DELL - 7

Windows Directory: C:\WINDOWS

System Directory: C:\WINDOWS\system32

Boot Device: \Device\HarddiskVolume2

System Locale: zh-cn;Chinese (China)

Input Locale: zh-cn;Chinese (China)

Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi

Total Physical Memory: 3,574 MB

Available Physical Memory: 1,986 MB

Virtual Memory: Max Size: 2,048 MB

Virtual Memory: Available: 1,916 MB

Virtual Memory: In Use: 132 MB

Page File Location(s): C:\pagefile.sys

NetWork Card(s): 3 NIC(s) Installed.

       [01]: VMware Virtual Ethernet Adapter for VMnet1

             Connection Name: VMware Network Adapter VMnet1

             DHCP Enabled:    No

             IP address(es)

             [01]: 192.168.75.1

       [02]: VMware Virtual Ethernet Adapter for VMnet8

             Connection Name: VMware Network Adapter VMnet8

             DHCP Enabled:    No

             IP address(es)

             [01]: 192.168.230.1

       [03]: Broadcom NetXtreme Gigabit Ethernet

             Connection Name: Local Area Connection 4

             DHCP Enabled:    Yes

             DHCP Server:     10.8.0.31

             IP address(es)

             [01]: 10.8.8.154

Another workstation yielding "correct" timing: OS Name: Microsoft Windows XP Professional

OS Version: 5.1.2600 Service Pack 3 Build 2600

OS Manufacturer: Microsoft Corporation

OS Configuration: Member Workstation

OS Build Type: Multiprocessor Free

Original Install Date: 5/18/2009, 2:28:18 PM

System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds

System Manufacturer: Dell Inc.

System Model: OptiPlex 755

System type: X86-based PC

Processor(s): 1 Processor(s) Installed.

        [01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz

BIOS Version: DELL - 15

Windows Directory: C:\WINDOWS

System Directory: C:\WINDOWS\system32

Boot Device: \Device\HarddiskVolume1

System Locale: zh-cn;Chinese (China)

Input Locale: en-us;English (United States)

Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi

Total Physical Memory: 3,317 MB

Available Physical Memory: 1,682 MB

Virtual Memory: Max Size: 2,048 MB

Virtual Memory: Available: 2,007 MB

Virtual Memory: In Use: 41 MB

Page File Location(s): C:\pagefile.sys

NetWork Card(s): 3 NIC(s) Installed.

       [01]: Intel(R) 82566DM-2 Gigabit Network Connection

             Connection Name: Local Area Connection

             DHCP Enabled:    Yes

             DHCP Server:     10.8.0.31

             IP address(es)

             [01]: 10.8.0.137

       [02]: VMware Virtual Ethernet Adapter for VMnet1

             Connection Name: VMware Network Adapter VMnet1

             DHCP Enabled:    Yes

             DHCP Server:     192.168.154.254

             IP address(es)

             [01]: 192.168.154.1

       [03]: VMware Virtual Ethernet Adapter for VMnet8

             Connection Name: VMware Network Adapter VMnet8

             DHCP Enabled:    Yes

             DHCP Server:     192.168.2.254

             IP address(es)

             [01]: 192.168.2.1

Any explanation theory? Thanks.

4

6 回答 6

4

这种行为看起来很不合逻辑。我想知道如果我们尝试一些愚蠢的事情会发生什么。如果整个文件大于 512MB,您可以再次比较最后一部分的完整 512MB,而不是剩余大小。

就像是:

        if(remainder > 0)
        {
            cout << "segment size = " 
                 << remainder 
                 << " bytes for the remaining round";
                if (size > segment_size){
                    block_size = segment_size;
                    offset = size - segment_size;
                }
                else{
                    block_size = remainder;
                    offset = segment_size * i
                }
            if(!segment_compare(block_size, offset)) return false;    
        }   

这似乎是一件非常愚蠢的事情,因为我们会两次比较文件的一部分,但如果您的分析数据准确,它应该会更快。

它(还)不会给我们答案,但如果它确实更快,这意味着我们正在寻找的响应在于您的程序对小数据块所做的事情。

于 2009-09-04T22:05:43.343 回答
2

我知道这不是您问题的确切答案;但是您是否尝试过回避整个问题——即一次映射整个文件?

我对Win32内存管理知之甚少;但在 Linux 上,您可以将MAP_NORESERVE标志与 一起使用mmap(),因此您无需为整个文件大小保留 RAM。考虑到您只是从这两个文件中读取,如果内存不足,操作系统应该能够随时丢弃页面......

于 2009-09-05T12:02:33.847 回答
2

您要比较的文件的碎片程度如何?您可以使用FSCTL_GET_RETRIEVAL_POINTERS来获取文件映射到磁盘上的范围。我怀疑最后 25 MB 会有很多小范围来说明您测量的性能。

于 2009-09-03T16:28:19.000 回答
2

我想知道当段的页数不是偶数时, mmap 的行为是否奇怪?也许您可以尝试通过将段大小逐渐减半来处理文件的最后部分,直到达到小于 mapped_file_source::alignment() 的大小并专门处理最后一点。

此外,您说您正在执行 512MB 块,但您的代码将大小设置为 8<<10。然后它乘以 mapped_file_source::alignment()。mapped_file_source::alignment() 真的是 65536 吗?

我建议,为了更便携并减少混淆,您只需使用模板参数中给定的大小,并简单地要求它是代码中 mapped_file_source::alignment() 的偶数倍。或者让人们通过 2 的幂来开始块大小,或者其他什么。将块大小作为模板参数传入然后乘以一些奇怪的实现定义的常量似乎有点奇怪。

于 2009-09-03T21:18:21.403 回答
1

出于好奇,我会在 Linux 或 BSD 上尝试一下,看看它是如何工作的。

我对这个问题有一个非常粗略的猜测:我打赌 Windows 正在做很多额外的检查以确保它不会映射到文件末尾。过去,某些操作系统存在安全问题,允许 mmap 用户查看文件系统私有数据或地图末尾区域内其他文件的数据,因此此处小心对于操作系统设计人员来说是个好主意. 因此,Windows 可能会使用更谨慎的“将数据从磁盘复制到内核,将未映射的数据归零,将数据复制到用户”,而不是更快的“将数据从磁盘复制到用户”。

尝试映射到文件末尾的下方,不包括不适合 64K 块的最后一个字节。

于 2009-09-05T05:09:26.383 回答
0

会不会是病毒扫描程序导致了这些奇怪的结果?您是否尝试过不使用病毒扫描程序?

问候,

塞巴斯蒂安

于 2009-09-07T06:47:26.110 回答