c++ - Looping file mapping kills performance

Question

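I have a circular buffer backed by file-mapped memory (the buffer is in the size range of 8GB-512GB).

I am writing to (8 instances of) this memory sequentially from beginning to end, at which point it loops around back to the beginning.

It works fine until it reaches the end, where it needs to perform two file mappings and loop around in memory. At that point IO performance is completely trashed and does not recover (even after several minutes). I can't quite figure it out.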

using namespace boost::interprocess;

class mapping
{
public:

  mapping()
  {
  }

  mapping(file_mapping& file, mode_t mode, std::size_t file_size, std::size_t offset, std::size_t size)
    : offset_(offset)
    , mode_(mode)
  {     
    const auto aligned_size         = page_ceil(size + page_size());
    const auto aligned_file_size    = page_floor(file_size);
    const auto aligned_file_offset  = page_floor(offset % aligned_file_size);
    const auto region1_size         = std::min(aligned_size, aligned_file_size - aligned_file_offset);
    const auto region2_size         = aligned_size - region1_size;

    if (region2_size)
    {
      const auto region1_address  = mapped_region(file, read_only, 0, (region1_size + region2_size) * 2).get_address(); 
      const auto region2_address  = reinterpret_cast<char*>(region1_address) + region1_size;  

      region1_ = mapped_region(file, mode, aligned_file_offset, region1_size, region1_address);
      region2_ = mapped_region(file, mode, 0,                   region2_size, region2_address);
    }
    else
    {
      region1_ = mapped_region(file, mode, aligned_file_offset, region1_size);
      region2_ = mapped_region();
    }

    size_ = region1_.get_size() + region2_.get_size();
    offset_ = aligned_file_offset;
  }

  auto offset() const   -> std::size_t  { return offset_; }
  auto size() const     -> std::size_t  { return size_; }
  auto data() const     -> const void*  { return region1_.get_address(); }
  auto data()           -> void*        { return region1_.get_address(); }
  auto flush(bool async = true) -> void
  {
    region1_.flush(async);
    region2_.flush(async);
  }
  auto mode() const -> mode_t { return mode_; }

private:
  std::size_t   offset_ = 0;
  std::size_t   size_ = 0;
  mode_t        mode_;
  mapped_region region1_;
  mapped_region region2_;
};

struct loop_mapping::impl final
{     
  std::tr2::sys::path         file_path_;
  file_mapping                file_mapping_;    
  std::size_t                 file_size_;
  std::size_t                 map_size_     = page_floor(256000000ULL);

  std::shared_ptr<mapping>    mapping_ = std::shared_ptr<mapping>(new mapping());
  std::shared_ptr<mapping>    prev_mapping_;

  bool                        write_;

public:
  impl(std::tr2::sys::path path, bool write)
    : file_path_(std::move(path))
    , file_mapping_(file_path_.string().c_str(), write ? read_write : read_only)
    , file_size_(page_floor(std::tr2::sys::file_size(file_path_)))
    , write_(write)
  {     
    REQUIRE(file_size_ >= map_size_ * 3);
  }

  ~impl()
  {
    prev_mapping_.reset();
    mapping_.reset();
  }

  auto data(std::size_t offset, std::size_t size, boost::optional<bool> write_opt) -> void*
  { 
    offset = offset % page_floor(file_size_);

    REQUIRE(size < file_size_ - map_size_ * 3);

    const auto write = write_opt.get_value_or(write_);

    REQUIRE(!write || write_);          

    if ((write && mapping_->mode() == read_only) || offset < mapping_->offset() || offset + size >= mapping_->offset() + mapping_->size())
    {
      auto new_mapping = std::make_shared<loop::mapping>(file_mapping_, write ? read_write : read_only, file_size_, page_floor(offset), std::max(size + page_size(), map_size_));

      if (mapping_)
        mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));

      if (prev_mapping_)
        prev_mapping_->flush(false);

      prev_mapping_ = std::move(mapping_);
      mapping_    = std::move(new_mapping);
    }

    return reinterpret_cast<char*>(mapping_->data()) + offset - mapping_->offset();
  }
};

-

// 8 processes to 8 different files 128GB each.
loop_mapping loop(...);
for (auto n = 0; true; ++n)
{
     auto src = get_new_data(5000000/8);
     auto dst = loop.data(n * 5000000/8, 5000000/8, true);
     std::memcpy(dst, src, 5000000/8); // This becomes very slow after loop around.
     std::this_thread::sleep_for(std::chrono::seconds(1));
}

Any ideas?

Target system:

1x 3TB Seagate Constellation ES.3
2x Xeon E5-2400 (6 cores, 2.6 GHz)
6x 8GB DDR3 1600 MHz ECC
Windows Server 2012

score 1 · Accepted Answer

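Since your code has no comments at all, is full of auto variables, does not compile as-is, and I don't have 512 GB free on my PC to test it, this is only a thought off the top of my head.

Each of your processes writes only a few hundred KB/s, so there should be plenty of time to flush the data to disk in the background.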
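The write rate can be checked with a few lines of arithmetic. This is a minimal sketch that takes the numbers in the question's loop literally (5,000,000 bytes split across 8 writers, one chunk per second) and assumes that "128GB" means 128 * 10^9 bytes. The helper names are hypothetical and do not appear in the question.

```cpp
#include <cassert>
#include <cstdint>

// Chunk size used by each writer in the question's loop, written once per second.
constexpr std::uint64_t kChunkBytes = 5000000 / 8;

// Bytes written per second by all writers together.
inline std::uint64_t aggregate_bytes_per_second(std::uint64_t writers) {
    return kChunkBytes * writers;
}

// Whole seconds one writer needs to fill its file once (one pass of the loop).
inline std::uint64_t seconds_per_pass(std::uint64_t file_bytes) {
    assert(file_bytes >= kChunkBytes);
    return file_bytes / kChunkBytes;
}
```

With these numbers each writer produces 625,000 bytes per second (about 610 KiB/s), all eight together produce 5 MB/s, and one pass over a 128 GB file takes 204,800 seconds, roughly 57 hours.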

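However, you seem to ask the boost mapping system to flush the previous block either synchronously or asynchronously, depending on your mysterious offset computation: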

mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));

My guess is that the wrap-around triggers a synchronous flush, which is likely the culprit for the sudden slowdown.
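To see which case yields which flag, the expression can be evaluated on its own. This is a minimal sketch: `flush_argument` is a hypothetical helper that only mirrors the comparison from the question, and no Boost call is made.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the expression passed to mapping::flush() in the question.
// Returns true only when the new window starts at a lower file offset than
// the previous one, i.e. when the writer has wrapped around.
inline bool flush_argument(std::size_t new_offset,
                           std::size_t old_offset,
                           std::size_t file_size) {
    assert(file_size > 0);
    return (new_offset % file_size) < (old_offset % file_size);
}
```

The result is false on every forward step and true only at the wrap. The parameter of the question's `mapping::flush` is named `async`, so it is worth confirming in a debugger which of the two cases actually blocks.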

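What the OS does at that point depends on the boost implementation, which is not described (or at least not obviously enough for me to find it after a cursory look at their manual page). If boost has filled 48 GB of your memory with unflushed pages, you will certainly see a sudden and lengthy slowdown.

If this mysterious line does something clever and completely different that I missed entirely, it at least deserves a comment in your code.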

score 1 · Accepted Answer

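Eight buffers of 8 to 512 GiB each on a system with 48 GiB of physical memory mean that your mappings have to be swapped. No surprise there. The problem, as you have already noted yourself, is that before you can write to a page you take a page fault, and the page is read in. This does not happen on the first pass because only zero pages are used then. To make things worse, reading the pages back in competes with the write-behind of dirty pages.

Unfortunately, there is no way to tell Windows "I am going to overwrite this anyway", nor is there a way to make the disk load your data faster. However, you can start the transfer earlier (maybe when you are 3/4 of the way through the buffer).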

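Windows Server 2012 (which you are using) supports PrefetchVirtualMemory, which is a half-hearted substitute for POSIX madvise(MADV_WILLNEED).

Of course, this is not exactly what you want to do when you already know that you are going to overwrite the whole memory page (or several of them) anyway, but it is as good as it gets. It is worth a try in any case.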
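A prefetch hint for a range of an existing mapping might look like the sketch below. `prefetch_range` and `page_align_range` are hypothetical helper names. The Windows branch assumes the SDK headers target Windows 8 / Server 2012 or later, and the POSIX branch uses `madvise(MADV_WILLNEED)` so the same call site works on both platforms. The hint is advisory, and the sketch makes no claim about how much latency it hides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

#ifdef _WIN32
#  include <windows.h>   // PrefetchVirtualMemory needs _WIN32_WINNT >= 0x0602
#else
#  include <sys/mman.h>
#  include <unistd.h>
#endif

// Expands [addr, addr + len) outward to whole pages. `page` must be a power of two.
// Returns the aligned start address and the aligned length.
inline std::pair<std::uintptr_t, std::size_t>
page_align_range(std::uintptr_t addr, std::size_t len, std::size_t page) {
    assert(page != 0 && (page & (page - 1)) == 0);
    const std::uintptr_t mask  = ~static_cast<std::uintptr_t>(page - 1);
    const std::uintptr_t start = addr & mask;
    const std::uintptr_t end   = (addr + len + page - 1) & mask;
    return {start, static_cast<std::size_t>(end - start)};
}

// Hints that the given range of an existing mapping will be touched soon.
// Returns false if the OS rejects the hint.
inline bool prefetch_range(void* addr, std::size_t len) {
#ifdef _WIN32
    WIN32_MEMORY_RANGE_ENTRY entry;
    entry.VirtualAddress = addr;
    entry.NumberOfBytes  = len;
    return PrefetchVirtualMemory(GetCurrentProcess(), 1, &entry, 0) != 0;
#else
    // madvise requires a page-aligned start address.
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const auto r = page_align_range(reinterpret_cast<std::uintptr_t>(addr), len, page);
    return madvise(reinterpret_cast<void*>(r.first), r.second, MADV_WILLNEED) == 0;
#endif
}
```

A writer could call `prefetch_range` on the start of the next window once it is about three quarters of the way through the current one, as suggested above.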

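Ideally, you would want to do something like a destructive madvise(MADV_DONTNEED), as implemented under Linux (and, I believe, FreeBSD too), immediately before overwriting a page, but I am not aware of any way to do this under Windows (short of destroying the view and the mapping and mapping from scratch, but then you throw away all the data, so that is a bit useless).

Even with prefetching early on, you will still be limited by disk I/O bandwidth, but at least you can hide the latency.

Another "obvious" (but probably not so easy) solution is to make the consumer faster. That would allow a smaller buffer to begin with, and even with a huge buffer it would keep the working set smaller (both producer and consumer force pages into RAM when they access them, so if the consumer accesses the data with less delay after the producer has written it, both will be using mostly the same set of pages). A smaller working set fits into RAM more easily. But I realize that you probably did not choose a multi-gigabyte buffer for no reason.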

score 1 · Accepted Answer

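If you are able to back your memory mapping with the page file rather than with a specific file, you can use the MEM_RESET flag with VirtualAlloc to prevent Windows from paging in the old contents.

I expect the main problem with this approach is that you cannot easily recover the disk space when you are done. It may also require changes to the system's page file settings; I believe it will work with the default settings, but not if a maximum page file size has been set.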
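A minimal sketch of the MEM_RESET idea, assuming the buffer is plain committed memory obtained from `VirtualAlloc` rather than a view of a file. The helper names are hypothetical. The non-Windows branch substitutes `madvise(MADV_DONTNEED)` on an anonymous private mapping only so that the sketch builds elsewhere; it is an analogue, not the same mechanism. In both cases the contents after the call are unspecified and must be rewritten before use.

```cpp
#include <cassert>
#include <cstddef>

#ifdef _WIN32
#  include <windows.h>
#else
#  include <sys/mman.h>
#endif

// Allocates page-file-backed (anonymous) memory; returns nullptr on failure.
inline void* alloc_anonymous(std::size_t bytes) {
    assert(bytes > 0);
#ifdef _WIN32
    return VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#else
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
#endif
}

// Tells the OS that the current contents are no longer of interest, so they
// need not be written out or read back in before the range is overwritten.
inline bool discard_contents(void* addr, std::size_t bytes) {
#ifdef _WIN32
    // The protection argument is ignored for MEM_RESET but must be valid.
    return VirtualAlloc(addr, bytes, MEM_RESET, PAGE_READWRITE) != nullptr;
#else
    return madvise(addr, bytes, MADV_DONTNEED) == 0;
#endif
}

// Releases memory obtained from alloc_anonymous.
inline void free_anonymous(void* addr, std::size_t bytes) {
#ifdef _WIN32
    (void)bytes;
    VirtualFree(addr, 0, MEM_RELEASE);
#else
    munmap(addr, bytes);
#endif
}
```

A writer would call `discard_contents` on the window it is about to overwrite, then write the new data as usual.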

score 0 · Accepted Answer

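I am going to assume that by "loop around" you mean that RAM has become full. What happens is that, until RAM is full, all you do is allocate a page and write to it (RAM speed). Once RAM is full, every page allocation turns into two operations: (1) you have to write a dirty page back (disk speed), and (2) allocate a page (RAM speed).

In the worst case, if you are reading from the mapping, you also have to fetch the page from the file on disk (disk speed). So every page allocation runs at disk speed instead of just RAM speed. This does not happen with 2x8GB because that is small enough for all the memory of both files to stay entirely in RAM.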

score 0 · Accepted Answer

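It turns out that the problem here is that when a valid page in memory is overwritten, the page first has to be read from the drive before it can be overwritten. As far as I know, there is no way around this when using memory-mapped files.

The reason this did not happen on the first pass is that the pages being overwritten were not "valid", so they did not need to be read back.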
