c++ - std::ifstream buffer caching

Question

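In my application I am trying to merge sorted files (keeping them sorted, of course), so I have to iterate over every element of both files and write the smaller one to a third file. This is very slow on large files. Since I don't see any alternative (the iteration has to be done), I am trying to optimize the file loading instead. I have a certain amount of RAM that I can use for buffering. What I mean is that, instead of reading 4 bytes at a time from each file, I could read something like 100 MB at once and then work from that buffer until it runs out of elements, at which point I would refill it. But I assume ifstream already does this. Would doing it myself give me any more performance, and is there any reason to? If fstream does buffer, can I perhaps change the size of its buffer?

Added

My current code looks like this (pseudocode):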

// this is done in a loop
int i1 = input1.read_integer();
int i2 = input2.read_integer();
if (!input1.eof() && !input2.eof())
{
   if (i1 < i2)
   {
      output.write(i1);
      // i2 was not consumed, so step input2 back by one int
      input2.seek_back(sizeof(int));
   } else {
      // i1 was not consumed, so step input1 back by one int
      input1.seek_back(sizeof(int));
      output.write(i2);
   }
} else {
   if (input1.eof())
      output.write(i2);
   else if (input2.eof())
      output.write(i1);
}

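What I don't like here:

seek_back - I have to go back to the previous position because there is no way to peek 4 bytes ahead
too many reads from the file
if one of the streams is at EOF, the loop keeps checking that stream instead of writing the rest of the other stream straight to the output, but this is not a big problem because the chunk sizes are almost always equal.

Can you suggest improvements?

Thanks.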

score 5 · Accepted Answer

Without getting into stream buffers, you can get rid of the seek_back and generally make the code simpler by doing the following:

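using namespace std;
// Merge two sorted text streams of ints; the iterators handle the look-ahead.
// The " " delimiter keeps the written numbers from running together.
merge(istream_iterator<int>(file1), istream_iterator<int>(),
      istream_iterator<int>(file2), istream_iterator<int>(),
      ostream_iterator<int>(cout, " "));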

Edit:

Added binary support.

#include <algorithm>
#include <iterator>
#include <fstream>
#include <iostream>

// Wraps an int so that operator>> reads it as raw bytes rather than as text.
struct BinInt
{
    int value;
    operator int() const { return value; }
    friend std::istream& operator>>(std::istream& stream, BinInt& data)
    {
        return stream.read(reinterpret_cast<char*>(&data.value),sizeof(int));
    }
};

int main()
{
    // Binary mode, so that no newline translation touches the raw bytes.
    std::ifstream   file1("f1.txt", std::ios::binary);
    std::ifstream   file2("f2.txt", std::ios::binary);

    // The merged values are printed as text, one per line.
    std::merge(std::istream_iterator<BinInt>(file1), std::istream_iterator<BinInt>(),
               std::istream_iterator<BinInt>(file2), std::istream_iterator<BinInt>(),
               std::ostream_iterator<int>(std::cout, "\n"));
}

score 3 · Accepted Answer

In descending order of performance (best first):

memory-mapped I/O
OS-specific ReadFile or read calls
fread into a large buffer
ifstream.read into a large buffer
ifstream and extractors
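
To make the first item on that list concrete, here is a minimal sketch of a merge over memory-mapped files. It assumes a POSIX system (`open`, `fstat`, `mmap`), files that hold raw native-endian `int` values, and enough memory to keep the merged result in a vector. `MappedInts` and `mergeMapped` are names made up for this illustration. It has not been benchmarked against the other options.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper (POSIX only): maps a whole file read-only and
// exposes its contents as a range of ints. An unreadable or empty file
// yields an empty range.
class MappedInts {
public:
    explicit MappedInts(const char* path) {
        assert(path != nullptr);
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            std::size_t len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) { base_ = p; bytes_ = len; }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedInts() { if (base_) ::munmap(base_, bytes_); }
    MappedInts(const MappedInts&) = delete;
    MappedInts& operator=(const MappedInts&) = delete;

    std::size_t size() const { return bytes_ / sizeof(int); }
    const int* begin() const { return static_cast<const int*>(base_); }
    const int* end() const { return begin() + size(); }

private:
    void* base_ = nullptr;
    std::size_t bytes_ = 0;
};

// Merge two sorted binary int files into a vector held in memory.
std::vector<int> mergeMapped(const char* pathA, const char* pathB) {
    MappedInts a(pathA), b(pathB);
    std::vector<int> out;
    out.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}
```

For inputs too large to hold the result in memory, the same `std::merge` call could write through an output iterator into a file instead of into a vector.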

score 2 · Accepted Answer

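A program like this should be I/O bound, meaning it should spend at least 80% of its time waiting for a buffer read or write to complete, and if the buffers are reasonably large it should keep the disk heads busy. That is exactly what you want.

Don't assume it is I/O bound without evidence. One way to prove it is to take several stack shots. If it is, most of the samples will show the program waiting for I/O to complete.

It may well not be I/O bound, in which case some of the samples will show other things going on that you never expected. If so, you then know what to fix to speed it up. For example, I have seen code like this spend far more time than necessary in the merge loop itself: testing for end of file, fetching the data to compare, and so on.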

score 0 · Accepted Answer

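Unless there is something very special about your data, you are unlikely to improve on the buffering built into the std::fstream object.

The std::fstream objects are designed to be very efficient for general-purpose file access. Reading 4 bytes at a time does not sound like you are doing anything unusual. You can always profile your code to see where the time is actually being spent.

Maybe if you share the code with us we could spot some major inefficiencies.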
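
On the question of changing the buffer size: the stream's file buffer can be handed a larger array through `rdbuf()->pubsetbuf()`. What that call actually does is implementation-defined, and in common implementations it only has an effect when made before the file is opened, so treat the following as a sketch to profile with, not as a guaranteed speed-up. `openWithBuffer` is a hypothetical helper name.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: ask the stream's filebuf to use `buffer` as its
// internal buffer, then open `path` for binary reading.
// `buffer` must outlive `in`. Whether the request is honoured is
// implementation-defined, so measure before relying on it.
bool openWithBuffer(std::ifstream& in, const std::string& path,
                    std::vector<char>& buffer) {
    assert(!buffer.empty());
    in.rdbuf()->pubsetbuf(buffer.data(),
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path, std::ios::binary);
    return in.is_open();
}
```

A caller would declare the buffer before the stream, for example `std::vector<char> buf(1 << 20); std::ifstream in; openWithBuffer(in, "f1.bin", buf);`, so that the buffer is destroyed after the stream. The file name here is only a placeholder.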

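Edit:

I don't like your algorithm. Seeking back and forth on a stream can be hard on it, especially when the number straddles a buffer boundary. I would read only one number per loop iteration.

Try this:
Note: this is not optimal (it assumes the numbers are stored as text, whereas yours look binary), but I'm sure you can use it as a starting point.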

#include <fstream>
#include <iostream>

// Return the current val (that was the smaller value)
// and replace it with the next value in the stream.
int getNext(int& val, std::istream& str)
{
    int result = val;
    str >> val;

    return result;
}

int main()
{
    std::ifstream   f1("f1.txt");
    std::ifstream   f2("f2.txt");
    std::ofstream   re("result");

    int v1;
    int v2;

    f1 >> v1;
    f2 >> v2;

    // While there are values in both stream
    // Output one value and replace it using getNext()
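    while(f1 && f2)
    {
        // The extra parentheses matter: << binds tighter than ?:,
        // so without them the bool (v1 < v2) would be written instead.
        re << ((v1 < v2) ? getNext(v1, f1) : getNext(v2, f2)) << ' ';
    }
    // At this point one (or both) stream(s) is(are) empty.
    // So dump the other stream.
    for(;f1;f1 >> v1)
    {
        // Note if the stream is at the end it will
        // never enter the loop
        re << v1 << ' ';
    }
    for(;f2;f2 >> v2)
    {
        re << v2 << ' ';
    }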
}

score 0 · Accepted Answer

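You can simply use ifstream's read function to read large blocks.

http://www.cplusplus.com/reference/iostream/istream/read/

The second parameter is the number of bytes. In your case you should make it a multiple of 4, perhaps 4096? :)

Just read one block at a time and process it.

As martin-york said, this may have no beneficial effect on your performance, but try it and find out.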
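
A minimal sketch of that block-at-a-time approach, assuming the files hold raw native-endian `int` values. `ChunkedIntReader`, `mergeBinary` and `mergeInMemory` are hypothetical names introduced here. The reader offers `peek()`, so the merge loop needs no `seek_back`, and once one input is exhausted the rest of the other is copied straight through. Any trailing bytes that do not make up a whole `int` are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: pulls raw ints from a binary stream, one chunk at a
// time, using istream::read and gcount.
class ChunkedIntReader {
public:
    ChunkedIntReader(std::istream& in, std::size_t chunkInts)
        : in_(in), buf_(chunkInts) {
        assert(chunkInts > 0);
        refill();
    }
    bool empty() const { return pos_ == count_; }
    int peek() const { return buf_[pos_]; }  // precondition: !empty()
    int pop() {                              // precondition: !empty()
        int v = buf_[pos_++];
        if (pos_ == count_) refill();
        return v;
    }

private:
    void refill() {
        in_.read(reinterpret_cast<char*>(buf_.data()),
                 static_cast<std::streamsize>(buf_.size() * sizeof(int)));
        // gcount() reports how many bytes the last read actually delivered;
        // it is 0 once the stream has already hit end of file.
        count_ = static_cast<std::size_t>(in_.gcount()) / sizeof(int);
        pos_ = 0;
    }
    std::istream& in_;
    std::vector<int> buf_;
    std::size_t pos_ = 0;
    std::size_t count_ = 0;
};

// Merge two sorted binary int streams into `out`, reading chunkInts values
// per read call. On ties the value from `a` is written first.
void mergeBinary(std::istream& a, std::istream& b, std::ostream& out,
                 std::size_t chunkInts) {
    ChunkedIntReader ra(a, chunkInts), rb(b, chunkInts);
    auto put = [&out](int v) {
        out.write(reinterpret_cast<const char*>(&v), sizeof v);
    };
    while (!ra.empty() && !rb.empty())
        put(rb.peek() < ra.peek() ? rb.pop() : ra.pop());
    while (!ra.empty()) put(ra.pop());
    while (!rb.empty()) put(rb.pop());
}

// Usage example on in-memory streams; with real files, pass ifstream and
// ofstream objects opened with std::ios::binary instead.
std::vector<int> mergeInMemory(const std::vector<int>& a,
                               const std::vector<int>& b,
                               std::size_t chunkInts) {
    auto toBytes = [](const std::vector<int>& v) {
        std::string s;
        for (int x : v) s.append(reinterpret_cast<const char*>(&x), sizeof x);
        return s;
    };
    std::istringstream ia(toBytes(a)), ib(toBytes(b));
    std::ostringstream out;
    mergeBinary(ia, ib, out, chunkInts);

    std::istringstream merged(out.str());
    ChunkedIntReader reader(merged, chunkInts);
    std::vector<int> result;
    while (!reader.empty()) result.push_back(reader.pop());
    return result;
}
```

Writing each value with its own `out.write` call keeps the sketch short. The output stream buffers those writes itself, but collecting results into an output chunk would be the symmetric optimization to try if profiling points at the write side.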

score 0 · Accepted Answer

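I think there is a good chance you can improve performance by reading in large chunks.

Try opening the file with ios::binary as an argument, then use istream::read to read the data.

If you need maximum performance, I would actually suggest skipping iostreams altogether and using cstdio instead. But I guess that is not what you want.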
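
For comparison, a cstdio version of chunked reading might look like the sketch below. It assumes raw native-endian `int` data and a `FILE*` that was opened in binary mode (for example with `std::fopen(path, "rb")`). `forEachInt` is a hypothetical name. Whether this beats the iostream version depends on the platform, so it is worth measuring both.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical helper: passes every int in a binary file to `handle`,
// fetching up to chunkInts values per fread call.
// Returns the number of ints processed.
template <typename Fn>
std::size_t forEachInt(std::FILE* f, std::size_t chunkInts, Fn handle) {
    assert(f != nullptr && chunkInts > 0);
    std::vector<int> buf(chunkInts);
    std::size_t total = 0;
    std::size_t got = 0;
    // fread returns the number of complete ints read; 0 means EOF or error.
    while ((got = std::fread(buf.data(), sizeof(int), buf.size(), f)) > 0) {
        for (std::size_t i = 0; i < got; ++i) handle(buf[i]);
        total += got;
    }
    return total;
}
```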
