c++ - 多流迭代器 C++

Question

我的程序的目的是打开一个长度相同的m行的文本文件n，逐列读取文件并打印每一列。

例如，对于这个文本文件

abcd
efgh 
jklm

我想打印

a e j
b f k
c g l
d h m

由于一行长度可以是 200 000 000，而列长度可以超过 10 000，所以我无法在矩阵中打开内存中的所有文件。

从理论上讲，我想要一个在空间上使用 O(m) 并在时间上使用 O(m*n) 的程序。

一开始，我不得不考虑这些解决方案：

如果我看到每一列的所有文件，复杂度是 O(m*n²)，
如果我使用 seekg 和一个位置数组并从一个位置跳到另一个位置，复杂度是 O(m n log(n))。

最后一点，对于一些服务器问题，我只需要使用 STL。

我的最后一个想法是创建一个文件的迭代器数组，并在每行的开头初始化这些迭代器。之后，要查看下一列，我只需要增加每个迭代器。这是我的代码

ifstream str2;
str2.open ("Input/test.data", ifstream::in);

int nbline = 3;
int nbcolumn = 4;
int x = 0;

istreambuf_iterator<char> istart (str2);
istreambuf_iterator<char> iend ;

istreambuf_iterator<char>* iarray;
iarray = new istreambuf_iterator<char>[nbline];


while (istart != iend){
    if (x % nbcolumn == 0){
        iarray[x/nbcolumn] = istart;
    }
    istart++;
    x++;
}

for (int j = 0; j<nbcolumn;j++){
    for (int i = 0; i<nbline;i++){
        cout  << *iarray[i] << "\t";
        iarray[i]++;
    }
    cout << endl;
}

可悲的是，它不起作用，我有这个东西作为输出

a       e       f       
�       �       �       
�       �       �       
�       �       �

我认为问题在于迭代器数组iarray并不独立于istart，我该怎么做？

score 6 · Accepted Answer

您可以将任务分成块，然后处理每个块，然后再继续下一个。

您需要为每一行提供一个缓冲区（这越大，性能越好）以及该行的搜索位置。您可能还需要通过文件进行初始传递以获得每行的正确偏移量。

将 B 字节读入每行的缓冲区（tellg用于保存每行中的位置），然后循环遍历这些字节并生成输出。返回并从每一行读取下一个 B 字节（seekg用于预先设置文件位置，并tellg在之后记住它）并生成输出。重复直到你完成，小心最后一块（或小输入）不要超过行尾。

使用您的示例，您需要跟踪 3 行。使用大小为 2 的 B，您将读入ab、ef和jk3 个缓冲区。循环遍历您要输出的那些aej和bfk. 返回并阅读下一个块：cd、gh和lm。这给出cgl和dhm作为输出。

score 5 · Accepted Answer

我会这样做：

打开源文件。
测量线条尺寸
测量行数（文件大小/（行大小 + EOL 大小））。注意 EOL 可以是 2 个字节。
计算结果文件大小。打开结果文件并强制它具有所需的大小，以便稍后您可以查找文件的任何部分。
峰值一些大小的正方形，这是内存可管理的。例如 1024x1024
现在您应该加载矩阵的正方形部分。1024 行构成 1024 行的元素。
转置正方形
通过寻找您正在编写的行的每个部分的正确列将其写入目标文件。（您可以通过转置一列然后将其写入一行来减少前一点的内存消耗，而不是一次转置整个正方形）
在整个文件矩阵上迭代平方

海事组织你不能做得更好。最关键的是如何选择正方形的大小。推荐2的大功率。

score 1 · Accepted Answer

如果您想使用多个std::istreambuf_iterators 来执行此操作，那么您将需要多个 s 来执行此操作fstreams，否则当您迭代一个（即istart++）将影响该迭代器的所有迭代器时，fstream这意味着下次您迭代一个（即*iarray[i]++）时，您将跳过一个字符。这在参考文献中有更清楚的解释。考虑这个片段：

std::ifstream str;
str.open("test.data", std::ifstream::in);

std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2 (str);

std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;
i1++;
std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;
i2++;
std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;

这将输出

i1 - a   i2 - a
i1 - b   i2 - a
i1 - b   i2 - c

流中i2似乎“跳过”b的地方。即使您稍后分配第二个迭代器，即

std::ifstream str;
str.open("test.data", std::ifstream::in);

std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2;
std::istreambuf_iterator<char> iend;

int x = 0;
while (i1 != iend) {
    if (x % 4 == 0) {
        i2 = i1;
        break;
    }
    x++;
    i1++;
}

std::cout << *i1 << " " << *i2 << std::endl;
i1++;
std::cout << *i1 << " " << *i2 << std::endl;
i2++;
std::cout << *i1 << " " << *i2 << std::endl;

输出保持不变 -

i1 - a   i2 - a
i1 - b   i2 - a
i1 - b   i2 - c

为什么？

因为在任何一种情况下，两个迭代器都作用于同一个流对象，并且每次迭代一个它都会从流中删除一个字符。在有问题的代码中，每个迭代器 ( istart, iarray[i]) 都作用于同一个流对象，因此其中一个迭代器的每次迭代都会char从流中删除 a。然后，输出很快就会成为未定义行为的结果，因为迭代超出流的末尾是未定义的（并且由于迭代器一起迭代，因此您可以快速到达它）。

如果您想按照大纲的方式执行此操作，您只需要多个fstream对象，例如

#include <fstream>
#include <string>
#include <iostream>


int main(int argn, char** argv) {
    std::ifstream str2;
    str2.open ("test.data", std::ifstream::in);

    int nbline = 3;
    int nbcolumn = 4;
    int x = 0;

    std::istreambuf_iterator<char> istart (str2);
    std::istreambuf_iterator<char> iend ;

    std::ifstream* streams = new std::ifstream[nbline];
    for (int ii = 0; ii < nbline; ii++) {
        streams[ii].open("test.data", std::ifstream::in);
    }
    std::istreambuf_iterator<char>* iarray = new std::istreambuf_iterator<char>[nbline];
    for (int ii = 0; ii < nbline; ii ++) {
        iarray[ii] = std::istreambuf_iterator<char> (streams[ii]);
    }

    int idx = 0;
    while (istart != iend) {
        if (x % nbcolumn == 0) {
            std::advance(iarray[x/nbcolumn], (nbcolumn+1)*idx);
            idx++;
        }
        x++;
        istart++;
    }

    for (int ii = 0; ii < nbcolumn; ii ++) {
        for (int jj = 0; jj < nbline; jj ++) {
            std::cout << *iarray[jj]++ << "\t";
        }
        std::cout << std::endl;
    }
}

这会产生您期望的输出，

a       e       j
b       f       k
c       g       l
d       h       m

相对于其他建议的方法，我无法评论此方法的速度，但这就是您使用此方法执行您所要求的操作的方式。

score 0 · Accepted Answer

您不能使用 istreambuf_iterator 两次，它只能使用一次。无论如何希望下面的代码可以帮助你

让我先解释一下我要做什么；您知道按顺序执行文件读取会快得多。我在那里做的是缓冲读取。假设在您的示例中，我正在缓冲两行，因此我必须分配 6 个字节的缓冲区并用搜索填充它；每次读取将读取两个字节，因为我们持有两行。这可以优化，但如果您在立即阅读时打印出第一个字符，您可以仅通过使用 3 个字节来缓冲两行，而在示例中仅通过缓冲 6 个字节来缓冲三行。无论如何，我给你它的非优化版本。

再次让我提醒您，您不能使用 istreambuf_iterator 两次：如何在 C++ 中两次在 ifstream 上使用迭代器？

如果您必须使用迭代器，您可以实现可以在文件上查找和读取的迭代器；虽然可能真的很乱,,,

#include <iostream>
#include <fstream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <algorithm>

std::vector<std::size_t> getPositions(std::ifstream& str2, int &numcolumns) {
    std::vector<std::size_t> iarray;

    iarray.push_back(0); // Add first iterator

    bool newlinereached = false;
    int tmpcol = 0;
    int currentLine = 0;
    char currentChar = 0;
    char previosChar = 0;

    numcolumns = -1;

    for (str2.seekg(0, std::ios_base::beg); !str2.eof(); previosChar = currentChar) {
        const std::size_t currentPosition = str2.tellg();
        str2.read(&currentChar, 1);
        if (newlinereached) {
            if (currentChar == '\r') {
                // Always error but skip for now :)
                continue;
            }
            else if (currentChar == '\n') {
                // ERROR CONDITION WHEN if (numcolumns < 0) or previosChar == '\n'
                continue;
            }
            else if (tmpcol == 0) {
                throw std::runtime_error((std::stringstream() << "Line " << currentLine << " is empty").str());
            }
            else {
                if (numcolumns < 0) {
                    // We just found first column size
                    numcolumns = tmpcol;
                    iarray.reserve(numcolumns);
                }
                else if (tmpcol != numcolumns) {
                    throw std::runtime_error((std::stringstream() << "Line " << currentLine
                        << " have incosistend number of columns it should have been " << numcolumns).str());
                }

                iarray.push_back(currentPosition);
                tmpcol = 1;
                newlinereached = false;
            }
        }
        else if (currentChar == '\r' || currentChar == '\n') {
            newlinereached = true;
            ++currentLine;
        }
        else {
            tmpcol++;
        }
    }

    if (currentChar == 0) {
        throw std::runtime_error((std::stringstream() << "Line " << currentLine
            << " contains 'null' character " << numcolumns).str());
    }

    str2.clear(); // Restart 

    return iarray;
}

int main() {
    using namespace std;

    ifstream str2;
    str2.open("Text.txt", ifstream::in);
    if (!str2.is_open()) {
        cerr << "Failed to open the file" << endl;
        return 1;
    }

    int numinputcolumns = -1;

    std::vector<std::size_t> iarray =
        getPositions(str2, numinputcolumns); // S(N)

    const std::size_t numinputrows = iarray.size();

    std::vector<char> buffer;
    const int numlinestobuffer = std::min(2, numinputcolumns); // 1 For no buffer

    buffer.resize(numinputrows * numlinestobuffer); // S(N)

    const std::size_t bufferReadMax = buffer.size();


    for (int j = 0; j < numinputcolumns; j += numlinestobuffer)
    {
        // Seek fill buffer. Needed because sequental reads are much faster even on SSD
        // Still can be optimized more: We can buffer n+1 rows as we can discard current row read
        std::size_t nread = std::min(numlinestobuffer, numinputcolumns - j);
        for (int i = 0; i < numinputrows; ++i)
        {
            str2.seekg(iarray[i], ios_base::beg);
            size_t p = str2.tellg();
            str2.read(&buffer[i * numlinestobuffer], nread);
            iarray[i] += nread;
        }

        // Print the buffer
        for (int b = 0; b < nread; ++b)
        {
            for (int k = 0; k < numinputrows; ++k) {
                std::cout << buffer[b + k * numlinestobuffer] << '\t';
            }
            std::cout << std::endl;
        }
    }

    return 0;
}

score 0 · Accepted Answer

一般注意事项

如果迭代器数组可以工作，它必须遍历内存（另见回答威廉米勒），或者它应该在哪里迭代？

权衡是：

解析直到第一个输出行完成，而不是所有其他输出行
- 慢，几乎没有使用内存
完全填充一个矩阵并输出转置矩阵
- 大量内存要使用
为所有输出线创建一个位置数组，搜索所有位置
- 快速、合理的内存使用
方法 2 和 3 的智能组合。
- 使用给定的内存（例如，假设 8 GB RAM）尽可能缩短时间。

权衡的解决方案 4

需要更多关于边界条件的知识。

解决方案 4 的概念取决于许多未知条件

输入数据的特点是什么？
- 一个矩阵的 200TByte 还是多个矩阵的 200TByte？
- 为多少？
- 列和行之间的比率的最坏情况是什么？
- 它只是单个字符还是可以是单词？
- 如果只是单个字符，是否保证每一行的内存大小相同？
- 如果不是，如何识别新行？
有多少可用 RAM 内存？
目标计算机填满整个空闲 RAM 内存的速度有多快？
可以接受的最长时间是多少？

对原程序的问题分析

问题也是：为什么它不起作用。

该程序 ...

#include    <fstream>
#include    <string>
#include    <iostream>

int main(int argc, char* argv[]) {
    std::ifstream str2;
    str2.open ("test.data", std::ifstream::in);

    std::istreambuf_iterator<char> istart(str2);
    std::istreambuf_iterator<char> iend;
    std::istreambuf_iterator<char> iarray1 = istart;

    istart++;
    istart++;
    istart++;
    istart++;
    std::istreambuf_iterator<char> iarray2 = istart;

    std::cout  << *(iarray1);
    std::cout << std::endl;
    std::cout  << *(iarray2);
    std::cout << std::endl;
    return 0;
}

...读取 test.data 包含...

abcdefghjklm

...程序打印...

e
e

因此，循环...

while (istart != iend){
    if (x % nbcolumn == 0){
        iarray[x/nbcolumn] = istart;
    }
    istart++;
    x++;
}

...不会导致预期的结果，因为迭代器以不同的方式工作，并且每次调用...

iarray[i]++;

...同时操作所有迭代器。

权衡的解决方案3

出路是什么？根据权衡 #3 创建代码。

该程序...

#include    <iostream>
#include    <ios>
#include    <string>
#include    <fstream>

int main(int argc, char* argv[]) {
    int nbline = 3;
    int nbcolumn = 4;
    std::ifstream   fsIn;
    std::streampos  posLine[nbline];
    std::streampos  posTemp;

    fsIn.open("test.data", std::ifstream::in);
    for ( int i = 0; i < nbline; i++) {
        posLine[i] = posTemp;
        posTemp += nbcolumn;
    }

    for ( int j = 0; j < nbcolumn; j++) {
        for ( int i = 0; i < nbline; i++) {
            fsIn.seekg(posLine[i]);
            std::cout  << char(fsIn.get()) << " ";
            posLine[i] = fsIn.tellg();
        }
        std::cout << std::endl;
    }
    return 0;
}

...创建输出：

a e j
b f k
c g l
d h m

c++ - 多流迭代器 C++

5 回答 5

为什么？

Related

Reference