c++ - I think the STL is causing my application's memory usage to triple

Question

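I am feeding a 200 MB file into my application, and for some very strange reason its memory usage climbs above 600 MB. I have tried vector and deque, as well as std::string and char *, to no avail. I need my application's memory usage to be roughly the same as the size of the file I am reading; any suggestions would be extremely helpful. Is there a bug that causes this much memory consumption? Can you pinpoint the problem, or should I rewrite the whole thing?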

Windows Vista SP1 x64, Microsoft Visual Studio 2008 SP1, 32Bit Release Version, Intel CPU

The entire application so far:

#include <string>
#include <vector>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <time.h>



static unsigned int getFileSize (const char *filename)
{
    std::ifstream fs;
    fs.open (filename, std::ios::binary);
    fs.seekg(0, std::ios::beg);
    const std::ios::pos_type start_pos = fs.tellg();
    fs.seekg(0, std::ios::end);
    const std::ios::pos_type end_pos = fs.tellg();
    const unsigned int ret_filesize (static_cast<unsigned int>(end_pos - start_pos));
    fs.close();
    return ret_filesize;
}
void str2Vec (std::string &str, std::vector<std::string> &vec)
{
    int newlineLastIndex(0);
    for (int loopVar01 = str.size(); loopVar01 > 0; loopVar01--)
    {
        if (str[loopVar01]=='\n')
        {
            newlineLastIndex = loopVar01;
            break;
        }
    }
    int remainder(str.size()-newlineLastIndex);

    std::vector<int> indexVec;
    indexVec.push_back(0);
    for (unsigned int lpVar02 = 0; lpVar02 < (str.size()-remainder); lpVar02++)
    {
        if (str[lpVar02] == '\n')
        {
            indexVec.push_back(lpVar02);
        }
    }
    int memSize(0);
    for (int lpVar03 = 0; lpVar03 < (indexVec.size()-1); lpVar03++)
    {
        memSize = indexVec[(lpVar03+1)] - indexVec[lpVar03];
        std::string tempStr (memSize,'0');
        memcpy(&tempStr[0],&str[indexVec[lpVar03]],memSize);
        vec.push_back(tempStr);
    }
}
void readFile(const std::string &fileName, std::vector<std::string> &vec)
{
    static unsigned int fileSize = getFileSize(fileName.c_str());
    static std::ifstream fileStream;
    fileStream.open (fileName.c_str(),std::ios::binary);
    fileStream.clear();
    fileStream.seekg (0, std::ios::beg);
    const int chunks(1000); 
    int singleChunk(fileSize/chunks);
    int remainder = fileSize - (singleChunk * chunks);
    std::string fileStr (singleChunk, '0');
    int fileIndex(0);
    for (int lpVar01 = 0; lpVar01 < chunks; lpVar01++)
    {
        fileStream.read(&fileStr[0], singleChunk);
        str2Vec(fileStr, vec);
    }
    std::string remainderStr(remainder, '0');
    fileStream.read(&remainderStr[0], remainder);
    str2Vec(fileStr, vec);      
}
int main (int argc, char *argv[])
{   
        std::vector<std::string> vec;
        std::string inFile(argv[1]);
        readFile(inFile, vec);
}

score 5 · Accepted Answer

Your memory is being fragmented.

Try something like this:

  HANDLE heaps[1025];
  DWORD nheaps = GetProcessHeaps((sizeof(heaps) / sizeof(HANDLE)) - 1, heaps);

  for (DWORD i = 0; i < nheaps; ++i) 
  {
    ULONG  HeapFragValue = 2;  // 2 enables the low-fragmentation heap (LFH)
    HeapSetInformation(heaps[i],
                       HeapCompatibilityInformation,
                       &HeapFragValue,
                       sizeof(HeapFragValue));
  }

score 3 · Accepted Answer

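If I'm reading this right, the biggest issue is that the algorithm automatically doubles the memory it needs.

In readFile(), you read the whole file into a series of "singleChunk"-sized strings (chunks), and then in the last loop of str2Vec() you allocate a temporary string for every newline-separated segment of the chunk. So you are doubling the memory right there.

You also have a speed problem: str2Vec makes two passes over each chunk to find all the newlines. There is no reason you can't do that in a single pass.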
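The single-pass idea can be sketched roughly as follows. This is only an illustration, assuming lines are separated by '\n' and that an incomplete trailing line should be carried over into the next chunk; `appendLines` and `carry` are hypothetical names, not part of the code in the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `chunk` into lines in one pass and appends them to `out`.
// Text after the last '\n' is kept in `carry` so it can be joined with
// the start of the next chunk. (Hypothetical helper for illustration.)
void appendLines(const std::string& chunk, std::string& carry,
                 std::vector<std::string>& out)
{
    std::size_t lineStart = 0;
    for (std::size_t i = 0; i < chunk.size(); ++i)
    {
        if (chunk[i] == '\n')
        {
            // Complete line: earlier leftover plus this segment, without '\n'.
            carry.append(chunk, lineStart, i - lineStart);
            out.push_back(carry);
            carry.clear();
            lineStart = i + 1;
        }
    }
    // Keep the unfinished tail for the next call.
    carry.append(chunk, lineStart, chunk.size() - lineStart);
}
```

Calling it once per chunk with the same `carry` string means a line that straddles two chunks is stored once, as a whole, rather than being cut at the chunk boundary.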

score 2 · Accepted Answer

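STL containers exist to abstract away memory operations. If you have a hard memory limit, you can't really afford to abstract them away.

I would recommend using mmap() (or, on Windows, MapViewOfFile()) to read the file in.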

score 2 · Accepted Answer

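Another thing you could do is load the entire file into a single block of memory. Then build a vector of pointers to the first character of each line, replacing each newline with \0 as you go so that every line is null-terminated. (This assumes, of course, that your strings are not supposed to contain \0.)

It isn't necessarily as convenient as having a vector of strings, but a vector of const char* may be "just as good".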

score 1 · Accepted Answer

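In readFile you have at least two copies of the file: the ifstream, and the data copied into your std::vector. As long as you keep the file open and copy it the way you do, it will be hard to get the total memory footprint below twice the file size.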

score 1 · Accepted Answer

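Don't use std::list. It needs more memory than a vector.
A vector does what is called "doubling": when it runs out of space, it allocates a larger block, typically about twice the current size (the exact growth factor is implementation-defined). To avoid this you can use the std::vector::reserve() method, and if I remember correctly you can check the result with std::vector::capacity() (note that capacity() >= size()).

Since the number of lines is not known until execution, I don't see a simple algorithm that avoids the "doubling" problem. Following the comment from slavy13.myopenid.com, the solution is to move the data into another vector, reserved to the right size, once reading is finished (the related question is "how to shrink a std::vector?").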
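One common way to do that final move in pre-C++11 code is the copy-and-swap idiom, sketched below. `shrinkToFit` is a hypothetical name. The standard does not promise that the copy's capacity equals its size, although common implementations allocate only what the elements need.

```cpp
#include <string>
#include <vector>

// Copy-and-swap "shrink to fit": the temporary copy is sized for the
// current elements, then swapped with the over-allocated original.
// The old, larger buffer is freed when the temporary is destroyed.
// (Hypothetical helper for illustration.)
void shrinkToFit(std::vector<std::string>& v)
{
    std::vector<std::string>(v).swap(v);
}
```

Both copies exist for a moment, so peak memory rises briefly during the call. On C++11 and later, `v.shrink_to_fit()` makes the same (non-binding) request without the explicit copy.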

score 1 · Accepted Answer

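First of all, how are you determining the memory usage? Task Manager is not a suitable tool for this, because what it shows is not actually memory usage.

Second, apart from your (for some reason?) static variables, the only data that is not freed by the time you finish reading the file is the vector. So test its capacity, and test the capacity of each string it contains. Find out how much memory each of them uses. You have the tools to determine where the memory is going.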
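The capacity check could be sketched like this. It gives only a rough estimate: it ignores allocator bookkeeping per block, and on implementations with a small-string optimization, short strings live inside the std::string object, so their capacity is counted twice. `approxBytes` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Rough estimate of the bytes held by a vector of strings: the vector's
// own element array plus each string's reported character capacity.
// (Hypothetical helper for illustration.)
std::size_t approxBytes(const std::vector<std::string>& v)
{
    std::size_t total = v.capacity() * sizeof(std::string);
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i].capacity();
    return total;
}
```

Comparing this figure with the file size shows how much of the footprint is per-string overhead rather than text.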

score 1 · Accepted Answer

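I think your attempt to write your own buffering strategy is misguided.

Streams already implement a very good buffering strategy. If you think you need a larger buffer, you can install your own buffer into the stream without any extra code to manage it.

Here is what I came up with. NB: tested with a text version of the King James Bible that I found on the net.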

#include <string>
#include <vector>
#include <list>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <iostream>

class Line: public std::string
{
};

std::istream& operator>>(std::istream& in,Line& line)
{
    // Relatively efficient way to copy a line into a string.
    return std::getline(in,line);
}
std::ostream& operator<<(std::ostream& out,Line const& line)
{
    return out << static_cast<std::string const&>(line) << "\n";
}

void readLinesFromStream(std::istream& stream,std::vector<Line>& lines)
{
    /*
     * Read into a list as this is flexible in memory usage and will not
     * allocate huge chunks of un-required space.
     *
     * Even with huge files the space for list will be insignificant
     * compared to the size of the data.
     *
     * This then allows us to reserve the correct size of the vector
     * Thus avoiding huge memory chunks being prematurely allocated that
     * are not required. It also prevents the internal structure from
     * being copied every time the container is re-sized.
     */
    std::list<Line>     data;
    std::copy(  std::istream_iterator<Line>(stream),
                std::istream_iterator<Line>(),
                std::inserter(data,data.end())
             );

    /*
     * Reserve the correct size in the vector.
     * then copy out of the list into the vector
     */
    lines.reserve(data.size());
    std::copy(  data.begin(),
                data.end(),
                std::back_inserter(lines)
             );
}

void readLinesFromFile(std::string const& name,std::vector<Line>& lines)
{
    /*
     * Set up the file stream and override the default buffer used by the stream.
     * Make it big because we think the istream buffer is insufficient!!!!
     */
    std::ifstream       file;
    std::vector<char>   buffer(10000);
    file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());

    file.open(name.c_str());
    readLinesFromStream(file,lines);
}


int main(int argc,char* argv[])
{
    std::vector<Line>   lines;
    readLinesFromFile(argv[1],lines);

    // Un-comment if your file is larger than 1100 lines.

    // I tested with a copy of the King James bible. 
    // std::cout << "Lines: " << lines.size() << "\n";
    // std::copy(lines.begin() + 1000,lines.begin() + 1100,std::ostream_iterator<Line>(std::cout));
}

score 0 · Accepted Answer

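Try using a list instead of a vector. Vectors are (almost always) contiguous in memory.

Granted, you have strings, which are (almost always) copy-on-write and reference-counted, which should make this less of an issue, but it might help.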

score 0 · Accepted Answer

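I don't know whether this is relevant, since I don't really know what your file looks like.

But you should be aware that std::string can carry considerable space overhead when storing very short strings. And if you are individually new-ing up a char* for each very short string, you will also see all the per-allocation block overhead.

How many strings are you putting into that vector, and what is their average length?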

score 0 · Accepted Answer

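Perhaps you should explain why you need to read the entire file into memory; I suspect there may be a way to do what you want without reading the whole file into memory at once. If you really do need this, take a look at memory-mapped files, which will probably be more efficient than writing the equivalent yourself. Your internal data structures can then use offsets into the file. By the way, be sure to check whether you need to deal with character encodings.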

score 0 · Accepted Answer

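You should know that because you declared fileStream as static, it never goes out of scope, which means the file is not closed until the very end of execution. That certainly ties up some memory. You could close it explicitly before the last call to str2Vec to try to help the situation.

Also, you open and close the same file more than once; just open it once and pass it by reference (resetting its state if needed). Though I imagine you could achieve what you need with a single pass over the file.

Heck, I doubt you really need to know the file size at all the way you do here; you can just keep reading "chunk"-sized amounts until you get a short read (at which point you are done).

Why don't you explain the goal of the code? I feel there may be a simpler solution.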

score 0 · Accepted Answer

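I have found that the best way to handle lines is to memory-map the file read-only. Don't bother writing \0 over the \n; instead use pairs of const char*s, such as std::pair<const char*, const char*>, or a const char* paired with a count. If you need to edit the lines, a good approach is an object that can store either a pointer pair or a std::string holding the modified line.

As for saving memory with STL vectors or deques, a good technique is to let the container double until you are done adding. Then resize it to its actual size, which should release the unused memory back to the heap allocator. The memory may still remain allocated to the program, though I wouldn't worry about that. Also, instead of taking the default size, start by getting the file size in bytes, divide it by your best guess at the average number of characters per line, and reserve that much space up front.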
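The up-front estimate could be as simple as the sketch below. `reserveForLines` is a hypothetical name, and the average line length is a guess supplied by the caller.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reserves room for roughly fileSize / avgLineLength lines before reading.
// A guess that is too low only means the vector grows again later.
// (Hypothetical helper for illustration.)
void reserveForLines(std::vector<std::string>& lines,
                     std::size_t fileSize, std::size_t avgLineLength)
{
    if (avgLineLength == 0) return;  // no usable estimate
    lines.reserve(fileSize / avgLineLength + 1);
}
```

For a 200 MB file and a guess of 50 characters per line, this reserves about four million slots in one allocation instead of growing through repeated reallocations.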

score -1 · Accepted Answer

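Growing a vector with push_back() causes memory fragmentation and inefficient memory use. I would try using a list, and only create a vector (if you need one) once you know exactly how many elements it needs.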
