c++ - Reading integers from a memory-mapped formatted file

Question

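I have memory-mapped a large formatted (text) file that contains one integer per line.

So I have a pointer to the first byte of that memory and a pointer to the last byte. I am trying to read all of those integers into an array as fast as possible. Initially I wrote a specialized std::streambuf class so that I could read from that memory with std::istream, but it seems to be relatively slow.

Do you have any suggestions on how to efficiently parse a string like "1231232\r\n123123\r\n123\r\n1231\r\n2387897..." into an array {1231232, 123123, 123, 1231, 2387897, ...}?

The number of integers in the file is not known in advance.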

score 1 · Accepted Answer

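This was a really interesting task for me, and a chance to learn more about C++.

Admittedly, the code is quite large and has a lot of error checking, but that only shows how many different things can go wrong during parsing.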

#include <ctype.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>   // exit, EXIT_FAILURE

#include <algorithm>  // std::for_each
#include <iterator>
#include <vector>
#include <string>

static void
die(const char *reason)
{
  fprintf(stderr, "aborted (%s)\n", reason);
  exit(EXIT_FAILURE);
}

template <class BytePtr>
static bool
read_uint(BytePtr *begin_ref, BytePtr end, unsigned int *out)
{
  const unsigned int MAX_DIV = UINT_MAX / 10;
  const unsigned int MAX_MOD = UINT_MAX % 10;

  BytePtr begin = *begin_ref;
  unsigned int n = 0;

  while (begin != end && '0' <= *begin && *begin <= '9') {
    unsigned digit = *begin - '0';
    if (n > MAX_DIV || (n == MAX_DIV && digit > MAX_MOD))
      die("unsigned overflow");
    n = 10 * n + digit;
    begin++;
  }

  if (begin == *begin_ref)
    return false;

  *begin_ref = begin;
  *out = n;
  return true;
}

template <class BytePtr, class IntConsumer>
void
parse_ints(BytePtr begin, BytePtr end, IntConsumer out)
{
  while (true) {
    while (begin != end && *begin == (unsigned char) *begin && isspace(*begin))
      begin++;
    if (begin == end)
      return;

    bool negative = *begin == '-';
    if (negative) {
      begin++;
      if (begin == end)
        die("minus at end of input");
    }

    unsigned int un;
    if (!read_uint(&begin, end, &un))
      die("no number found");

    if (!negative && un > INT_MAX)
      die("too large positive");
    if (negative && un > -((unsigned int)INT_MIN))
      die("too small negative");

    int n = negative ? -un : un;
    *out++ = n;
  }
}

static void
print(int x)
{
  printf("%d\n", x);
}

int
main()
{
  std::vector<int> result;
  std::string input("2147483647 -2147483648 0 00000 1 2 32767 4 -17 6");

  parse_ints(input.begin(), input.end(), back_inserter(result));

  std::for_each(result.begin(), result.end(), print);
  return 0;
}

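I tried hard not to invoke any kind of undefined behavior, which gets quite tricky when converting unsigned numbers to signed ones or when calling isspace on a value of unknown data type.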
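The two pitfalls mentioned above can be looked at in isolation. The following is a minimal sketch, not part of the answer's code: the helper names `is_space_byte` and `to_signed` are made up for illustration, and the second helper assumes a two's-complement `int` (guaranteed since C++20, and true on common platforms before that).

```cpp
#include <cassert>
#include <cctype>
#include <climits>

// isspace() has undefined behavior for negative arguments other than EOF,
// so the char is converted to unsigned char before the call.
inline bool is_space_byte(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Combine a magnitude and a sign flag into an int without signed overflow.
// Assumption: the caller has already range-checked the magnitude, as the
// answer's parse_ints does before converting.
inline int to_signed(unsigned int magnitude, bool negative)
{
    const unsigned int int_max = static_cast<unsigned int>(INT_MAX);
    if (!negative) {
        assert(magnitude <= int_max);
        return static_cast<int>(magnitude);
    }
    assert(magnitude <= int_max + 1u);
    if (magnitude == int_max + 1u)
        return INT_MIN;                  // +magnitude does not fit in int, so special-case it
    return -static_cast<int>(magnitude); // magnitude <= INT_MAX here, negation is safe
}
```

Negating after the cast, rather than negating the unsigned value and then converting, keeps every intermediate result inside the range of `int`.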

score 0 · Accepted Answer

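Since the file is memory-mapped, it would be very efficient to simply copy the characters into a small stack array and run atoi on them, writing the results into another integer array that sits on top of a second memory-mapped file. That way the page file is not used at all for these large buffers.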

open memory-mapped file for the output int buffer

declare small stack buffer of 20 chars
while not at end of char array
  while current char is not a line feed
    copy char to stack buffer
  end while
  null-terminate the stack buffer (dropping the trailing '\r')
  convert the stack buffer with atoi and store the result in the output buffer
  increment the output buffer pointer
end while

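Although this does not use a library, it has the advantage of keeping memory usage confined to the memory-mapped files, so the only temporary buffers are the one on the stack and whatever atoi uses internally. The output buffer can be discarded or saved to a file as needed.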
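The pseudocode above could be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: the function name `parse_lines_atoi` and the parameters `out` and `capacity` are hypothetical, the output is a plain caller-supplied array rather than a second mapped file, and lines longer than the stack buffer are silently truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Parses one integer per line from [begin, end) into out[0..capacity).
// Returns the number of integers written. Lines may end in "\n" or "\r\n".
std::size_t parse_lines_atoi(const char *begin, const char *end,
                             int *out, std::size_t capacity)
{
    assert(begin <= end);
    char buf[20];                        // small stack buffer, as in the pseudocode
    std::size_t count = 0;

    while (begin != end && count < capacity) {
        std::size_t len = 0;
        // Copy one line into the stack buffer, leaving room for the terminator.
        while (begin != end && *begin != '\n') {
            if (len < sizeof buf - 1)
                buf[len++] = *begin;
            ++begin;
        }
        if (begin != end)
            ++begin;                     // step over the '\n'
        if (len > 0 && buf[len - 1] == '\r')
            --len;                       // drop the '\r' of a "\r\n" pair
        buf[len] = '\0';
        if (len > 0)                     // ignore empty lines
            out[count++] = std::atoi(buf);
    }
    return count;
}
```

Note that atoi reports no errors and its behavior is undefined when the value does not fit in an int, so this variant only suits input that is known to be well-formed.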

score 0 · Accepted Answer

Note: this answer has been edited several times.

Read the memory line by line (based on link and link).

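#include <algorithm>
#include <istream>
#include <iterator>
#include <streambuf>
#include <string>
#include <vector>
#include <boost/lexical_cast.hpp>

class line 
{
   std::string data;
public:
   friend std::istream &operator>>(std::istream &is, line &l) 
   {
      std::getline(is, l.data);
      // Drop the '\r' left over from "\r\n" line endings; lexical_cast rejects it.
      if (!l.data.empty() && l.data[l.data.size() - 1] == '\r')
         l.data.erase(l.data.size() - 1);
      return is;
   }
   // const, because std::istream_iterator hands out const references
   operator std::string() const { return data; }
};

// setg() is a protected member of std::streambuf, so call it from a subclass.
struct membuf : std::streambuf
{
   membuf(char *ptr, std::size_t size) { setg(ptr, ptr, ptr + size); }
};

membuf osrb(ptr, size);   // ptr and size describe the mapped block
std::istream istr(&osrb);

std::vector<int> ints;

std::istream_iterator<line> begin(istr);
std::istream_iterator<line> end;
std::transform(begin, end, std::back_inserter(ints), &boost::lexical_cast<int, std::string>);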

score 0 · Accepted Answer

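#include <cctype>
#include <cstdlib>
#include <vector>

std::vector<int> array;
char * p = ...;   // start of memory mapped block
char * end = ...; // one past the last byte of the block
while (p != end)
{
    array.push_back(static_cast<int>(strtol(p, &p, 10)));
    // skip line endings and anything else that is not a digit
    while (p != end && !isdigit(static_cast<unsigned char>(*p)))
        ++p;
}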

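This code is somewhat unsafe because there is no guarantee that strtol will stop at the end of the memory-mapped block, but it is a start. It should be very fast even with additional checks added.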
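One way to get the bounds check that strtol lacks is std::from_chars, which takes an explicit end pointer and never reads past it. The following is a minimal sketch that assumes a C++17 compiler; the function name `parse_bounded` is hypothetical, and skipping out-of-range numbers instead of reporting them is just one possible policy.

```cpp
#include <cassert>
#include <charconv>
#include <system_error>
#include <vector>

// Parses every decimal integer in [p, end) without reading past `end`.
std::vector<int> parse_bounded(const char *p, const char *end)
{
    assert(p <= end);
    std::vector<int> values;
    while (p != end) {
        int value = 0;
        std::from_chars_result r = std::from_chars(p, end, value);
        if (r.ec == std::errc()) {
            values.push_back(value);
            p = r.ptr;                   // continue right after the number
        } else if (r.ec == std::errc::result_out_of_range) {
            p = r.ptr;                   // digits were consumed but do not fit in int: skip them
        } else {
            ++p;                         // no number starts here (e.g. '\r' or '\n'): skip one byte
        }
    }
    return values;
}
```

Unlike strtol, std::from_chars does not skip leading whitespace and ignores the locale, which is why the loop advances over separator bytes itself.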
