2

我有超过 100,000 个以下格式的 csv 文件:

1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,

我需要的只是字段 10 和字段 17,字段 10 是计数器,表示从字段 17 开始存储了多少整数,即我需要的是:

6,13,4,7,8,18,20
5,4,7,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
4,4,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
12,13,4,7,8,18,20,21,25,27,29,31,32

需要读取的最大整数数是 28。我可以通过 C++ 中的 Getline 轻松实现这一点,但是,根据我以前的经验,因为我需要处理超过 100,000 个这样的文件,每个文件可能有 300,000~400,000 行这样的行。因此,使用 Getline 读取数据并构建向量> 对我来说可能存在严重的性能问题。我尝试使用 fscanf 来实现这一点:

while (!feof(stream)){
 fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d",&MyCounter);
 fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d"); // skip to column 17
 for (int i=0;i<MyCounter;i++){
  fscanf(fstream,"%d",&MyIntArr[i]);
 }
 fscanf(fstream,"%*s"); // to finish the line
}

但是,这将多次调用 fscanf 并且还可能产生性能问题。有没有办法在 fscanf 的 1 次调用中读取可变数量的整数?或者我需要读入一个字符串然后 strsep/stoi 它?与 fscanf 相比,从性能角度来看,哪个更好?

4

4 回答 4

2

因此,每行最多有 43 个数字。即使是 64 位,每个数字也被限制为 21 位,因此 1024 字节对于一行可能的最大 946 字节来说已经足够了(只要没有空格)。

char line[1024];

while (fgets(line, sizeof(line), stdin) != NULL) {
    //...
}

跳转到所需列的辅助函数。

const char *find_nth_comma(const char *s, int n) {
    const char *p = s;
    if (p && n) while (*p) {
        if (*p == ',') {
            if (--n == 0) break;
        }
        ++p;
    }
    return p;
}

因此,在您的循环中,跳到第 10 列以找到第一个感兴趣的数字,然后跳到第 17 列开始阅读其余数字。完成的循环如下所示:

while (fgets(line, sizeof(line), stdin) != NULL) {
    const char *p = find_nth_comma(line, 9);
    char *end;
    assert(p && *p);
    MyCounter = strtol(p+1, &end, 10);
    assert(*end == ',');
    p = find_nth_comma(end+1, 6);
    assert(p && *p);
    for (int i = 0; i < MyCounter; ++i, p = end) {
        MyIntArray[i] = strtol(p+1, &end, 10);
        assert((*end == ',') ||
               (i == MyCounter-1) &&
               (*end == '\0' || isspace(*end & 0xFF)));
    }
}

这种方法也适用于mmap解决方案。将fgets替换为指向文件中要处理的下一行的函数。find_nth_comma需要进行修改以检测行尾/文件尾,而不是依赖于 NUL 终止的字符串。strtol将使用再次检测行尾或文件尾的自定义函数进行更改。(此类更改的目的是删除任何需要复制数据的代码,这将成为一种mmap方法的动机。)

通过并行处理,可以同时解析文件的多个部分。但是,让不同的线程处理不同的文件可能就足够了,然后在处理完所有文件后整理结果。

于 2018-04-25T21:29:24.947 回答
2

最终我使用内存映射文件来解决我的问题(这个解决方案是我之前的问题的副产品,读取大 CSV 文件时的性能问题) 在 C++ 中读取大型 CSV 文件性能问题

由于我在 MS Windows 上工作,所以我使用 Stephan Brumme 的“便携式内存映射 C++ 类” http://create.stephan-brumme.com/portable-memory-mapping/ 因为我不需要处理文件> 2 GB,我的实现更简单。对于超过 2GB 的文件,请访问网络以了解如何处理。

请在下面找到我的一段代码:

// may tried RandomAccess/SequentialScan
MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);

// point to start of memory file
char* start = (char*)MemFile.getData();
// dummy in my case
char* tmpBuffer = start;

// looping counter
uint64_t i = 0;

// pre-allocate result vector
MyVector.resize(300000);

// Line counter
int LnCnt = 0;

//no. of field
int NumOfField=43;
//delimiter count, num of field + 1 since the leading and trailing delimiter are virtual
int DelimCnt=NoOfField+1;
//Delimiter position. May use new to allocate at run time
// or even use vector of integer
// This is to store the delimiter position in each line
// since the position is relative to start of file. if file is extremely
// large, may need to change from int to unsigner, long or even unsigned long long
static  int DelimPos[DelimCnt];

// Max number of field need to read usually equal to NumOfField, can be smaller, eg in my case, I only need 4 fields
// from first 15 field, in this case, can assign 15 to MaxFieldNeed
int MaxFieldNeed=NumOfField;
// keep track how many comma read each line
int DelimCounter=0;
// define field and line seperator
char FieldDelim=',';
char LineSep='\n';

// 1st field, "virtual Delimiter" position
DelimPos[CommaCounter]=-1
DelimCounter++;

// loop through the whole memory field, 1 and only once
for (i = 0; i < MemFile.size();i++)
{
  // grab all position of delimiter in each line
  if ((MemFile[i] == FieldDelim) && (DelimCounter<=MaxFieldNeed)){
    DelimPos[DelimCounter] = i;
    DelimCounter++;
  };

  // grab all values when end of line hit
  if (MemFile[i] == LineSep) {
    // no need to use if (DelimCounter==NumOfField) just assign anyway, waste a little bit
    // memory in integer array but gain performance 
    DelimPos[DelimCounter] = i;
    // I know exactly what the format is and what field(s) I want
    // a more general approach (as a CSV reader) may put all fields
    // into vector of vector of string
    // With *EFFORT* one may modify this piece of code so that it can parse
    // different format at run time eg similar to:
    // fscanf(fstream,"%d,%f....
    // also, this piece of code cannot handle complex CSV e.g.
    // Peter,28,157CM
    // John,26,167CM
    // "Mary,Brown",25,150CM
    MyVector.StrField = string(strat+DelimPos[0] + 1, strat+DelimPos[1] - 1);
    MyVector.IntField = strtol(strat+DelimPos[3] + 1,&tmpBuffer,10);
    MyVector.IntField2 = strtol(strat+DelimPos[8] + 1,&tmpBuffer,10);
    MyVector.FloatField = strtof(start + DelimPos[14] + 1,&tmpBuffer);
    // reset Delim counter each line
    DelimCounter=0
    // previous line seperator treat as first delimiter of next line
    DelimPos[DelimCounter] = i;
    DelimCounter++
    LnCnt++;
  }
}
MyVector.resize(LnCnt);
MyVector.shrink_to_fit();
MemFile.close();
};

我可以在里面编写任何我想要的代码:

  if (MemFile[i] == LineSep) {
  }

例如处理空字段,执行计算等。使用这段代码,我在 57 秒内处理了 2100 个文件(6.3 GB)!(我在其中编码了 CSV 格式,并且在我之前的案例中只获取了 4 个值)。稍后将更改此代码以处理此问题。感谢所有在这个问题上帮助我的人。

于 2018-04-26T14:55:31.987 回答
1

读取运行时确定的整数数量的最简单方法可能是指向较长格式字符串的右侧部分。换句话说,我们可以有一个包含 28个%d,说明符的格式字符串,但指向字符串结尾之前的第n个,并将该指针作为格式字符串传递给scanf().

作为一个简单的例子,考虑接受最多 6 个整数中的 3 个:

"%d,%d,%d,%d,%d,%d,"
          ^

箭头显示用作模式参数的字符串指针。


这是一个完整的示例;当使用gcc -O3. 更新输入字符串指针的机制稍微复杂一些,这在从文件流读取时显然不是必需的。我跳过了检查nfields <= 28,但这很容易添加。

char const *const input =
    "1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,\n";

#include <stdio.h>

#define SKIP_FIELD "%*[^,],"
#define DECIMAL_FIELD "%d,"

int read()
{
    int n;             /* bytes read - not needed for file or stdin */
    int sum = 0;       /* just to make sure results are used */

    for (char const *s = input;  *s;  ) {
        int nfields;
        int array[28];
        int m = sscanf(s,
                       /* field 0 is missing */
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       DECIMAL_FIELD /* field 10 */
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       "%n",
                       &nfields,
                       &n);
        if (m != 1) {
            return -1;
        }

        s += n;

        static const char fieldchars[] = DECIMAL_FIELD;
        static const size_t fieldsize = sizeof fieldchars - 1; /* ignore terminating null */

        static const char *const parse_entries =
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            "[^\n] ";
        const char *const line_parse = parse_entries + (28-nfields) * fieldsize;

        /* now read nfields (max 28) */
        m = sscanf(s,
                   line_parse,
                   &array[0], &array[1], &array[2], &array[3],
                   &array[4], &array[5], &array[6], &array[7],
                   &array[8], &array[9], &array[10], &array[11],
                   &array[12], &array[13], &array[14], &array[15],
                   &array[16], &array[17], &array[18], &array[19],
                   &array[20], &array[21], &array[22], &array[23],
                   &array[24], &array[25], &array[26], &array[27]);
        if (m != nfields) {
            return -1;
        }

        /* advance stream position */
        sscanf(s, "%*[^\n] %n", &n);  s += n;

        /* use the results */
        for (int i = 0;  i < nfields;  ++i) {
            sum += array[i];
        }
    }
    return sum;
}

#undef SKIP_FIELD
#undef DECIMAL_FIELD

int main()
{
    int sum = 0;
    for (int i = 0;  i < 1000000;  ++i) {
        sum += read() * (i&1 ? 1 : - 1); /* alternate add and subtract */
    }
    return sum != 0;
}
于 2018-04-26T08:53:49.473 回答
1

为了最大限度地提高性能,您应该使用mmap或等效代码映射内存中的文件,并使用临时代码解析文件,通常使用指针一次扫描一个字符,检查'\n'和/或'\r'记录结尾并将数字转换为存储到您的阵列的苍蝇。棘手的部分是:

  • 您如何分配或以其他方式处理目标数组。
  • 字段都是数字的吗?不可缺少的?
  • 最后一条记录是否被换行符终止?mmap您可以在通话后轻松检查此情况。优点是您只需要在遇到换行符序列时检查文件结尾。
于 2018-04-26T08:32:00.793 回答