
I am building a CSV data parser in C++. I am trying to access the first and fifteenth columns of the file and read them into two arrays using getline. For example:

for(int j=0;j<i;j++)
{
    getline(posts2,postIDs[j],',');
    for(int k=0;k<14;k++)
    {
        getline(posts2,tossout,',');
    }
    getline(posts2,answerIDs[j],',');
    getline(posts2,tossout,'\r');
}

However, between the first and fifteenth columns there is a quoted column that contains assorted commas and loose quotes. For example:

...,"abc, defghijk. "Lmnopqrs, "tuv,""wxyz.",...

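What is the best way to skip this column? I can't getline past it because of the quotes and commas inside it. After I encounter a quote, should I just read the quoted garbage character by character until I find the sequence ", (a double quote followed by a comma)?

Also, I've seen other solutions, but they were all specific to Windows/Visual Studio. I'm running Mac OS X version 10.8.3 with Xcode 3.2.3.

Thanks in advance! Drew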


2 Answers


There is no formal standard for the CSV format, but let's first note that the ugly column you quoted:

"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",

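does not conform to what are regarded as the basic rules of CSV, two of which are:-

  • 1) A field with embedded commas must be quoted.

  • 2) Each embedded double-quote character must be represented by a pair of double-quote characters.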
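To make the two rules concrete, here is a minimal sketch of how a writer could encode a single field so that it obeys both. The name csv_quote_field is made up for illustration, and it assumes that only commas and double quotes trigger quoting:

```cpp
#include <string>

// Hypothetical helper (not part of any library): encode one field so that
// it obeys rules 1) and 2). Assumes only ',' and '"' require quoting.
std::string csv_quote_field(const std::string& field)
{
    // A field with no comma and no double quote can be written as-is.
    if (field.find_first_of(",\"") == std::string::npos) {
        return field;
    }
    std::string out = "\"";      // rule 1): open the enclosing quotes
    for (char ch : field) {
        if (ch == '"') {
            out += '"';          // rule 2): an embedded quote becomes a pair
        }
        out += ch;
    }
    out += '"';                  // rule 1): close the enclosing quotes
    return out;
}
```

Under these assumptions the field Super, "luxurious" truck would be written as "Super, ""luxurious"" truck", which is the shape a rule-abiding column has.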

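If the problem column obeys rule 1), then it does not obey rule 2). But we can construe it as obeying rule 1), so that we can say where it ends, if we balance the double quotes, e.g.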

[abc, defghijk. [Lmnopqrs, ]tuv,[] wxyz.],

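The balanced outermost quotes enclose the column. The balanced inner quotes may carry no other sign of being internal, except that the balancing makes them internal.

We would like a rule that parses this text as one column, consistently with rule 1), and that also parses columns that do obey rule 2). The balancing just shown suggests this can be done, because columns that obey both rules are necessarily balanceable too.

The suggested rule is:

  • The column runs to the first comma that is either preceded by 0 double quotes or immediately preceded by the last of an even number of double quotes.

If there is an even number of double quotes before the comma, then we know we can balance the enclosing quotes and balance the rest in at least one way.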

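The simpler rule that you are considering:

After I encounter a quote, should I just read the quoted garbage character by character until I find the sequence ", (a double quote followed by a comma)?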

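will fail if it encounters certain columns that do obey rule 2), e.g.

"Super, ""luxurious"", truck",

The simpler rule would terminate the column after ""luxurious"", because that is the first place where a double quote is followed by a comma. But since this column conforms to rule 2), the adjacent double quotes are "escaped" double quotes with no delimiting significance. The suggested rule, on the other hand, still parses the column correctly, terminating after truck",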
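The difference between the two rules can be checked on that very column. This is only a sketch: end_by_simple_rule and end_by_suggested_rule are illustrative names, and both take the column text as a std::string starting at the opening quote and return the length of the column including its terminating comma:

```cpp
#include <cstddef>
#include <string>

// Simpler rule: the column ends at the first quote-then-comma sequence
// found after the opening quote. Returns npos if there is none.
std::size_t end_by_simple_rule(const std::string& text)
{
    std::size_t pos = text.find("\",", 1);   // start at 1 to skip the opening quote
    return pos == std::string::npos ? pos : pos + 2;
}

// Suggested rule: the column ends at the first comma preceded by 0 quotes,
// or immediately preceded by the last of an even number of quotes.
std::size_t end_by_suggested_rule(const std::string& text)
{
    unsigned quotes = 0;
    char prev = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char ch = text[i];
        if (ch == '"') {
            ++quotes;
        } else if (ch == ',' && (quotes == 0 || (prev == '"' && quotes % 2 == 0))) {
            return i + 1;                    // length including the comma
        }
        prev = ch;
    }
    return std::string::npos;
}
```

For "Super, ""luxurious"", truck", the simpler rule stops after 22 characters, right behind ""luxurious"", where five quotes have been seen. The suggested rule skips that comma because five is odd and runs to the end of the text, where the count is six.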

Here is a demo program in which the function get_csv_column parses a column by the suggested rule:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>

using namespace std;

/*
    Assume `in` is positioned at start of column.
    Accumulates chars from `in` as long as `in` is good
    until either:-
        - Have consumed a comma preceded by 0 quotes,or
        - Have consumed a comma immediately preceded by
        the last of an even number of quotes.
*/
std::string get_csv_column(ifstream & in)
{
    std::string col;
    unsigned quotes = 0;
    char prev = 0;
    bool finis = false;
    for (int ch; !finis && (ch = in.get()) != EOF; ) {
        switch(ch) {
        case '"':
            ++quotes;
            break;
        case ',':
            if (quotes == 0 || (prev == '"' && (quotes & 1) == 0)) {
                finis = true;
            }
            break;
        default:;
        }
        col += prev = ch;
    }
    return col;
}

int main()
{
    ifstream in("csv.txt");
    if (!in) {
        cout << "Open error :(" << endl;
        exit(EXIT_FAILURE);
    }
    for (std::string col; in; ) {
        col = get_csv_column(in);
        cout << "<[" << col << "]>" << std::endl;
    }
    if (!in && !in.eof()) {
        cout << "Read error :(" << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

It encloses each column in <[...]>, takes no account of newlines, and includes the terminal ',' in each column.

The file csv.txt is:

...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...,
",","",
Year,Make,Model,Description,Price,
1997,Ford,E350,"Super, ""luxurious"", truck",
1997,Ford,E350,"Super, ""luxurious"" truck",
1997,Ford,E350,"ac, abs, moon",3000.00,
1999,Chevy,"Venture ""Extended Edition""","",4900.00,
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00,
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00,

The output is:

<[...,]>
<["abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",]>
<[...,]>
<[
",",]>
<["",]>
<[
Year,]>
<[Make,]>
<[Model,]>
<[Description,]>
<[Price,]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"", truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"" truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["ac, abs, moon",]>
<[3000.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition""",]>
<["",]>
<[4900.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition, Very Large""",]>
<[,]>
<[5000.00,]>
<[
1996,]>
<[Jeep,]>
<[Grand Cherokee,]>
<["MUST SELL!
air, moon roof, loaded",]>
<[4799.00]>
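To get from here to the first and fifteenth columns asked about, one option is to apply the same rule to one line at a time and collect the columns in a vector. The sketch below is a variation on the demo, not the demo itself. The helper name split_csv_row is made up, and it assumes that every record sits on a single line (no line breaks inside quoted columns) and that the line terminator has already been stripped:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one line into columns using the suggested rule.
// Assumes one record per line; the terminating commas are not stored.
std::vector<std::string> split_csv_row(const std::string& line)
{
    std::vector<std::string> cols;
    std::string col;
    unsigned quotes = 0;   // quotes seen so far in the current column
    char prev = 0;         // previous character of the current column
    for (char ch : line) {
        if (ch == '"') {
            ++quotes;
        } else if (ch == ',' && (quotes == 0 || (prev == '"' && quotes % 2 == 0))) {
            cols.push_back(col);   // column complete, start the next one
            col.clear();
            quotes = 0;
            prev = 0;
            continue;
        }
        col += ch;
        prev = ch;
    }
    cols.push_back(col);           // the last column has no trailing comma
    return cols;
}
```

With each line read by std::getline, cols[0] and cols[14] would then be the first and fifteenth columns, provided cols.size() is at least 15. The ugly quoted column in between is simply one of the entries that gets ignored.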
answered 2013-07-19T15:55:59.317

Here is one fairly elegant way in C++ to read a .csv file whose quoted tokens contain commas:

std::string                           header;
std::vector<std::vector<std::string>> cSVRows;
std::ifstream                         reader(fileName);
if (reader.is_open()) {
    std::string line, column, id;
    std::getline(reader, line);
    header = line;
    while (std::getline(reader, line)) {
        std::stringstream        ss(line);
        std::vector<std::string> columns;
        bool                     withQ = false;
        std::string              part{""};
        while (std::getline(ss, column, ',')) {
            auto pos = column.find("\"");
            if (pos < column.length()) {
                withQ = !withQ;
                part += column.substr(0, pos);
                column = column.substr(pos + 1, column.length());
            }
            if (!withQ) {
                column += part;
                columns.emplace_back(std::move(column));
                part = "";
            } else {
                part += column + ",";
            }
        }
        cSVRows.emplace_back(columns);
    }
}
answered 2021-10-05T17:05:35.040