c++ - Parsing a huge, complicated CSV file using C++

Question

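I have a large CSV file which looks like this:

23456, The End is Near, A silly description that makes no sense, http://www.example.com, 45332, 5th July 1998 Sunday, 45.332

That's just one line of the CSV file. There are around 500k of these.

I want to parse this file using C++. The code I started out with is: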

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>

using namespace std;

int main()
{
    // open the input csv file containing training data
    ifstream inputFile("my.csv");

    string line;

    while (getline(inputFile, line, ','))
    {
        istringstream ss(line);

        // declaring appropriate variables present in csv file
        long unsigned id;
        string url, title, description, datetaken;
        float val1, val2;

        ss >> id >> url >> title >> datetaken >> description >> val1 >> val2;

        cout << url << endl;
    }
    inputFile.close();
}

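The problem is that it's not printing out the correct values.

I suspect that it can't handle the whitespace within a field. So what do you suggest I should do?

Thanks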

score 4 · Accepted Answer

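In this example we have to use two getline calls to parse the string. The first, getline(cin, line), reads one line of CSV text using the default newline delimiter. The second, getline(ss, field, ','), splits that line on commas.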

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

float get_float(const std::string& s) { 
    std::stringstream ss(s);
    float ret;
    ss >> ret;
    return ret;
}


int get_int(const std::string& s) { 
    std::stringstream ss(s);
    int ret;
    ss >> ret;
    return ret;
}

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        std::stringstream ss(line);
        std::vector<std::string> v;
        std::string field;
        while(getline(ss, field, ',')) {
            std::cout << " " << field;
            v.push_back(field);
        }
        if (v.size() < 7) continue;  // skip short or blank lines instead of indexing out of range
        int id = get_int(v[0]);
        float f = get_float(v[6]);
        std::cout << v[3] << std::endl;
    }
}

score 1 · Accepted Answer

Within the stated constraints, I think I'd do something like this:

#include <algorithm>
#include <locale>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>

// A ctype that classifies only comma and new-line as "white space":
struct field_reader : std::ctype<char> {

    field_reader() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table() {
        static std::vector<std::ctype_base::mask>
            rc(table_size, std::ctype_base::mask());

        rc[','] = std::ctype_base::space;
        rc['\n'] = std::ctype_base::space;
        return &rc[0];
    }
};

// A struct to hold one record from the file:
struct record {
    std::string key, name, desc, url, zip, date, number;

    friend std::istream &operator>>(std::istream &is, record &r) {
        return is >> r.key >> r.name >> r.desc >> r.url >> r.zip >> r.date >> r.number;
    }

    friend std::ostream &operator<<(std::ostream &os, record const &r) {
        return os << "key: " << r.key
            << "\nname: " << r.name
            << "\ndesc: " << r.desc
            << "\nurl: " << r.url
            << "\nzip: " << r.zip
            << "\ndate: " << r.date
            << "\nnumber: " << r.number;
    }
};

int main() {
    std::stringstream input("23456, The End is Near, A silly description that makes no sense, http://www.example.com, 45332, 5th July 1998 Sunday, 45.332");

    // use our ctype facet with the stream:
    input.imbue(std::locale(std::locale(), new field_reader()));

    // read in all our records:
    std::istream_iterator<record> in(input), end;
    std::vector<record> records{ in, end };

    // show what we read:
    std::copy(records.begin(), records.end(),
              std::ostream_iterator<record>(std::cout, "\n"));

}

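Granted, this is longer than most of the others, but it's all broken down into small, mostly reusable pieces. Once you have the other pieces in place, the code to read the data is trivial: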

    std::vector<record> records{ in, end };

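One other point I find compelling: the first time the code compiled, it also ran correctly (and I find that fairly routine for this style of programming).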

score 1 · Accepted Answer

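Using the overloaded extraction operator (operator>>) on a std::istream to read std::string values is not going to work well here. The whole line is one string, so by default the stream has no way to tell where one field ends and the next begins. A quick fix is to split the line on commas and assign the values to the appropriate fields (instead of using std::istringstream).

Note: this is in addition to jrok's point about std::getline.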
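The split-and-assign approach described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the `Row` struct, its member names, and the helpers `trim`, `split_on_commas` and `parse_row` are all hypothetical, and the sketch assumes that no field contains a comma or a quote.

```cpp
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// Hypothetical record type; the member names are guesses based on the
// sample line in the question.
struct Row {
    unsigned long id = 0;
    std::string title, description, url;
    float value = 0.0f;
};

// Remove leading and trailing spaces/tabs from a field.
std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Split one line on commas using std::string::find.
// Assumes no field contains a comma or quotes.
std::vector<std::string> split_on_commas(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(trim(line.substr(start)));  // last field
            break;
        }
        fields.push_back(trim(line.substr(start, comma - start)));
        start = comma + 1;
    }
    return fields;
}

// Assign the split values to the matching members.
// Returns false if the line does not have exactly 7 fields or a
// numeric field cannot be converted.
bool parse_row(const std::string& line, Row& out) {
    const std::vector<std::string> f = split_on_commas(line);
    if (f.size() != 7) return false;
    try {
        out.id = std::stoul(f[0]);
        out.value = std::stof(f[6]);
    } catch (const std::exception&) {
        return false;
    }
    out.title = f[1];
    out.description = f[2];
    out.url = f[3];
    return true;
}
```

A caller would read each line with std::getline(file, line) and pass it to parse_row, skipping or logging the lines for which it returns false.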

score 0 · Accepted Answer

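I just worked this problem out for myself and am willing to share!
It may be overkill, but it is a working example of how Boost Tokenizer and vectors can handle a big problem.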

/*
 * ALfred Haines Copyleft 2013
 * convert csv to sql file
 * csv2sql requires that each line is a unique record
 *
 * This example of file read and the Boost tokenizer
 *
 * In the spirit of COBOL I do not output until the end
 * when all the print lines are output at once
 * Special thanks to SBHacker for the code to handle linefeeds
*/
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/algorithm/string/replace.hpp>
#include <vector>

namespace io = boost::iostreams;
using boost::tokenizer;
using boost::escaped_list_separator;
typedef tokenizer<escaped_list_separator<char> > so_tokenizer;

using namespace std;
using namespace boost;

vector<string> parser( string );


int main()
{
vector<string> stuff ; // this is the data in a vector
string filename; // this is the input file
string c = ""; // this holds the print line
string sr ;

cout << "Enter filename: " ;
cin >> filename;
//filename = "drwho.csv";
int lastindex = filename.find_last_of("."); // find where the extension begins
string rawname = filename.substr(0, lastindex); // extract the raw name

stuff = parser( filename ); // this gets the data from the file

/** I ask if the user wants a new_index to be created */
cout << "\n\nMySql requires a unique ID field as a Primary Key \n" ;
cout << "If the first field is not unique (no duplicate entries) \nthen you should create a " ;
cout << "New index field for this data.\n" ;
cout << "Not Sure! try no first to maintain data integrity.\n" ;
string ni ;bool invalid_data = true;bool new_index = false ;
    do {
        cout<<"Should I create a New Index now? (y/n)"<<endl;
        cin>>ni;
    if ( ni  == "y" || ni  == "n" ) { invalid_data =false ;  }
        } while (invalid_data);
    cout << "\n" ;
if (ni  == "y" )
{
  new_index = true ;
  sr = rawname.c_str() ; sr.append("_id" ); // new_index field
}

// now make the sql code from the vector stuff
// Create table section
c.append("DROP TABLE IF EXISTS `");
c.append(rawname.c_str() );
c.append("`;");
c.append("\nCREATE TABLE IF NOT EXISTS `");
c.append(rawname.c_str() );
c.append( "` (");
c.append("\n");
if (new_index)
{
c.append( "`");
c.append(sr );
c.append( "`  int(10) unsigned NOT NULL,");
c.append("\n");
}

string s = stuff[0];// it is assumed that line zero has fieldnames

int x =0 ; // used to determine if new index is printed

// boost tokenizer code from the Boost website -- tok holds the token
so_tokenizer tok(s, escaped_list_separator<char>('\\', ',', '\"'));
for(so_tokenizer::iterator beg=tok.begin(); beg!=tok.end(); ++beg)
  {
    x++; // keeps number of fields for later use to eliminate the comma on the last entry
    if (x == 1 && new_index == false ) sr = static_cast<string> (*beg) ;
    c.append( "`" );
    c.append(*beg);
    if (x == 1 && new_index == false )
    {
      c.append( "`  int(10) unsigned NOT NULL,");
    }
    else
    {
    c.append("`  text ,");
    }
    c.append("\n");
    }
c.append("PRIMARY KEY (`");
c.append(sr );
c.append("`)" );
c.append("\n");
c.append( ") ENGINE=InnoDB DEFAULT CHARSET=latin1;");
c.append("\n");
c.append("\n");
// The Create table section is done

// Now make the Insert lines one per line is safer in case you need to split the sql file
for (int w=1; w < stuff.size(); ++w)
  {
    c.append("INSERT INTO `");
    c.append(rawname.c_str() );
    c.append("` VALUES (  ");
if (new_index)
{
    string String = std::to_string(w); // row number used as the new index value
    c.append(String);
    c.append(" , ");
}
    int p = 1 ; // used to eliminate the comma on the last entry
    // tokenizer code needs unique name -- stok holds this token
    so_tokenizer stok(stuff[w], escaped_list_separator<char>('\\', ',', '\"'));
    for(so_tokenizer::iterator beg=stok.begin(); beg!=stok.end(); ++beg)
    {
      c.append(" '");
      string str = static_cast<string> (*beg) ;
      boost::replace_all(str, "'", "\\'");
//      boost::replace_all(str, "\n", " -- ");
      c.append( str);
      c.append("' ");
      if ( p < x ) c.append(",")  ;// we dont want a comma on the last entry
      p++ ;
    }
    c.append( ");\n");
  }

// now print the whole thing to an output file
string out_file = rawname.c_str() ;
out_file.append(".sql");
io::stream_buffer<io::file_sink> buf(out_file);
std::ostream out(&buf);
out << c ;

// let the user know that they are done
cout<< "Well if you got here then the data should be in the file " << out_file << "\n" ;

return 0;}

vector<string> parser( string filename )
{
    typedef tokenizer< escaped_list_separator<char> > Tokenizer;
    escaped_list_separator<char> sep('\\', ',', '\"');
    vector<string> stuff ;
    string data(filename);
    ifstream in(filename.c_str());
    string li;
    string buffer;
    bool inside_quotes(false);
    size_t last_quote(0);
    while (getline(in,buffer))
    {
        // --- deal with line breaks in quoted strings
        last_quote = buffer.find_first_of('"');
        while (last_quote != string::npos)
        {
            inside_quotes = !inside_quotes;
            last_quote = buffer.find_first_of('"',last_quote+1);
        }
        li.append(buffer);
        if (inside_quotes)
        {
            li.append("\n");
            continue;
        }
        // ---
        stuff.push_back(li);
        li.clear(); // clear here, next check could fail
    }
    in.close();
    //cout << stuff.size() << endl ;
    return stuff ;

}
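For readers without Boost, the quote-aware splitting that the program relies on can be approximated by hand. The function below is a simplified sketch and not Boost's implementation: the name `split_quoted` is made up for illustration, and it only mirrors the configuration used above (backslash as escape, comma as separator, double quote as quote character).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one CSV record into fields, honoring double quotes.
// A comma inside "..." does not end the field, and a backslash makes
// the next character literal (so \" yields a quote character).
// Simplification: no escape sequence is translated into another
// character, and malformed input (such as an unterminated quote) is
// accepted silently rather than reported.
std::vector<std::string> split_quoted(const std::string& record) {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (std::size_t i = 0; i < record.size(); ++i) {
        const char ch = record[i];
        if (ch == '\\' && i + 1 < record.size()) {
            current += record[++i];      // keep the escaped character as-is
        } else if (ch == '"') {
            in_quotes = !in_quotes;      // quotes delimit; they are not stored
        } else if (ch == ',' && !in_quotes) {
            fields.push_back(current);   // separator outside quotes ends a field
            current.clear();
        } else {
            current += ch;
        }
    }
    fields.push_back(current);           // the last field has no trailing comma
    return fields;
}
```

Because the function takes a whole record, line breaks inside quoted fields survive as long as the caller has already joined the physical lines, as the parser function above does.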

score 0 · Accepted Answer

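You are right to suspect that your code isn't behaving as expected because of the whitespace within the field values.

If you really do have "simple" CSV, where no field can contain a comma inside its value, then I would step away from the stream operators, and perhaps from C++ altogether. The example program in the question merely reorders fields. There is no need to actually interpret the values or convert them to their appropriate types (unless validation is also a goal). Reordering alone is very easy to accomplish with awk. For example, the following command would reverse 3 fields of a simple CSV file.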

cat infile | awk -F, '{ print $3","$2","$1 }' > outfile

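If the goal is really to use this code snippet as a launchpad for bigger and better ideas... then I would tokenize the line by searching for commas. The std::string class has built-in methods for finding the offset of a specific character. You can make this approach as elegant or as inelegant as you like. The most elegant versions end up looking something like the Boost tokenization code.

The quick and dirty approach is to simply know that your program has N fields and look for the positions of the corresponding N-1 commas. Once you have those positions, it is pretty simple to call std::string::substr to extract the fields of interest.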
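A minimal sketch of that quick and dirty approach might look like the following. The names `kFieldCount`, `find_commas` and `field_at` are invented for this example, and N = 7 is an assumption taken from the sample line in the question. As with any plain comma search, it assumes no field contains a comma.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Number of fields per line (N); 7 matches the sample line in the question.
constexpr std::size_t kFieldCount = 7;

// Locate the N-1 commas in `line`.
// Returns false if the line has too few or too many commas.
bool find_commas(const std::string& line,
                 std::array<std::size_t, kFieldCount - 1>& commas) {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < commas.size(); ++i) {
        pos = line.find(',', pos);
        if (pos == std::string::npos) return false;   // too few commas
        commas[i] = pos++;                            // record it, then step past it
    }
    return line.find(',', pos) == std::string::npos;  // reject extra commas
}

// Extract field `index` (0-based, must be < kFieldCount) using the
// comma positions found above. Surrounding spaces are left in place.
std::string field_at(const std::string& line,
                     const std::array<std::size_t, kFieldCount - 1>& commas,
                     std::size_t index) {
    const std::size_t begin = (index == 0) ? 0 : commas[index - 1] + 1;
    const std::size_t end =
        (index == kFieldCount - 1) ? line.size() : commas[index];
    return line.substr(begin, end - begin);
}
```

Only the fields that are actually needed get copied, which may matter for a file with around 500k lines. The trade-off is that the field count is fixed at compile time.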
