c++ - Tokenizing a string in C++

Question

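I am using the following code to split each word into a token, one per line. My problem is this: I want the token index to keep increasing across the whole file instead of restarting on every line. The contents of the file are: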

Student details:
Highlander 141A Section-A.
Single 450988012 SA

Program:

#include <iostream>
using std::cout;
using std::endl;

#include <fstream>
using std::ifstream;

#include <cstring>

const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 20;
const char* const DELIMITER = " ";

int main()
{
  // create a file-reading object
  ifstream fin;
  fin.open("data.txt"); // open a file
  if (!fin.good()) 
    return 1; // exit if file not found

  // read each line of the file
  while (!fin.eof())
  {
    // read an entire line into memory
    char buf[MAX_CHARS_PER_LINE];
    fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = strtok(buf, DELIMITER); // first token
    if (token[0]) // zero if line is blank
    {
      for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
      {
        token[n] = strtok(0, DELIMITER); // subsequent tokens
        if (!token[n]) break; // no more tokens
      }
    }

    // process (print) the tokens
    for (int i = 0; i < n; i++) // n = #of tokens
      cout << "Token[" << i << "] = " << token[i] << endl;
    cout << endl; // blank line after each input line (not part of the for loop)
  }
}

Output:

Token[0] = Student
Token[1] = details:

Token[0] = Highlander
Token[1] = 141A
Token[2] = Section-A.

Token[0] = Single
Token[1] = 450988012
Token[2] = SA

Expected:

Token[0] = Student
Token[1] = details:

Token[2] = Highlander
Token[3] = 141A
Token[4] = Section-A.

Token[5] = Single
Token[6] = 450988012
Token[7] = SA

So I want the index to keep increasing, so that I can easily identify each value by its variable name. Thanks in advance...

score 2 · Accepted Answer

What's wrong with the standard idiomatic solution:

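// needs <iostream>, <sstream> and <string>
std::string line;
int i = 0;  // declared outside the line loop so the index keeps counting across lines
while ( std::getline( fin, line ) ) {
    std::istringstream parser( line );
    std::string token;
    while ( parser >> token ) {
        std::cout << "Token[" << i << "] = " << token << std::endl;
        ++ i;
    }
}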

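Obviously, in real life you will need to do more than just output each token, and the parsing will be more complicated. But whenever you are doing line-oriented input, the above is the model you should use (possibly also keeping track of the line number, for error messages).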
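For the continuous numbering the question asks for, the token counter has to outlive the per-line loop. The following is a minimal sketch of that idea rather than a definitive implementation. The function name `print_tokens` and its stream parameters are assumptions made for illustration; they do not appear in the question or in this answer.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (name and signature are illustrative assumptions).
// Reads `in` line by line, prints every whitespace-delimited token with an
// index that keeps increasing across lines, and returns the token count.
std::size_t print_tokens(std::istream& in, std::ostream& out)
{
    std::size_t index = 0; // running index, never reset between lines
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream parser(line);
        std::string token;
        while (parser >> token) {
            out << "Token[" << index << "] = " << token << '\n';
            ++index;
        }
        out << '\n'; // blank line after each input line, as in the question
    }
    return index;
}
```

With the file from the question, this would be called as `print_tokens(fin, std::cout)`. Testing `std::getline` directly in the loop condition also avoids the `while (!fin.eof())` pattern, which can process one spurious empty line at the end of the file.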

It may be worth pointing out that in this case an even better solution would be to use boost::split in the outer loop, to obtain a vector of tokens.

score 0 · Accepted Answer

I would let iostream do the splitting

std::vector<std::string> token;
std::string s;
while (fin >> s)
    token.push_back(s);

Then you can output the whole array in one go, with the proper indices.

for (int i = 0; i < token.size(); ++i)
    cout << "Token[" << i << "] = " << token[i] << endl;

Update:

You can even drop the vector altogether and output the tokens as you read them from the input stream

std::string s;
for (int i = 0; fin >> s; ++i)
    std::cout << "Token[" << i << "] = " << s << std::endl; // print s: there is no vector here
