c++ - Extracting words (made up of unsigned chars) from a file with a delimiter in C++

Question

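I have been searching the internet but could not find any existing tool for extracting words from a file with a specific delimiter in C++. Does anyone know of an existing C++ library or piece of code that does this? Here is what I want to achieve:

Goal: extract words from a file using a delimiter
Word type: a word can be any combination of unsigned chars (in the UTF-8 encoding), so \0 should also count as a character. Only the delimiter should be able to separate two words from each other.
File type: text file

I tried the following code: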

#include <iostream>
using std::cout;
using std::endl;

#include <fstream>
using std::ifstream;

#include <cstring>

const int MAX_TOKENS_PER_FILE = 100000;
const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 256;
const char* const DELIMITER = " ";

int main()
{
  int index = 0, keyword_num = 0;

  // stores all the words that are in a file
  unsigned char *keywords_extracted[MAX_TOKENS_PER_FILE];    

  // create a file-reading object
  ifstream fin;
  fin.open("data.txt"); // open a file
  if (!fin.good()) 
    return 1; // exit if file not found

  // read each line of the file
  while (!fin.eof())
  {
    // read an entire line into memory
    char buf[MAX_CHARS_PER_LINE];
    fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = strtok(buf, DELIMITER); // first token
    if (token[0]) // zero if line is blank
    {
      keywords_extracted[keyword_num] = (unsigned char *)token[0];
      keyword_num++;

      for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
      {
        token[n] = strtok(0, DELIMITER); // subsequent tokens
        if (!token[n]) break; // no more tokens
        keywords_extracted[keyword_num] = (unsigned char *)token[n];
        keyword_num++;
      }
    }

  }
  // process (print) the tokens
  for (index = 0; index < keyword_num; index++)
    cout << keywords_extracted[index] << endl;
}

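However, I ran into a problem with the code above:

The first word/entry in keywords_extracted is replaced with "0", because the last line the program reads is empty. (Please correct me if I am doing or assuming anything wrong.)

Is there a way to overcome this problem in the code above, or is there an existing library that provides this functionality? Sorry for the lengthy explanation, I just wanted to be clear.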

score 2 · Accepted Answer

std::getline accepts a delimiter (its third argument), which can differ from the default '\n'. Does that not work for you?

Example:

std::string word;
while (std::getline(fin, word, '|')) {
   std::cout << word;
}

This should read and print each word, using the pipe (|) as the delimiter.
