java - Skipping the first few words of a string when counting words in a text file

Question

I am trying to count the number of words in a text file that has the following format:

TITEL####URL####ABSTRACT\n
TITEL####URL####ABSTRACT\n
TITEL####URL####ABSTRACT\n

Like this:

 Available line####http://en.wikipedia.org/wiki/Available_line####In voice,
 Marwan al-Shehhi####http://en.wikipedia.org/wiki/Marwan_al-Shehhi####Marwan etc.
 Theodore Beza####http://en.wikipedia.org/wiki/Theodore_Beza####Theodore Beza etc.

My code for counting the words looks like this:

    public static int countTotalWords() {
        totalWords = 0;

        try {
            FileInputStream fis;
            fis = new FileInputStream(fileName);

            Scanner scan = new Scanner(fis);

            while (scan.hasNext()) {
                totalWords++;
                scan.next();
            }
        } catch (FileNotFoundException ex) {
            Logger.getLogger(Opgave1.class.getName()).log(Level.SEVERE, null, ex);
        }
        return totalWords;
    }

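I assume it works...

I only want to count the words in the abstract, so the title and URL should be ignored. I guess the #### could be used to skip the first part of each line, but for the life of me I can't figure out how. Any help is appreciated!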

score 1 · Accepted Answer

You can split the string:

String s = "TITEL####URL####ABSTRACT\n";
// Split on the literal four-hash delimiter; "#+" would also split on a single '#' inside a URL.
String[] tokens = s.split("####");
String abstractText = tokens[2];

Then, to count the words, you can split further:

int count = abstractText.split("\\s+").length;

Note: if you are using Java 7+ and your file is not too large, you can also read it with:

List<String> lines = Files.readAllLines(file, charset);
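
Putting the pieces of this answer together, a minimal sketch might look like the following. The class name AbstractWordCount, the helper countWordsInLine and the file name articles.txt are hypothetical. The sketch assumes every record has the form TITEL####URL####ABSTRACT, splits on the literal #### delimiter, and reads the file as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class AbstractWordCount {

    // Counts the whitespace-separated words in the third "####"-delimited field.
    // Lines with fewer than three fields contribute 0 (an assumption for this sketch).
    static int countWordsInLine(String line) {
        // A limit of 3 keeps any later "####" inside the abstract itself.
        String[] tokens = line.split("####", 3);
        if (tokens.length < 3) {
            return 0;
        }
        String abstractText = tokens[2].trim();
        return abstractText.isEmpty() ? 0 : abstractText.split("\\s+").length;
    }

    // Sums the abstract word counts over all lines of the file.
    static int countTotalWords(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        int total = 0;
        for (String line : lines) {
            total += countWordsInLine(line);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // "articles.txt" is a placeholder file name.
        Path file = Paths.get(args.length > 0 ? args[0] : "articles.txt");
        System.out.println(countTotalWords(file));
    }
}
```

Trimming before the second split avoids counting an empty leading token when the abstract starts with whitespace.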

score 0 · Accepted Answer

Assuming your lines always consist of fields separated by a fixed four-hash delimiter, you can use this code to count the words:

    public static int countTotalWords() {
        totalWords = 0;

        try {
            FileInputStream fis = new FileInputStream(fileName);
            Scanner scan = new Scanner(fis);

            // Read whole lines so each record can be cut at its last "####".
            while (scan.hasNextLine()) {
                String str = scan.nextLine();
                int idx = str.lastIndexOf("####");
                if (idx < 0) {
                    continue; // no delimiter on this line, nothing to count
                }
                String wordsString = str.substring(idx + 4).trim();
                if (!wordsString.isEmpty()) {
                    totalWords += wordsString.split("\\s+").length;
                }
            }
            scan.close();
        } catch (FileNotFoundException ex) {
            Logger.getLogger(Opgave1.class.getName()).log(Level.SEVERE, null, ex);
        }
        return totalWords;
    }

score 0 · Accepted Answer

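You can use lastIndexOf to find the last ####.

So, given a line, you can skip the first two fields.

Have you tried your code? I'm not familiar with Scanner (I assume it allows line-by-line consumption), but it looks like you are only counting lines.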
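
To make the lastIndexOf idea concrete, here is a small sketch for a single line. The class name LastDelimiterDemo and its method names are hypothetical, and it assumes the abstract is everything after the last #### on the line.

```java
class LastDelimiterDemo {

    static final String DELIMITER = "####";

    // Returns the text after the last "####", or an empty string if the line has no delimiter.
    static String abstractOf(String line) {
        int idx = line.lastIndexOf(DELIMITER);
        if (idx < 0) {
            return "";
        }
        return line.substring(idx + DELIMITER.length()).trim();
    }

    // Counts whitespace-separated words in the abstract part of the line.
    static int countAbstractWords(String line) {
        String abstractText = abstractOf(line);
        return abstractText.isEmpty() ? 0 : abstractText.split("\\s+").length;
    }

    public static void main(String[] args) {
        String line = "Theodore Beza####http://en.wikipedia.org/wiki/Theodore_Beza####Theodore Beza etc.";
        System.out.println(abstractOf(line));
        System.out.println(countAbstractWords(line));
    }
}
```

On the Scanner point: with its default delimiter, Scanner.next() returns whitespace-separated tokens, so the loop in the question counts every word on each line, title and URL included, rather than counting lines. Scanner.nextLine() is the call that consumes a whole line.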
