parsing - Java Scanner hasNext(String) method sometimes does not match

Question

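I am trying to use the Java Scanner hasNext method, but I am getting strange results. Maybe my problem is obvious, but why does the simple regular expression "[a-zA-Z']+" not work for words like these: "points.anything, supervisor,"? I also tried "[\\w']+".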

public HashMap<String, Integer> getDocumentWordStructureFromPath(File file) {
    HashMap<String, Integer> dictionary = new HashMap<>();
    try {
        Scanner lineScanner = new Scanner(file);
        while (lineScanner.hasNextLine()) {
            Scanner scanner = new Scanner(lineScanner.nextLine());
            while (scanner.hasNext("[\\w']+")) {
                String word = scanner.next().toLowerCase();
                if (word.length() > 2) {
                    int count = dictionary.containsKey(word) ? dictionary.get(word).intValue() + 1 : 1;
                    dictionary.put(word, new Integer(count));
                }
            }
            scanner.close();
        }
        //scanner.useDelimiter(DELIMITER);
        lineScanner.close();

        return dictionary;

    } catch (FileNotFoundException e) { 
        e.printStackTrace();
        return null;
    }   
}

score 1 · Accepted Answer

Your regular expression should be [^a-zA-Z]+, used as the delimiter, because you need to split on everything that is not a letter:

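// previous code...
    // Split on runs of non-letters instead of the default whitespace delimiter.
    // Note the class is A-Z: the range A-z would also treat [ \ ] ^ _ ` as letters.
    Scanner scanner = new Scanner(lineScanner.nextLine()).useDelimiter("[^a-zA-Z]+");
    while (scanner.hasNext()) {
        String word = scanner.next().toLowerCase();
        // ...your other code
    }
}
// ... after code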
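
For reference, the fragment above can be expanded into a small self-contained sketch. The class name DelimiterWordCount and the method countWords are hypothetical, and the sketch counts words in a String rather than a File to stay short; it is one possible way to apply the delimiter, not the asker's final code. If apostrophes should stay inside words, as in the question's own patterns, the delimiter could be "[^a-zA-Z']+" instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical class and method names, used only for illustration.
class DelimiterWordCount {

    // Counts lower-cased words longer than two characters,
    // treating every run of non-letters as a delimiter.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> dictionary = new HashMap<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("[^a-zA-Z]+")) {
            while (scanner.hasNext()) {
                String word = scanner.next().toLowerCase();
                if (word.length() > 2) {
                    // Insert 1 for a new word, otherwise add 1 to the existing count.
                    dictionary.merge(word, 1, Integer::sum);
                }
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        System.out.println(countWords("points.anything, supervisor, Points"));
    }
}
```

With this delimiter, "points.anything," yields the two tokens "points" and "anything", so punctuation no longer stops the loop.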

Edit: why doesn't the hasNext(String) approach work?

This line:

Scanner scanner = new Scanner(lineScanner.nextLine());

creates a scanner that uses the default whitespace delimiter pattern, so if you have this test line "Hello World. A test, ok.", it gives you these tokens:

Hello
World.
A
test,
ok.
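
The default tokenization can be checked with a short sketch; the class name DefaultTokens and the helper tokens are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, used only to show the default whitespace tokenization.
class DefaultTokens {

    // Returns every token the scanner produces with its default delimiter.
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the words because only whitespace separates tokens.
        System.out.println(tokens("Hello World. A test, ok."));
        // prints [Hello, World., A, test,, ok.]
    }
}
```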

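Then, if you call scanner.hasNext("[a-zA-Z]+"), you are asking the scanner whether the next token matches your pattern. For this example, it returns true for the first token:

Hello (because the whole token matches the pattern you specified)

The next token (World.) doesn't match the pattern because of the trailing dot, so scanner.hasNext("[a-zA-Z]+") simply returns false. The scanner does not skip the token, so the while loop ends there and the rest of the line is never read. That is why it fails for any word that has a non-letter character attached to it. Got it?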
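
A minimal sketch of this behaviour, assuming the same test line as above; the class name HasNextPatternDemo and the method takeWhileMatching are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo of why the loop in the question stops early.
class HasNextPatternDemo {

    // Collects tokens for as long as the next whitespace-delimited token
    // matches the regex in full; stops at the first token that does not.
    static List<String> takeWhileMatching(String line, String regex) {
        List<String> accepted = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            while (scanner.hasNext(regex)) {
                accepted.add(scanner.next());
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // "World." fails the match, so only the first token is collected.
        System.out.println(takeWhileMatching("Hello World. A test, ok.", "[a-zA-Z]+"));
        // prints [Hello]
    }
}
```

The same thing happens with "[\\w']+", since the dot in "World." is not a word character either.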

Hope this helps.
