java - How to use a Scanner delimiter to filter out non-letters, including single quotes or apostrophes, from a text file in Java

Question

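I want to count every word in a file, and the count should not include non-letters such as apostrophes, commas, periods, question marks, exclamation marks and so on, only the letters of the alphabet. I tried using a delimiter like this, but it does not cover the apostrophe.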

Scanner fileScanner = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
    int totalWordCount = 0;

    //Firstly to count all the words in the file without the restricted characters 
    while (fileScanner.hasNext()) {
        fileScanner.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
        totalWordCount++;
    }
    System.out.println("There are " + totalWordCount + " word(s)");

  //Then later I create an array to store each individual word in the file for counting their lengths.
    Scanner fileScanner2 = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
    String[] words = new String[totalWordCount];
    for (int i = 0; i < totalWordCount; ++i) {
        words[i] = fileScanner2.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
    }

This doesn't seem to work!

What should I do?

score 2 · Accepted Answer

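It seems to me that you don't want to split on anything other than spaces and line endings. For example, if you use ' as a delimiter when counting words, the word "they're" will be returned as two words. Here is how to change your original code to make it work.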

Scanner fileScanner = new Scanner(new File("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt"));
    int totalWordCount = 0;
    ArrayList<String> words = new ArrayList<String>();

    //Firstly to count all the words in the file without the restricted characters 
    while (fileScanner.hasNext()) {
        //Add words to an array list so you only have to go through the scanner once
        words.add(fileScanner.next());//This defaults to whitespace
        totalWordCount++;
    }
    System.out.println("There are " + totalWordCount + " word(s)");
    fileScanner.close();

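Use Pattern.compile() to turn your string into a regular expression. The '\s' character class is predefined in the Pattern class and matches any whitespace character.

There is more information in the Pattern documentation.

Also, make sure you close your Scanner when you are done with it. Leaving it open may be what prevents your second scanner from opening the file.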
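As a rough sketch of the two points above (splitting on whitespace via a compiled pattern, and always closing the scanner), something like the following could work. The class name WhitespaceWordReader, the method readWords, and the command-line path are assumptions made for illustration; they are not part of the original code.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class WhitespaceWordReader {

    // "\\s+" matches one or more whitespace characters (spaces, tabs, line breaks).
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Reads every whitespace-separated token from the given scanner.
    // try-with-resources closes the scanner when the method returns.
    static List<String> readWords(Scanner source) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = source) {
            scanner.useDelimiter(WHITESPACE);
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // args[0] is a hypothetical path supplied on the command line.
        // Note new File(...): passing the path as a plain String would make
        // the scanner read the path text itself instead of the file.
        List<String> words = readWords(new Scanner(new File(args[0])));
        System.out.println("There are " + words.size() + " word(s)");
    }
}
```

Because the words are collected in one pass, there is no need to open a second scanner on the same file.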

Edit

If you want to count the letters in each word, you can add the following code to the code above

int totalLetters = 0;
int[] lettersPerWord = new int[words.size()];
for (int wordNum = 0; wordNum < words.size(); wordNum++)
{
    String word = words.get(wordNum);
    // Strip punctuation, whitespace and apostrophes so only letters remain
    word = word.replaceAll("[.,:;()?!\"' \t\n\r]+", "");
    lettersPerWord[wordNum] = word.length();
    totalLetters += word.length(); // accumulate instead of overwriting
}

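I have tested this code and it seems to work for me. According to the JavaDoc, replaceAll matches using a regular expression, so it should match any of these characters and effectively remove them.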
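A possible variation, offered only as a sketch: instead of listing every punctuation character to remove, the pattern can be inverted so that anything that is not an ASCII letter is deleted. The class LetterCounter and its method names are hypothetical, and the sample tokens are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class LetterCounter {

    // Counts only ASCII letters in a token by deleting every other character.
    static int countLetters(String word) {
        return word.replaceAll("[^a-zA-Z]", "").length();
    }

    // Sums the letter counts of all tokens.
    static int totalLetters(List<String> words) {
        int total = 0;
        for (String word : words) {
            total += countLetters(word);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical sample tokens, as a whitespace split would produce them.
        List<String> words = Arrays.asList("They're", "here,", "now.");
        for (String word : words) {
            System.out.println(word + " -> " + countLetters(word) + " letter(s)");
        }
        System.out.println("Total letters: " + totalLetters(words));
    }
}
```

This assumes that only the letters a to z and A to Z count as letters; accented or non-Latin letters would need a different character class.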

score 1 · Accepted Answer

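Scanner.useDelimiter(String) compiles its argument as a regular expression, so in your example the scanner splits on any run of the characters listed in "[.,:;()?!\" \t\n\r]+". The apostrophe is not in that character class, which is why it stays inside the tokens.

Instead of describing the delimiters, you can use a regular expression to match the text directly.

Using the regex classes (Pattern and Matcher) together with the group method may be what you want.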

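// Requires java.util.regex.Pattern and java.util.regex.Matcher
String test = "Hello, world"; // hypothetical sample input
String pattern = "(.*)[.,:;()?!\" \t\n\r]+(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
if (m.find()) {
    // The first (.*) is greedy, so group(1) is the text before the last delimiter character
    System.out.println("Found value: " + m.group(1));
}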

Play around with these classes and you will find something closer to what you need.
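One way to apply Pattern and Matcher to the word-counting problem, sketched under assumptions: rather than describing what separates words, describe what a word looks like and collect every match with find(). The class RegexWordFinder, the method findWords, and the word pattern itself are illustrative choices, not something taken from the answers here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWordFinder {

    // One or more letters, optionally followed by apostrophe-plus-letters groups,
    // so "they're" is kept as a single word while surrounding quotes are dropped.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+(?:'[a-zA-Z]+)*");

    // Returns every substring of the text that matches the word pattern, in order.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical sample sentence.
        List<String> words = findWords("They're here, now!");
        System.out.println(words.size() + " word(s): " + words);
    }
}
```

For a file, the same method could be called on each line read from it; that part is left out to keep the sketch short.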

score 0 · Accepted Answer

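You can try this regular expression as the delimiter: fileScanner.useDelimiter("[^a-zA-Z']+").next();

This treats any run of characters that are neither letters nor apostrophes as a delimiter. That way your words will contain apostrophes, but no other non-letter characters.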

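Then, if you want the lengths to be accurate, you have to go through each word, check for apostrophes and account for them. You can remove every apostrophe so that the length matches the number of letters in the word, or you can create word objects with their own length field, so that you can print the word as is and still know how many letters it contains.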
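The approach described above might look roughly like this. It is a sketch only: the class ApostropheDelimiterDemo and the methods splitWords and letterCount are hypothetical names, and the sample text stands in for the file contents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ApostropheDelimiterDemo {

    // Splits on runs of characters that are neither ASCII letters nor apostrophes,
    // so tokens such as "they're" keep their apostrophe.
    static List<String> splitWords(Scanner scanner) {
        scanner.useDelimiter("[^a-zA-Z']+");
        List<String> words = new ArrayList<>();
        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        return words;
    }

    // Number of letters in a token once its apostrophes are removed.
    static int letterCount(String word) {
        return word.replace("'", "").length();
    }

    public static void main(String[] args) {
        // Hypothetical sample text; a real program would wrap a File instead.
        try (Scanner scanner = new Scanner("Don't stop, they're here!")) {
            for (String word : splitWords(scanner)) {
                System.out.println(word + " -> " + letterCount(word) + " letter(s)");
            }
        }
    }
}
```

The word is printed as it appears in the text, while the count ignores the apostrophe, which is the trade-off the paragraph above describes.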
