regex - Extracting uppercase words with Apache POI and RegEx

Question

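I'm working on a project to extract uppercase words from a .doc file in Java. I'm using a regular expression, and that is where I'm running into trouble. I'm not familiar with regex, but this is what I'm using: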

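import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

private static final String REGEX = "[A-Z]+";

private void parseWordText(File file) throws IOException {
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fs = new FileInputStream(file)) {
        HWPFDocument doc = new HWPFDocument(fs);
        WordExtractor we = new WordExtractor(doc);
        String[] dataArray = we.getParagraphText();
        if (dataArray != null) {
            // compile the pattern once instead of once per paragraph
            Pattern p = Pattern.compile(REGEX);
            List<String> sequences = new ArrayList<>();
            for (String data : dataArray) {
                Matcher m = p.matcher(data);
                while (m.find()) {
                    // m.group() is the matched text, same as substring(start, end)
                    sequences.add(m.group());
                    System.out.println(m.group());
                }
            }
        }
    }
}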

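With the code and regex above, I get every uppercase letter, not just the words that are entirely uppercase. Basically, Hello should not match, but HELLO should.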

score 1 · Accepted Answer

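If you want to match whole words only, anchor the pattern at word boundaries. In Java's java.util.regex the word-boundary anchor is \b (the \< and \> anchors known from GNU grep and Vim are not supported in Java, where \< simply matches a literal <). Remember that backslashes need to be doubled to get them into a Java string literal, so you should write \\b. A "word" here is a run of word characters, which by default means [a-zA-Z0-9_]+. So your regex becomes \b[A-Z]+\b. Note that this also matches one-letter words (for example the I, but not the H, in "Here I am"). If you don't want those, use {2,} instead of +.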
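To make the difference concrete, here is a minimal sketch of the boundary-anchored pattern applied to a plain string rather than to POI output. The class name UppercaseWords and the methods extract and extractMultiLetter are made up for illustration and are not part of the original question or of Apache POI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class UppercaseWords {
    // \b marks a word boundary, so only runs of capitals that form a whole word match.
    private static final Pattern ALL_CAPS = Pattern.compile("\\b[A-Z]+\\b");
    // {2,} instead of + skips one-letter words such as "I".
    private static final Pattern ALL_CAPS_MULTI = Pattern.compile("\\b[A-Z]{2,}\\b");

    static List<String> extract(String text) {
        return collect(ALL_CAPS, text);
    }

    static List<String> extractMultiLetter(String text) {
        return collect(ALL_CAPS_MULTI, text);
    }

    // Collects every match of the given pattern, in order of appearance.
    private static List<String> collect(Pattern pattern, String text) {
        List<String> words = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "Here I am, HELLO World";
        System.out.println(extract(sample));            // prints [I, HELLO]
        System.out.println(extractMultiLetter(sample)); // prints [HELLO]
    }
}
```

In the question's code, the same effect should follow from changing only the REGEX constant to "\\b[A-Z]+\\b" (or "\\b[A-Z]{2,}\\b"), since each paragraph returned by getParagraphText() is matched in the same way.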
