java - Regular expression to retrieve words from a file

Question

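I have a set of files in a particular directory.

After retrieving the contents of all the (text) files in that directory, I have a list of strings.

Each string element holds the content retrieved from one file, so the first String element in the list is the content of the first file.

Now I want to split each string to get its words (and later store the words in a String array).

1) Words can be separated by a single space or by multiple spaces.
2) Sentences end with ".", so a new word can start right after a ".".
3) A new word can start after '\n'.

Can anyone suggest a regular expression suitable for the split() method?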

score 4 · Accepted Answer

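Perhaps the StringTokenizer class would suit your needs better. Its constructor takes the string to tokenize and a list of delimiters (in your case: space, "." and newline).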
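A minimal sketch of that suggestion, assuming each file's content is already available as a String. The class name `TokenizerWords` and the method `words` are made up for illustration, and adding '\r' and '\t' to the delimiters is my own assumption (the answer only lists space, "." and newline).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerWords {
    // Splits text into words, treating space, '.', and line breaks as delimiters.
    static String[] words(String text) {
        // Every character of the second argument is a delimiter.
        // Runs of consecutive delimiters never produce empty tokens.
        // Assumption: '\r' and '\t' are included to cope with Windows line endings and tabs.
        StringTokenizer tokenizer = new StringTokenizer(text, " .\n\r\t");
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String content = "First sentence.  Second   one.\nThird line";
        // prints: First|sentence|Second|one|Third|line
        System.out.println(String.join("|", words(content)));
    }
}
```

Because StringTokenizer skips consecutive delimiters, multiple spaces or a "." followed by a newline need no special handling.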

score 1 · Accepted Answer

String[] result = myString.split("[.\\s]+"); // '+' treats a run of dots, spaces and newlines as a single separator

answered 2012-04-13T11:19:11.593
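To see how split() behaves on the cases from the question, here is a small sketch. The class name `SplitWords` and the helper `words` are hypothetical. It assumes you want no empty strings in the result, so it strips leading separators first, because split() keeps a leading empty element but drops trailing ones.

```java
import java.util.Arrays;

public class SplitWords {
    static String[] words(String text) {
        // Remove leading separators first: split() would otherwise
        // return an empty first element for input such as "  hello".
        String trimmed = text.replaceFirst("^[.\\s]+", "");
        if (trimmed.isEmpty()) {
            // "".split(...) returns one empty string, so handle this case explicitly.
            return new String[0];
        }
        // '+' makes a run of dots and whitespace count as one separator.
        return trimmed.split("[.\\s]+");
    }

    public static void main(String[] args) {
        String content = "  Hello world.  New\nline. ";
        // prints: [Hello, world, New, line]
        System.out.println(Arrays.toString(words(content)));
    }
}
```

Note that inside a character class the "." is literal, so it does not need escaping.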

score 0 · Accepted Answer

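You may not need a regular expression for the splitting itself: just strip every non-letter character from the file content, then read each word with a tokenizer.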
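One way to read this suggestion, as a rough sketch: the names `LettersOnlyWords` and `words` are invented here, and I am assuming non-letters should be replaced by a space rather than deleted outright, since deleting the spaces themselves would glue neighbouring words together. The cleanup step below does use a small regex, which is a swap-in on my part; digits and apostrophes are also treated as separators, so "don't" comes out as "don" and "t".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class LettersOnlyWords {
    static List<String> words(String text) {
        // Assumption: every run of non-letter characters becomes one space,
        // so punctuation, digits and line breaks all act as word boundaries.
        String lettersOnly = text.replaceAll("[^\\p{L}]+", " ");
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer tokenizer = new StringTokenizer(lettersOnly);
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [Hi, there, OK]
        System.out.println(words("Hi, there.\nOK 42!"));
    }
}
```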

score -1 · Accepted Answer

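I'd suggest tokenizing by hand for this... just walk over each character and decide what to do based on what the character is. Here is some pseudocode: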

string word = "";

while ( not EOF ){

    char = getNextChar()

    if ( char is not a space, newline or full stop ){
        append the char to the word
    }
    else {
        if ( the word is empty ){ continue /* ignore repeated separators */ }
        else {
            add the word to an array of words
            reset the word to ""
        }
    }
}

if ( the word is not empty ){ add the word to an array of words /* flush the last word */ }

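This way you have full control over how the data is processed, and you don't have to worry about squeezing crazy edge cases into regex rules. Best of all, it is the most efficient approach (definitely better than a regex), and you only pass over the data once.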
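The character-by-character idea from this answer could look roughly like this in Java. It is only a sketch: `CharScanWords` and `words` are hypothetical names, it works on a String that is already in memory (as in the question) rather than reading until EOF, and it assumes any whitespace character counts as a separator alongside ".". It also emits the final word when the input does not end with a separator.

```java
import java.util.ArrayList;
import java.util.List;

public class CharScanWords {
    static String[] words(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '.' && !Character.isWhitespace(c)) {
                // Ordinary character: extend the current word.
                word.append(c);
            } else if (word.length() > 0) {
                // Separator after a word: store the word and start a new one.
                words.add(word.toString());
                word.setLength(0);
            }
            // Separator with an empty word: nothing to do (repeated separators).
        }
        // Flush the last word if the text did not end with a separator.
        if (word.length() > 0) {
            words.add(word.toString());
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints: one|two|three|four
        System.out.println(String.join("|", words("one  two.\nthree. four")));
    }
}
```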
