I have to remove common words (e.g. is, are, am, was, etc.) from a text file. What is an efficient way to do this in Java?
Viewed 3568 times
1 answer
4
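You have to read the file in, skip the words you want to remove, and then write the file back out again.
Because of that, you may prefer to simply skip the words you want to ignore each time you read the file, depending on your use case.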
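If skipping at read time fits your use case, a minimal sketch using only the JDK might look like the following. The class name `StopWordReader`, the method names, and the stop-word list are made up for illustration. It assumes Java 9+ (for `Set.of`), UTF-8 input, and that matching should ignore case.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StopWordReader {
    // Hypothetical stop-word list; extend it to suit your data.
    static final Set<String> STOP_WORDS = Set.of("is", "are", "am", "was");

    // Splits one line on whitespace and drops every stop word (case-insensitive).
    static List<String> filterLine(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    // Reads the file lazily, line by line, and returns the remaining words.
    // The file on disk is left untouched.
    static List<String> readWithoutStopWords(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.flatMap(line -> filterLine(line).stream())
                    .collect(Collectors.toList());
        }
    }
}
```

Note that this only matches whole whitespace-separated tokens, so a word with punctuation attached (such as "was,") would not be dropped. Whether that matters depends on your input.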
To actually remove the words line by line (which may not be the way you want to do it), you could do something like this (using Google Guava):
import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableSet;
import java.util.Set;
import java.util.regex.Pattern;

// the words you want to remove from the file (matching is case-sensitive):
Set<String> wordsToRemove = ImmutableSet.of("a", "for");

// this code would run in a loop, reading one line after another from the file
String line = "Some words read from a file for example";
StringBuilder outputLine = new StringBuilder();
for (String word : Splitter.on(Pattern.compile("\\s+")).trimResults().omitEmptyStrings().split(line)) {
    if (!wordsToRemove.contains(word)) {
        // separate the kept words with a single space
        if (outputLine.length() > 0) {
            outputLine.append(' ');
        }
        outputLine.append(word);
    }
}

// here I'm just printing, but this line could now be written to the output file
System.out.println(outputLine.toString());
Running this code will output:
Some words read from file example
That is, "a" and "for" have been omitted.
Notice that this makes for simple code, but it will change the whitespace formatting in your file. If a line contains doubled-up spaces, tabs, etc., they are all collapsed to a single space by this code. This is just an example of how you might do it; depending on your requirements, there will probably be better ways.
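If the original spacing matters, one possible alternative is a regular expression that deletes each unwanted word plus at most one space or tab after it, leaving all other whitespace as found. This is only a sketch using the JDK: the class name `StopWordStripper` and its methods are hypothetical, it assumes UTF-8 files small enough to read into memory, and it matches case-insensitively on `\b` word boundaries (so "was" in "was-not" would also be removed).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class StopWordStripper {
    // Builds one pattern matching any of the words as a whole word,
    // plus at most one trailing space or tab.
    static Pattern buildPattern(Collection<String> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("at least one word is required");
        }
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b[ \\t]?",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every match from a single line; untouched text keeps its spacing.
    static String strip(String line, Pattern stopWords) {
        return stopWords.matcher(line).replaceAll("");
    }

    // Reads the whole input, strips each line, and writes the result to 'out'.
    static void rewrite(Path in, Path out, Collection<String> words) throws IOException {
        Pattern pattern = buildPattern(words);
        List<String> cleaned = Files.readAllLines(in, StandardCharsets.UTF_8).stream()
                .map(line -> strip(line, pattern))
                .collect(Collectors.toList());
        Files.write(out, cleaned, StandardCharsets.UTF_8);
    }
}
```

Because the input is read completely before anything is written, `in` and `out` may be the same path. For very large files you would probably want to stream to a temporary file instead.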
answered 2012-04-20T10:18:11.297