java - I need an elegant way to exclude specific words from processing

Question

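I'm writing an algorithm to extract likely keywords from the text of a document. I want to count the occurrences of each word and take the top 5 as keywords. Obviously, I want to exclude "insignificant" words so that not every document ends up with "the" and "and" as its top keywords.

Here is the strategy I have used successfully for testing: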

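// add() takes one element; addAll with Arrays.asList (java.util.Arrays) adds several
exclusions = new ArrayList<String>();
exclusions.addAll(Arrays.asList("a", "and", "the", "or"));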

Now I want to run a real-world test, and my exclusion list is close to 200 words long. I would like to be able to do something like this:

exclusions = new ArrayList<String>();
exclusions.add(each word in foo.txt);

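In the long run, for obvious reasons, I need to maintain an external list rather than one embedded in my code. With all the file read/write methods in Java, I'm fairly sure this can be done, but my searches have come up empty... I know I must be searching for the wrong keywords. Does anyone know an elegant way to include an external list in the processing?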

score 1 · Accepted Answer

You can use a FileReader to read Strings from the file and add them to an ArrayList.

private List<String> createExclusions(String file) throws IOException {
   List<String> exclusions = new ArrayList<String>();

   // try-with-resources closes the reader even if readLine() throws
   try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
      String word;
      while ((word = reader.readLine()) != null) {
         exclusions.add(word);   // one exclusion word per line
      }
   }

   return exclusions;
}

You can then create the list with List<String> exclusions = createExclusions("exclusions.txt");

score 1 · Accepted Answer

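This doesn't directly address the solution you prescribed, but it may give you another, possibly better, avenue.

Rather than deciding up front what is useless, you could count everything and then filter out whatever you consider insignificant (from an information-carrying standpoint) because of its overwhelming presence. It is similar to the low-pass filter used in signal processing to remove noise.

In short, count everything. Then decide that anything appearing more often than a threshold you set carries no information. You will have to determine that threshold by experiment; for example, if 5% of all words are "the", then "the" carries no information.

If you do it this way, it will even work for foreign languages.

Just my two cents.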
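
For what it's worth, the count-then-threshold idea could be sketched roughly as below. This is only an illustration under assumptions: the class name FrequencyFilter, the method keywords, the crude letter-based tokenizer, and the 0.2 cutoff are all made up for the example, and a real threshold would have to come from experimenting on your own documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FrequencyFilter {

    /**
     * Counts every word, drops those whose share of all tokens exceeds
     * maxShare, and returns up to `limit` of the rest, most frequent first.
     */
    static List<String> keywords(String text, double maxShare, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        // Crude tokenizer (an assumption): lower-case, split on non-letter characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) continue; // split() can yield an empty leading token
            counts.merge(token, 1, Integer::sum);
            total++;
        }

        // Keep only words at or below the frequency threshold.
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if ((double) e.getValue() / total <= maxShare) kept.add(e);
        }

        // Highest count first; ties broken alphabetically for a stable result.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (int i = 0; i < kept.size() && i < limit; i++) {
            result.add(kept.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "the cat sat on the mat and the cat saw the dog";
        // "the" makes up 4 of 12 tokens, so a 0.2 cutoff removes it.
        System.out.println(keywords(text, 0.2, 5));
    }
}
```

On a sample this small, low-count function words such as "and" still get through, which is why the threshold only becomes meaningful on longer texts.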

score 0 · Accepted Answer

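The Google Guava library contains many useful methods that simplify everyday tasks. You can use one of them to read the contents of a file into a string and split it on whitespace: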

String contents = Files.toString(new File("foo.txt"), Charset.defaultCharset());
// "\\s+" treats a run of whitespace (including line breaks) as one separator
List<String> exclusions = Lists.newArrayList(contents.trim().split("\\s+"));

Apache Commons IO offers a similar shortcut:

String contents = FileUtils.readFileToString(new File("foo.txt"));
...

score 0 · Accepted Answer

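Not sure whether it's elegant, but here is a simple solution I created a few years ago to detect the language of tweets or remove noise words from them:

TweetDetector.java
JTweet.java, using data such as English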

score 0 · Accepted Answer

Commons-io has utilities that support this. Include commons-io as a dependency, then issue

File myFile = ...;
List<String> exclusions = FileUtils.readLines( myFile );

As documented at: http://commons.apache.org/io/apidocs/org/apache/commons/io/FileUtils.html

This assumes that each exclusion word is on its own line.
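
If adding a dependency is not an option, the JDK's own java.nio.file.Files.readAllLines (Java 7 and later) reads a file the same one-word-per-line way. This is a different call from the Commons IO one above, shown only as a sketch; the class name ReadLinesExample, the UTF-8 encoding, and the temporary file are assumptions made so the example runs on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ReadLinesExample {

    // Reads the whole file into a list, one entry per line.
    static List<String> readExclusions(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file so the example does not depend on foo.txt.
        Path tmp = Files.createTempFile("exclusions", ".txt");
        try {
            Files.write(tmp, Arrays.asList("the", "a", "and", "or"), StandardCharsets.UTF_8);
            System.out.println(readExclusions(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```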

score 0 · Accepted Answer

Reading from a file is quite simple.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

public class ExcludeExample {
    // Reads one exclusion word per line into a set.
    public static HashSet<String> readExclusions(File file) throws IOException{
        HashSet<String> exclusions = new HashSet<String>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                exclusions.add(line);
            }
        }
        return exclusions;
    }

    public static void main(String[] args) throws IOException{
        File foo = new File("foo.txt");
        HashSet<String> exclusions = readExclusions(foo);
        System.out.println(exclusions.contains("the"));
        System.out.println(exclusions.contains("Java"));
    }
}

foo.txt

the
a
and
or

I used a HashSet rather than an ArrayList because its lookups are faster.
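
To tie this back to the question, here is a rough sketch of how such a set might be used while counting words for the top 5 keywords. The class name TopKeywords, the method topKeywords, and the simple tokenizer are hypothetical and not part of the code above. Note that it lower-cases the text, so the entries in the exclusion file would need to be lower-case as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TopKeywords {

    // Counts the words that are not excluded and returns the `limit` most frequent.
    static List<String> topKeywords(String text, Set<String> exclusions, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Simple tokenizer (an assumption): lower-case, split on non-letter characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // HashSet.contains is a hash lookup, so the cost per word does not
            // grow with the size of the exclusion list.
            if (token.isEmpty() || exclusions.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically for a stable result.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> exclusions = new HashSet<>(Arrays.asList("the", "a", "and", "or"));
        System.out.println(topKeywords("The cat and the dog and the cat", exclusions, 5));
    }
}
```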
