java - Java Stanford NLP: Find word frequency?

Question

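I am using the Stanford NLP parser toolkit. Given a word in the lexicon, how can I find its frequency*? Or, given a frequency rank, how can I determine the corresponding word?

*In the language as a whole, not just in a text sample.

Here is the demo for the toolkit I am using: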

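import java.util.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
  public static void main(String[] args) {
    // Load the serialized English PCFG grammar and set parser options.
    LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

    // Parse a pre-tokenized sentence and print the tree in Penn Treebank format.
    String[] sent = { "Sincerity", "may", "frighten", "the", "boy", "." };
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    parse.pennPrint();
    System.out.println();

    // Derive the collapsed typed dependencies from the parse tree.
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();

    // Print the tree and the dependencies in one step.
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

}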

score 1 · Accepted Answer

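If you are only counting word frequencies, you do not need to parse sentences at all. All you need to do is tokenize the input and then count the word frequencies with a Java HashMap. If you want to use the Stanford tools, use edu.stanford.nlp.process.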
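A minimal sketch of the tokenize-and-count approach described above, using only the JDK. The class name WordFrequency and the regex-based splitting are assumptions made for illustration. The regex is a crude stand-in for a real tokenizer, and the counts describe only the text passed in, not the language as a whole.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of the Stanford toolkit.
class WordFrequency {
    // Counts how often each lowercased word occurs in the given text.
    // Splitting on runs of non-letters is a simplification: it also
    // breaks contractions such as "don't" into two tokens.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = count("The boy may frighten the other boy.");
        System.out.println(freq.get("the"));                   // prints 2
        System.out.println(freq.getOrDefault("sincerity", 0)); // prints 0
    }
}
```

Looking up a word is then a single map access, and getOrDefault covers words that never occurred in the input.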

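This gives you the frequency of any given word, but in general it may not be possible to find the single word that corresponds to a given frequency rank, because some words in the document may be equally frequent.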
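One way to deal with ties is to let a rank map to a group of words rather than to a single word. The sketch below assumes a word-to-count map has already been built. The class name FrequencyRank and the choice of dense ranking (tied words share a rank and the next rank follows immediately) are illustrative assumptions, not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration.
class FrequencyRank {
    // Returns word groups ordered from most to least frequent.
    // Index 0 is rank 1; words with equal counts land in the same group.
    static List<List<String>> ranks(Map<String, Integer> freq) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> words : byCount.values()) {
            Collections.sort(words); // stable output order within a tie
            result.add(words);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("the", 3);
        freq.put("boy", 2);
        freq.put("may", 2);
        freq.put("frighten", 1);
        System.out.println(ranks(freq).get(1)); // prints [boy, may]
    }
}
```

Here rank 2 returns both "boy" and "may", which makes the ambiguity explicit instead of picking one of them arbitrarily.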

score 0 · Accepted Answer

This is more of an IR (information retrieval) problem than an NLP one. You should look at a library like Lucene for this task.
