java - StanfordCoreNLP does not work the way I expect

Question

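I am using the code below, but the result is not what I expected. The output is [machine, Learning], but I want to get [machine, learn]. How can I do that? Also, when my input is "biggest bigger", I want to get something like [big, big], but the output is just [biggest bigger].

(PS: I only added these four jars to my Eclipse project: joda-time.jar, stanford-corenlp-3.3.1-models.jar, stanford-corenlp-3.3.1.jar, xom.jar. Do I need to add any others?)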

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");


        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    // Test
    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "Machine Learning\n";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}

score 4 · Accepted Answer

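Ideally, lemmatization returns the canonical form (known as the "lemma" or "headword") of a group of words. This canonical form, however, is not always what we intuitively expect. For example, you expect "learning" to yield the lemma "learn". But the noun "learning" has the lemma "learning", and only the present-participle verb "learning" has the lemma "learn". In ambiguous cases, the lemmatizer has to rely on the part-of-speech tag.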
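To make the role of the part-of-speech tag concrete, here is a toy sketch in which the same surface form maps to different lemmas depending on its Penn Treebank tag. The class name `PosAwareLemmaDemo` and its two-entry table are made up for illustration; this is not how CoreNLP implements lemmatization, only a picture of why the tag matters.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: a hand-written lookup table, not CoreNLP code.
class PosAwareLemmaDemo {

    // Key is "word/TAG" (Penn Treebank tags); value is the lemma.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("learning/NN", "learning"); // noun: already the canonical form
        LEMMAS.put("learning/VBG", "learn");   // verb, -ing form of "learn"
    }

    // Returns the lemma for a word and its POS tag,
    // falling back to the word itself when the pair is unknown.
    static String lemma(String word, String tag) {
        return LEMMAS.getOrDefault(word + "/" + tag, word);
    }

    public static void main(String[] args) {
        System.out.println(lemma("learning", "NN"));  // prints "learning"
        System.out.println(lemma("learning", "VBG")); // prints "learn"
    }
}
```

In the question's code, printing each token's POS tag next to its lemma would show which of the two readings the tagger chose for "Learning".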

OK, that explains "machine learning", but what about big, bigger and biggest?

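Lemmatization depends on morphological analysis. The Stanford Morphology class computes the base form of English words by removing only inflections (not derivational morphology). That is, it handles only noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer written in flex by John Carroll et al. I couldn't find the original version, but a Java version appears to be available.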

That is why "biggest" does not yield "big".
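If you still need [big, big], one possible workaround is to post-process tokens after lemmatization. The sketch below assumes the tagger has labelled the token JJR (comparative) or JJS (superlative) and strips the suffix with a few naive rules. `AdjectiveDegreeStripper` is a hypothetical helper, not part of CoreNLP, and it is deliberately incomplete: for instance, it does not restore a dropped final "e", so "larger" would come back as "larg".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of CoreNLP: naive comparative/superlative stripping.
class AdjectiveDegreeStripper {

    // A few irregular forms; a real list would be much longer.
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("better", "good");
        IRREGULAR.put("best", "good");
        IRREGULAR.put("worse", "bad");
        IRREGULAR.put("worst", "bad");
    }

    // Returns a guessed base form for lowercase adjectives tagged JJR or JJS;
    // any other tag leaves the word unchanged.
    static String baseForm(String word, String tag) {
        if (!tag.equals("JJR") && !tag.equals("JJS")) {
            return word;
        }
        String irregular = IRREGULAR.get(word);
        if (irregular != null) {
            return irregular;
        }
        String suffix = tag.equals("JJR") ? "er" : "est";
        if (!word.endsWith(suffix) || word.length() < suffix.length() + 2) {
            return word;
        }
        String stem = word.substring(0, word.length() - suffix.length());
        int n = stem.length();
        char last = stem.charAt(n - 1);
        if (last == 'i') {
            // happi(er) -> happy
            return stem.substring(0, n - 1) + "y";
        }
        boolean doubled = last == stem.charAt(n - 2);
        boolean vowel = "aeiou".indexOf(last) >= 0;
        // Undo consonant doubling (bigg -> big), but keep ll/ss/zz (tall, gross).
        if (doubled && !vowel && last != 'l' && last != 's' && last != 'z') {
            return stem.substring(0, n - 1);
        }
        return stem;
    }

    public static void main(String[] args) {
        System.out.println(baseForm("bigger", "JJR"));  // prints "big"
        System.out.println(baseForm("biggest", "JJS")); // prints "big"
    }
}
```

In the question's `lemmatize` method, this would be applied to each token's lemma together with its POS tag. A dictionary-backed lemmatizer is likely to be more reliable than suffix rules like these.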

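The WordNet lexical database does resolve to the correct lemmas, though. I usually use WordNet for lemmatization tasks and have not run into any major issues so far. Two other well-known tools that handle your example correctly are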
