nlp - How do I do word stemming or lemmatization?

Question

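I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get fewer than half of them right.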

score 145 · Accepted Answer

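If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus before using it. This can be done with: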

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

score 29 · Accepted Answer

I use Stanford NLP to perform lemmatization. I was stuck on a similar problem for the last few days, and Stack Overflow helped me solve it.

import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class example
{
    public static void main(String[] args)
    {
        // Run the tokenizer, sentence splitter, POS tagger and lemmatizer.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
        String text = "cats running ran cactus cactuses cacti community communities"; // the string you want
        Annotation document = pipeline.process(text);

        for(CoreMap sentence: document.get(SentencesAnnotation.class))
        {
            for(CoreLabel token: sentence.get(TokensAnnotation.class))
            {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println("lemmatized version of " + word + ": " + lemma);
            }
        }
    }
}

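If the output is used later in a classifier, it may also be a good idea to filter out stop words to reduce the number of output lemmas. Take a look at the coreNlp extension written by John Conwell.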
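The stop-word idea can be sketched in a few lines of Python, independent of CoreNLP. This is only an illustration: `STOP_WORDS` is a tiny hand-picked set and `drop_stop_words` is a hypothetical helper name, not part of any library mentioned here.

```python
# Tiny illustrative stop-word set (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "around", "between", "in", "of", "the", "to"}


def drop_stop_words(lemmas, stop_words=STOP_WORDS):
    """Return the lemmas that are neither stop words nor pure punctuation."""
    kept = []
    for lemma in lemmas:
        is_word = any(ch.isalnum() for ch in lemma)
        if is_word and lemma.lower() not in stop_words:
            kept.append(lemma)
    return kept


lemmas = ["the", "cactus", "run", "to", "the", "community", "to", "see", "the", "cat", "."]
print(drop_stop_words(lemmas))
```

Filtering after lemmatization, rather than before, means the stop list only needs base forms.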

score 24 · Accepted Answer

I tried your list of terms on this Snowball demo site and the results look okay....

cats -> cat
running -> run
ran -> ran
cactus -> cactus
cactuses -> cactus
community -> communiti
communities -> communiti

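A stemmer is supposed to reduce inflected forms of words to some common root. It is not a stemmer's job to make that root a "proper" dictionary word. For that, you need to look at morphological/orthographic analysers.

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.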

score 21 · Accepted Answer

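The stemmer vs. lemmatizer debate goes on. It is a matter of preferring precision over efficiency. You should lemmatize to get linguistically meaningful units, and stem to use minimal computing power while still indexing a word and its variations under the same key.

See Stemmers vs Lemmatizers

Here's an example with Python NLTK: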

>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'
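To make the "same key" point concrete, here is a minimal sketch of an index keyed by normalized tokens. The `NORMAL_FORMS` table is a hypothetical stand-in for whichever stemmer or lemmatizer you choose, and `build_index` is an illustrative name, not an NLTK function.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real stemmer or lemmatizer.
NORMAL_FORMS = {
    "cats": "cat",
    "running": "run",
    "ran": "run",
    "cacti": "cactus",
    "communities": "community",
}


def normalize(token):
    """Return the shared key for a token, or the token itself if unknown."""
    return NORMAL_FORMS.get(token, token)


def build_index(docs):
    """Map each normalized token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


docs = {1: "cats ran", 2: "the cat is running", 3: "cacti in the community"}
index = build_index(docs)
print(sorted(index["cat"]))  # documents containing "cat" or "cats"
print(sorted(index["run"]))  # documents containing "ran" or "running"
```

Whether the key is a dictionary word (lemma) or a truncated root (stem) does not matter for lookup, as long as queries are normalized the same way as the documents.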

score 9 · Accepted Answer

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.

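If you're really serious about good stemming, though, you'll need to start with something like the Porter algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries), where the key is the word to look up and the value is the stemmed word that replaces the original. A commercial search engine I once worked on ended up with some 800 exceptions to a modified Porter algorithm.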
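A minimal sketch of that exceptions-first layout, assuming a toy rule set: `SUFFIX_RULES` and `rule_stem` are hypothetical stand-ins and do not implement the Porter algorithm, and the entries in `EXCEPTIONS` are examples only.

```python
# Toy suffix rules standing in for a real algorithm; this is NOT the Porter stemmer.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

# Exceptions checked before the rules: key = word to look up, value = stem to use.
EXCEPTIONS = {"ran": "run", "running": "run", "cacti": "cactus", "cactus": "cactus"}


def rule_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem(word):
    """Return the exception entry if there is one, else fall back to the rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return rule_stem(word)


words = "cats running ran cactus cacti communities".split()
print([stem(w) for w in words])
```

Keeping the exceptions in a plain dictionary means the list can be grown from logged mistakes without touching the rule code.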

score 6 · Accepted Answer

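Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you like), then find the parts of speech (POS) for those words and use them to help stem and lemmatize.

Your sample above doesn't work too well, because the POS can't be determined. However, if we use a real sentence, things work much better.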

import nltk
from nltk.corpus import wordnet

lmtzr = nltk.WordNetLemmatizer().lemmatize


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


def normalize_text(text):
    word_pos = nltk.pos_tag(nltk.word_tokenize(text))
    lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]

    return [x.lower() for x in lemm_words]

print(normalize_text('cats running ran cactus cactuses cacti community communities'))
# ['cat', 'run', 'ran', 'cactus', 'cactuses', 'cacti', 'community', 'community']

print(normalize_text('The cactus ran to the community to see the cats running around cacti between communities.'))
# ['the', 'cactus', 'run', 'to', 'the', 'community', 'to', 'see', 'the', 'cat', 'run', 'around', 'cactus', 'between', 'community', '.']

score 5 · Accepted Answer

http://wordnet.princeton.edu/man/morph.3WN

For many of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

score 3 · Accepted Answer

Take a look at WordNet, a large lexical database for the English language:

http://wordnet.princeton.edu/

There are APIs for accessing it in several languages.

score 2 · Accepted Answer

This looks interesting: MIT Java WordnetStemmer: http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html

score 2 · Accepted Answer

Take a look at LemmaGen - an open source library written in C# 3.0.

Results for your test words (http://lemmatise.ijs.si/Services)

cats -> cat
running
ran -> run
cactus
cactuses -> cactus
cacti -> cactus
community
communities -> community

score 2 · Accepted Answer

answered 2018-10-07T17:19:00.773

score 1 · Accepted Answer

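Do a search for Lucene. I'm not sure if there is a PHP port, but I do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally, it and its community extras might have something interesting to look at. At the very least, you can learn how it's done in one language so you can translate the "idea" into PHP.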

score 1 · Accepted Answer

The most current version of the stemmer in NLTK is Snowball.

You can find examples of how to use it here:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo

score 1 · Accepted Answer

If I may quote my answer to the question StompChicken mentioned:

The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.

As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".

If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.

score 1 · Accepted Answer

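You could use the Morpha stemmer. UW has uploaded the Morpha stemmer to Maven Central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are thread-safe (the original JFlex had class fields for local variables unnecessarily). Instantiate the class and run morpha on the word you want to stem.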

new MorphaStemmer().morpha("climbed") // goes to "climb"

score 0 · Accepted Answer

Try this one here: http://www.twinword.com/lemmatizer.php

I entered your query "cats running ran cactus cactuses cacti community communities" in the demo and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.

Sample Code

This is an API, so you can connect to it from any environment. Here's what the PHP REST call may look like.

// These code snippets use an open-source library. http://unirest.io/php
$response = Unirest\Request::post([ENDPOINT],
  array(
    "X-Mashape-Key" => [API KEY],
    "Content-Type" => "application/x-www-form-urlencoded",
    "Accept" => "application/json"
  ),
  array(
    "text" => "cats running ran cactus cactuses cacti community communities"
  )
);

score 0 · Accepted Answer

Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There is an English Stemmer for C and Java.

He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.

From http://tartarus.org/~martin/PorterStemmer/index.html (emphasis mine)

The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which derives from it, and which is subjected to occasional improvements. For practical work, therefore, the new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable.

Dr. Porter suggests using the English or Porter2 stemmers instead of the Porter stemmer. The English stemmer is what's actually used in the demo site, as @StompChicken answered earlier.

score 0 · Accepted Answer

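.NET Lucene has a built-in Porter stemmer. You can try that. But note that Porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works.)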

score 0 · Accepted Answer

In Java, I use tartarus-snowball to stem words.

Maven:

<dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-snowball</artifactId>
        <version>3.0.3</version>
        <scope>test</scope>
</dependency>

Sample code:

SnowballProgram stemmer = new EnglishStemmer();
String[] words = new String[]{
    "testing",
    "skincare",
    "eyecare",
    "eye",
    "worked",
    "read"
};
for (String word : words) {
    stemmer.setCurrent(word);
    stemmer.stem();
    //debug
    logger.info("Origin: " + word + " > " + stemmer.getCurrent());// result: test, skincar, eyecar, eye, work, read
}

score 0 · Accepted Answer

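I strongly recommend using spaCy (base text parsing and tagging) and Textacy (higher-level text processing built on top of spaCy).

Lemmatized words are available by default in spaCy as a token's .lemma_ attribute, and text can be lemmatized while doing a lot of other text preprocessing with textacy, for example while creating a bag of terms or words, or generally just before performing some processing that requires it.

I'd encourage you to check out both before writing any code, as this may save you a lot of time!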

score -1 · Accepted Answer

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

df_plots = pd.read_excel("Plot Summary.xlsx", index_col = 0)
df_plots
# Printing first sentence of first row and last sentence of last row
nltk.sent_tokenize(df_plots.loc[1].Plot)[0] + nltk.sent_tokenize(df_plots.loc[len(df_plots)].Plot)[-1]

# Calculating length of all plots by words
df_plots["Length"] = df_plots.Plot.apply(lambda x : 
len(nltk.word_tokenize(x)))

print("Longest plot is for season", df_plots.Length.idxmax())

print("Shortest plot is for season", df_plots.Length.idxmin())



# What is this show about? (What are the top 3 words used, excluding the stop words, in all the seasons combined?)

wnl = WordNetLemmatizer()

# WordNet only accepts 'n', 'v', 'a' and 'r'; map Treebank 'J*' tags to 'a' and default to noun
def to_wordnet_pos(tag):
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")

word_sample = ["struggled", "died"]
word_list = nltk.pos_tag(word_sample)
[wnl.lemmatize(word, pos = to_wordnet_pos(tag)) for word, tag in word_list]

# Figure out the stop words
stop = set(stopwords.words('english'))

# Tokenize all the plots
df_plots["Tokenized"] = df_plots.Plot.apply(lambda x : nltk.word_tokenize(x.lower()))

# Remove the stop words (build lists, not generators, so the column can be reused)
df_plots["Filtered"] = df_plots.Tokenized.apply(lambda x : [word for word in x if word not in stop])

# Lemmatize each word
df_plots["POS"] = df_plots.Filtered.apply(nltk.pos_tag)
df_plots["Lemmatized"] = df_plots.POS.apply(lambda x : [wnl.lemmatize(word, pos = to_wordnet_pos(tag)) for word, tag in x])



# Which season had the highest screenplay of "Jesse" compared to "Walt"?
# Screenplay of Jesse = (occurrences of "Jesse") / (occurrences of "Jesse" + occurrences of "Walt")

df_plots.groupby("Season").Tokenized.sum()

df_plots["Share"] = df_plots.groupby("Season").Tokenized.sum().apply(lambda x : float(x.count("jesse") * 100)/float(x.count("jesse") + x.count("walter") + x.count("walt")))

print("The highest times Jesse was mentioned compared to Walter/Walt was in season", df_plots["Share"].idxmax())
# float(df_plots.Tokenized.sum().count('jesse')) * 100 / float((df_plots.Tokenized.sum().count('jesse') + df_plots.Tokenized.sum().count('walt') + df_plots.Tokenized.sum().count('walter')))
