nlp - Porter and Lancaster stemming clarification

Question

I am stemming words with the Porter and Lancaster stemmers and have made these observations:

Input: replied
Porter: repli
Lancaster: reply

Input: twice
Porter: twice
Lancaster: twic

Input: came
Porter: came
Lancaster: cam

Input: In
Porter: In
Lancaster: in

My questions are:

Lancaster is supposed to be an "aggressive" stemmer, but it worked properly with replied. Why?
The word In remained the same in Porter, still with an uppercase I. Why?
Notice that Lancaster removes the final e from words ending in e. Why?

I am not able to understand these concepts. Could you please help?

score 2 · Accepted Answer

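Q: Lancaster was supposed to be an "aggressive" stemmer, but it worked properly with `replied`. Why?

This is because the Lancaster stemmer implementation was improved in https://github.com/nltk/nltk/pull/1654

If we take a look at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L62, there is a suffix rule that changes -ied > -y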

default_rule_tuple = (
    "ai*2.",   # -ia > -   if intact
    "a*1.",    # -a > -    if intact
    "bb1.",    # -bb > -b
    "city3s.", # -ytic > -ys
    "ci2>",    # -ic > -
    "cn1t>",   # -nc > -nt
    "dd1.",    # -dd > -d
    "dei3y>",  # -ied > -y
    ...)
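The rule strings above are compact, so a small decoder can make them easier to read. The following is a minimal sketch, not NLTK's code: decode_rule and apply_rule_once are hypothetical helpers, and the field meanings (ending stored reversed, `*` for "if intact", a digit for the number of letters to remove, optional letters to append, `>` to continue or `.` to stop) are inferred from the comments in the tuple and from the Paice/Husk rule format. It ignores every other check the real stemmer performs.

```python
import re

# Same shape as the validation pattern quoted from parseRules, with capture
# groups added so the individual fields can be read out.
RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.]?)$")


def decode_rule(rule):
    """Split one rule string into its fields (hypothetical helper)."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError("The rule {0} is invalid".format(rule))
    reversed_ending, intact, remove_count, append, flag = match.groups()
    return {
        "ending": reversed_ending[::-1],  # the ending is stored reversed
        "intact_only": intact == "*",     # '*' marks the "if intact" rules
        "remove": int(remove_count),      # number of trailing letters to drop
        "append": append,                 # letters added after the removal
        "continue": flag == ">",          # '>' keep stemming, '.' stop
    }


def apply_rule_once(word, rule):
    """Apply a single rule to a word, ignoring every other check."""
    fields = decode_rule(rule)
    if not word.endswith(fields["ending"]):
        return word
    return word[: len(word) - fields["remove"]] + fields["append"]


print(decode_rule("dei3y>"))
print(apply_rule_once("replied", "dei3y>"))  # reply
```

Read this way, "dei3y>" says: for a word ending in -ied, drop three letters, append y, and keep going.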

The feature allows users to supply new rules. If no additional rules are given, parseRules falls back to self.default_rule_tuple, and that is where rule_tuple gets applied: https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L196

def parseRules(self, rule_tuple=None):
    """Validate the set of rules used in this stemmer.
    If this function is called as an individual method, without using stem
    method, rule_tuple argument will be compiled into self.rule_dictionary.
    If this function is called within stem, self._rule_tuple will be used.
    """
    # If there is no argument for the function, use class' own rule tuple.
    rule_tuple = rule_tuple if rule_tuple else self._rule_tuple
    valid_rule = re.compile("^[a-z]+\*?\d[a-z]*[>\.]?$")
    # Empty any old rules from the rule set before adding new ones
    self.rule_dictionary = {}

    for rule in rule_tuple:
        if not valid_rule.match(rule):
            raise ValueError("The rule {0} is invalid".format(rule))
        first_letter = rule[0:1]
        if first_letter in self.rule_dictionary:
            self.rule_dictionary[first_letter].append(rule)
        else:
            self.rule_dictionary[first_letter] = [rule]
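The loop at the end of parseRules groups the rules by their first letter. Because each rule stores its ending reversed, that first letter is the last letter of the words the rule can match. Here is a minimal standalone sketch of the same grouping; build_rule_dictionary is a hypothetical name, the sample tuple is only a few of the rules quoted in this answer, and the lookup by a word's last letter is my reading of how the dictionary is meant to be used.

```python
def build_rule_dictionary(rule_tuple):
    """Group rule strings by their first letter, as the quoted loop does."""
    rule_dictionary = {}
    for rule in rule_tuple:
        rule_dictionary.setdefault(rule[0], []).append(rule)
    return rule_dictionary


# A handful of rules taken from the tuple quoted in this answer.
sample_rules = ("ai*2.", "a*1.", "bb1.", "dd1.", "dei3y>", "e1>")
rule_dictionary = build_rule_dictionary(sample_rules)

# Candidate rules for a word are the ones filed under its last letter.
print(rule_dictionary.get("replied"[-1], []))  # ['dd1.', 'dei3y>']
print(rule_dictionary.get("came"[-1], []))     # ['e1>']
```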

The default_rule_tuple actually comes from the whoosh implementation of the paice-husk stemmer, a.k.a. the Lancaster stemmer: https://github.com/nltk/nltk/pull/1661 =)

Q: The word In remained the same in Porter, still uppercase In. Why?

This is super interesting! And most probably a bug.

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'

If we look at the code, the first thing PorterStemmer.stem() does is lowercase the word: https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L651

def stem(self, word):
    stem = word.lower()

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]

    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word

    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)

    return stem

Every other path returns stem, which is lowercased, but there are two if clauses that return some form of the original word, which is not lowercased!

if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[word]

if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return word

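The first if clause checks whether the word is in self.pool, which contains the irregular words and their stems.

The second checks whether len(word) <= 2 and, if so, returns the word in its original form. For "In" the second if clause is True, so the original, non-lowercased form is returned.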
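The casing leak can be reproduced without NLTK. The sketch below is a simplified stand-in for the quoted stem() method, not the library's code: it keeps only the lowercasing and the length check, and both function names are hypothetical. The second function shows one possible way to keep the output lowercase, which is my assumption about a reasonable fix rather than what NLTK does.

```python
def short_word_check_as_quoted(word):
    """Mimic only the lowercasing and length check from the excerpt above."""
    stem = word.lower()
    if len(word) <= 2:
        return word  # returns the caller's string, original casing included
    return stem      # the real method would run steps 1a to 5b here


def short_word_check_lowercased(word):
    """Same check, but returning the lowercased string (assumed fix)."""
    stem = word.lower()
    if len(stem) <= 2:
        return stem
    return stem      # again, the stemming steps are left out of this sketch


print(short_word_check_as_quoted("In"))   # In
print(short_word_check_lowercased("In"))  # in
```

Until such a change is made in the library, lowercasing the input yourself before calling the stemmer sidesteps the issue.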

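Q: Notice that Lancaster is removing the final `e` from words ending in `e`. Why?

Not surprisingly, this also comes from default_rule_tuple https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L67, where there is a rule that changes -e > - =)

Q: How can I disable the `-e > -` rule from `default_rule_tuple`?

(Un-)fortunately, the LancasterStemmer._rule_tuple object is an immutable tuple, so we can't simply remove an item from it, but we can override it =)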

>>> from nltk.stem import LancasterStemmer
>>> lancaster = LancasterStemmer()
>>> lancaster.stem('came')
'cam'

# Create a new stemmer object so that the cached rules are rebuilt.
>>> lancaster = LancasterStemmer()
# Find the 'e1>' rule.
>>> lancaster._rule_tuple.index('e1>')
12

# Create a temporary rule list from the tuple.
>>> temp_rule_list = list(lancaster._rule_tuple)
# Remove the rule.
>>> temp_rule_list.pop(12)
'e1>'
# Override the `._rule_tuple` variable.
>>> lancaster._rule_tuple = tuple(temp_rule_list)

# Et voila!
>>> lancaster.stem('came')
'came'
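Popping by position works, but the index depends on the exact contents of the rule tuple. A variation that filters by value avoids hard-coding 12. This is a minimal sketch on a plain tuple; without_rules is a hypothetical helper, and the sample tuple contains only a few of the rules quoted in this answer.

```python
def without_rules(rule_tuple, unwanted):
    """Return a new tuple with the unwanted rule strings filtered out."""
    unwanted = set(unwanted)
    return tuple(rule for rule in rule_tuple if rule not in unwanted)


# A few rules from this answer; a real stemmer holds many more.
sample_rules = ("dd1.", "dei3y>", "e1>")
trimmed_rules = without_rules(sample_rules, ["e1>"])

print(trimmed_rules)  # ('dd1.', 'dei3y>')
```

On a freshly created stemmer, the result of without_rules(lancaster._rule_tuple, ['e1>']) could then be assigned back to lancaster._rule_tuple, exactly as the session above does with tuple(temp_rule_list).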
