
I am developing a lemmatizer using Python, NLTK and the WordNetLemmatizer. Here is a sample that outputs what I expect:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we specify that 'worse' is an adjective

Output: 'bad'

lem.lemmatize('worse', pos=wordnet.ADV)  # here, we specify that 'worse' is an adverb

Output: 'worse'

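Well, everything is fine here. The behaviour is the same for other adjectives such as 'better' (an irregular form) or 'older' (note that the same test with 'elder' never outputs 'old', but I guess WordNet is not an exhaustive list of all existing English words).

My problem appears when I try the word 'further':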

lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective

Output: 'further'

lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb

Output: 'far'

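This is exactly the opposite of the behaviour for the word 'worse'!

Can anybody explain why? Is it a bug in the WordNet synset data, or does it come from my misunderstanding of English grammar?

Please excuse me if this question has already been answered. I searched Google and SO, but because the word "further" is so common, the keyword returns everything except something relevant.

Thank you in advance, Romain G.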


1 Answer


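WordNetLemmatizer uses the ._morphy function to look up the candidate lemmas of a word, and returns the shortest of them (or the word itself if there are none). From http://www.nltk.org/_modules/nltk/stem/wordnet.html: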

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
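
To make the selection rule concrete, here is a minimal sketch of just that last line, without NLTK. The name pick_lemma is hypothetical, and the candidate lists are assumptions standing in for whatever ._morphy would return:

```python
def pick_lemma(word, candidates):
    """Return the shortest candidate, or the word itself if there are none."""
    return min(candidates, key=len) if candidates else word


# Assumed candidate lists, for illustration only.
print(pick_lemma("worse", ["worse", "bad"]))      # the shortest candidate wins
print(pick_lemma("further", ["further"]))         # a single candidate is returned as-is
print(pick_lemma("xyzzy", []))                    # no candidates: the word is returned unchanged
```

So whenever the candidate list contains both the original form and a shorter base form, the shorter one is what you see as the output.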

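The ._morphy function applies rules iteratively to get candidate lemmas: each rule strips a suffix and replaces it according to MORPHOLOGICAL_SUBSTITUTIONS, so the candidates keep getting shorter. Only the candidates that exist in WordNet for the given part of speech are kept: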

def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
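
The flow above can be tried without WordNet installed. The following is only a rough sketch with toy data: toy_morphy, KNOWN, EXCEPTIONS and SUBSTITUTIONS are hypothetical stand-ins, the exception entries are taken from the adj.exc and adv.exc excerpts quoted in this answer, and the KNOWN set is my assumption about which forms the database contains for each part of speech.

```python
ADJ, ADV = "a", "r"

# Suffix rules per part of speech (toy copy; the adverb list is empty).
SUBSTITUTIONS = {
    ADJ: [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
    ADV: [],
}
# Entries taken from the adj.exc / adv.exc excerpts in this answer.
EXCEPTIONS = {
    ADJ: {"worse": ["bad"], "worst": ["bad"]},
    ADV: {"further": ["far"], "farther": ["far"], "better": ["well"]},
}
# Assumed (lemma, pos) pairs present in the toy database.
KNOWN = {
    ("bad", ADJ), ("worse", ADJ), ("further", ADJ), ("old", ADJ),
    ("worse", ADV), ("further", ADV), ("far", ADV), ("well", ADV),
}


def toy_morphy(form, pos):
    def apply_rules(forms):
        return [f[:-len(old)] + new
                for f in forms
                for old, new in SUBSTITUTIONS[pos]
                if f.endswith(old)]

    def filter_forms(forms):
        # Keep forms known for this pos, in order, without duplicates.
        result = []
        for f in forms:
            if (f, pos) in KNOWN and f not in result:
                result.append(f)
        return result

    # 0. The exception list takes priority over the rules.
    if form in EXCEPTIONS[pos]:
        return filter_forms([form] + EXCEPTIONS[pos][form])

    # 1-2. Apply the rules once and keep whatever is in the database.
    forms = apply_rules([form])
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. Otherwise keep applying the rules until a match is found.
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    return []


for word, pos in [("worse", ADJ), ("worse", ADV),
                  ("further", ADJ), ("further", ADV), ("older", ADJ)]:
    print(word, pos, toy_morphy(word, pos))
```

With this toy data, 'worse' gets the extra candidate 'bad' only as an adjective and 'further' gets 'far' only as an adverb, purely because of which exception list each word appears in.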

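However, if the word is in the exception list, ._morphy returns the fixed values stored in exceptions instead of applying the rules. See _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html: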

def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
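
As a sketch of what that loader does to each line, the snippet below parses a few lines in the same whitespace-separated format from an in-memory string instead of a file. parse_exceptions is a hypothetical helper, and the sample lines are the adv.exc entries quoted in this answer:

```python
def parse_exceptions(lines):
    """Map each inflected form (first column) to its base forms (the rest)."""
    mapping = {}
    for line in lines:
        terms = line.split()
        if not terms:  # skip blank lines
            continue
        mapping[terms[0]] = terms[1:]
    return mapping


ADV_SAMPLE = """\
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard
"""

adv_exceptions = parse_exceptions(ADV_SAMPLE.splitlines())
print(adv_exceptions.get("further"))  # base forms listed for 'further'
print("worse" in adv_exceptions)      # is 'worse' listed in this sample?
```

A word only gets a fixed lemma for a given part of speech if it appears in the first column of that part of speech's file.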

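Going back to your example, worse -> bad and further -> far cannot be derived from the rules, so they must come from the exception lists. And since these are hand-maintained exception lists, inconsistencies between them are bound to exist: 'worse' is listed only for adjectives and 'further' only for adverbs.

The exception lists are kept in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc.

From adv.exc: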

best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard

From adj.exc:

...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
answered 2014-04-11T06:54:22.807