nlp - NLTK stemmer: string index out of range

Question

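I have a set of pickled text documents that I want to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a django app view.

However, when stemming the documents inside the django view, PorterStemmer().stem() raises an IndexError: string index out of range exception for the string 'oed'. As a result, running the following: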

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')

raises the above error:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

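What is really odd is that running the same stemmer on the same string outside django (whether in a separate python file or in an interactive python console) produces no error. In other words: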

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

followed by:

python test.py
# successfully prints 'o'

What is causing this problem?

score 31 · Accepted Answer

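This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261, which rewrote the Porter stemmer.

I wrote a fix, which was released in NLTK 3.2.3. If you are on version 3.2.2 and want the fix, just upgrade, for example by running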

pip install -U nltk
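
To check whether a given environment (for instance the one the django app runs in) actually has the affected release, something like the following might help. It is only a sketch: `is_affected` is a hypothetical helper name, and it flags nothing beyond the single version named above.

```python
def is_affected(version):
    """Return True only for 3.2.2, the release this answer names as affected."""
    return version.strip() == "3.2.2"


if __name__ == "__main__":
    try:
        import nltk
    except ImportError:
        print("nltk is not installed in this environment")
    else:
        # nltk.__version__ is the installed version string
        print("%s affected: %s" % (nltk.__version__, is_affected(nltk.__version__)))
```

Running it once from the shell and once from inside the django process would show whether both are using the same NLTK install.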

score 3 · Answer

I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

At this point the _ends_double_consonant() method tries to evaluate word[-1] == word[-2] and fails, because a one-character word has no index -2.
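
The failure can be reproduced without NLTK. The sketch below uses hypothetical helper names and copies only the comparison from the traceback, not the real method:

```python
def last_two_equal(word):
    # Same comparison as in the traceback, with no length check.
    return word[-1] == word[-2]


def raises_index_error(word):
    """Report whether the unguarded comparison blows up for this word."""
    try:
        last_two_equal(word)
    except IndexError:
        return True
    return False


print(raises_index_error(u'o'))   # one character: word[-2] does not exist
print(raises_index_error(u'oo'))  # two characters: both indexes are valid
```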

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)

As far as I can tell, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

I just proposed this change in the related NLTK issue.
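
If upgrading or editing the installed package is not an option, the same guard could in principle be applied at runtime. This is an untested sketch, not an official NLTK mechanism: `guard_short_words` and `patch_porter_stemmer` are hypothetical names, and it relies on the private method `_ends_double_consonant`, which may not exist in other releases.

```python
import functools


def guard_short_words(method):
    """Wrap a (self, word) predicate so words shorter than 2 chars return False."""
    @functools.wraps(method)
    def wrapper(self, word):
        if len(word) < 2:
            return False
        return method(self, word)
    return wrapper


def patch_porter_stemmer():
    """Apply the guard to PorterStemmer; return True only if it was patched."""
    try:
        from nltk.stem.porter import PorterStemmer
    except ImportError:
        return False
    # Private method name taken from the traceback; assumed, not guaranteed.
    original = getattr(PorterStemmer, "_ends_double_consonant", None)
    if original is None:
        return False
    PorterStemmer._ends_double_consonant = guard_short_words(original)
    return True
```

Calling patch_porter_stemmer() once at startup, before any stemming, would be enough. Upgrading NLTK is still the cleaner fix.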
