
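(This question is about string checking in general rather than natural language processing as such, but if you treat it as an NLP problem, imagine the language is one that current analyzers cannot handle. For simplicity, I will use English strings as examples.)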

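Suppose a word can be realized in only 6 possible forms:

  1. initial letter capitalized
  2. plural form with "s"
  3. plural form with "es"
  4. capitalized + "es"
  5. capitalized + "s"
  6. the base form, with no plural or capitalization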

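Suppose I want to find the index of the first occurrence of any form of the word coach in a sentence. Is there a simpler way than these two approaches?

Long if condition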

sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

# Compare each word against all six forms of the target.
for j, i in enumerate(sentence.split(" ")):
  if i == target.capitalize() or i == target.capitalize()+"es" or \
     i == target.capitalize()+"s" or i == target+"es" or i==target+"s" or \
     i == target:
    print(j)

Iterating with try-except

variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]

words = sentence.split(" ")
for i in variations:
  try:
    # list.index raises ValueError when the form is not in the sentence.
    j = words.index(i)
    print(j)
  except ValueError:
    continue

2 Answers


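I suggest taking a look at NLTK's stem package: http://nltk.org/api/nltk.stem.html

With it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for e.g. grammatical role, tense, derivational morphology, leaving only the stem of the word."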
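To make the idea concrete without depending on NLTK, here is a minimal sketch of stem-based matching. It is not NLTK's algorithm: `naive_stem` and `first_index` are hypothetical helpers that only lowercase a word and strip a trailing "es" or "s", which is enough for the six forms in the question but wrong for real English (it turns "this" into "thi", for instance).

```python
def naive_stem(word, suffixes=("es", "s")):
    """Toy stemmer (an assumption, not NLTK): lowercase and strip one suffix."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when something is left over after removing the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word


def first_index(sentence, target):
    """Index of the first word whose toy stem equals target, or None."""
    return next(
        (i for i, w in enumerate(sentence.split()) if naive_stem(w) == target),
        None,
    )


print(first_index("this is a sentence with the Coaches", "coach"))
```

Comparing stems instead of listing every surface form keeps the check to a single equality test. A real stemmer would replace `naive_stem` here.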

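If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't care about NLTK, you should still write your code as a collection of small, easily composable utility functions, for example: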

import string

def variation(stem, word):
    # True if word is the stem or one of its plurals, ignoring case.
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    # Lazily yield (index, word) for every form of stem in the sentence.
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # Return the first (index, word) match, or None if there is none.
    for i, w  in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i,w in variations(sentence, 'coach')]))
answered 2012-11-05T17:54:03.940

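Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool for handling it. Build an RE that matches all the cases with a function like the following: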

import re

def inflect(stem):
    """Returns an RE that matches all inflected forms of stem."""
    # The trailing "?" makes the plural suffix optional, so the base form matches too.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]

If the inflection rules get more complicated than this, consider using Python's verbose REs.
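As a rough illustration of what a verbose RE could look like for the six forms in the question, here is a sketch. `inflect_verbose` is a hypothetical name, and the pattern assumes the stem is a plain word with only the "s"/"es" plurals. With `re.VERBOSE`, whitespace and `#` comments inside the pattern are ignored, so each piece of the rule can be documented in place.

```python
import re


def inflect_verbose(stem):
    """Return a compiled verbose RE for the six forms of stem (a sketch)."""
    first, rest = stem[0], stem[1:]
    pattern = r"""
        ^               # start of the word
        [%s%s]          # first letter in either case
        %s              # remainder of the stem, escaped
        (?: e? s )?     # optional plural suffix: "s" or "es"
        $               # end of the word
    """ % (re.escape(first.lower()), re.escape(first.upper()), re.escape(rest))
    return re.compile(pattern, re.VERBOSE)


sentence = "this is a sentence with the Coaches"
target = inflect_verbose("coach")

# next() with a default gives the first matching index, or None if no word matches.
first = next((i for i, w in enumerate(sentence.split()) if target.match(w)), None)
print(first)
```

New rules, such as additional suffixes, can then be added as extra commented lines rather than packed into a single dense string.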

answered 2012-11-06T13:31:06.120