
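I have a small module that takes a word's lemma and its plural form. It then searches a list of sentences for those that contain both words (singular or plural) in either order. I have it working, but I was wondering whether there is a more elegant way to build this expression. Thanks! Note: Python 2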

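import re

# each inner tuple holds the forms (singular, plural) of one lemma
words = (('cell',), ('wolf', 'wolves'))
string1 = "(?:" + "|".join(words[0]) + ")"
string2 = "(?:" + "|".join(words[1]) + ")"
# either order: group 1 ... group 2, or group 2 ... group 1
pat = ".+".join((string1, string2)) + "|" + ".+".join((string2, string1))
# pat is now: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"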

Then the search:

pat = re.compile(pat)
for sentence in sentences:
    if len(pat.findall(sentence)) != 0:
        print sentence+'\n'

2 Answers


Something like:

# raw strings, so that \b is a word boundary rather than a backspace character
[ x for x in sentences if re.search( r'\bcell\b', x ) and
        ( re.search( r'\bwolf\b', x ) or re.search( r'\bwolves\b', x ) )]
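
The same idea extends to any number of lemma groups. Below is a minimal sketch, assuming `words` is a tuple of tuples of word forms as in the question; `contains_all_groups` is a hypothetical helper name, not something from the answer, and the sample sentences are made up for illustration.

```python
import re

def contains_all_groups(sentence, groups):
    """Return True if every group has at least one form present as a whole word."""
    return all(
        any(re.search(r'\b' + re.escape(form) + r'\b', sentence) for form in group)
        for group in groups
    )

# Illustrative data (not from the original post)
words = (('cell',), ('wolf', 'wolves'))
sentences = ["The wolves circled the cell.", "A wolf howled.", "Cellular wolves."]
matches = [s for s in sentences if contains_all_groups(s, words)]
```

The search is case-sensitive here; passing `re.IGNORECASE` to `re.search` would be one way to relax that.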
answered 2013-12-07T21:56:10.507

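The problem is that once you start adding multiple compound lookaround expressions, your algorithmic complexity gets out of hand. That is going to be a fundamental issue with using regular expressions to solve this problem.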
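
The answer does not show what such a lookaround expression would look like, so here is a rough sketch of one possible form: one lookahead per lemma group, which makes word order irrelevant. `build_lookahead_pattern` is a hypothetical name, and the sample sentences are invented. Note that each added group adds another lookahead that scans the sentence again.

```python
import re

def build_lookahead_pattern(groups):
    """Build one lookahead per group so that word order no longer matters."""
    parts = []
    for group in groups:
        alternation = "|".join(re.escape(form) for form in group)
        # (?=.*\b(?:a|b)\b) succeeds if any form occurs somewhere ahead
        parts.append(r"(?=.*\b(?:" + alternation + r")\b)")
    return re.compile("".join(parts))

words = (('cell',), ('wolf', 'wolves'))
pat = build_lookahead_pattern(words)

# match() anchors at the start, so every lookahead is tried once from position 0
has_both = pat.match("The wolves circled the cell.") is not None
missing_one = pat.match("A wolf howled.") is not None
```

This assumes each sentence is a single line, since `.` does not match a newline unless `re.DOTALL` is set.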

An alternative approach is to make a single O(n) pass over each sentence with a Counter, and then query it:

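from collections import Counter
from string import punctuation

# helper function: total count of all forms of one lemma
def count_lemma(counter, *args):
    return sum(counter[word] for word in args)

for sentence in sentences:
    # one pass: count lowercased tokens with trailing punctuation stripped
    c = Counter(x.rstrip(punctuation).lower() for x in sentence.split())
    # keep the sentence only if every lemma group occurs at least once
    if all(count_lemma(c, *word) for word in words):
        print sentence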
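
For reuse, the loop above could be wrapped in a function that returns the matching sentences instead of printing them, which also keeps it runnable on both Python 2 and Python 3. This is only a sketch: `sentences_with_all` is a hypothetical name, the sample sentences are invented, and it uses `strip` rather than `rstrip` on the assumption that leading punctuation (such as an opening quote) should be removed too.

```python
from collections import Counter
from string import punctuation

def count_lemma(counter, *forms):
    """Total occurrences of any form of one lemma."""
    return sum(counter[form] for form in forms)

def sentences_with_all(sentences, groups):
    """Return the sentences in which every lemma group occurs at least once."""
    hits = []
    for sentence in sentences:
        # strip() removes punctuation on both sides, e.g. '"wolf,' -> 'wolf'
        tokens = (tok.strip(punctuation).lower() for tok in sentence.split())
        counts = Counter(tokens)
        if all(count_lemma(counts, *group) for group in groups):
            hits.append(sentence)
    return hits

# Illustrative data (not from the original post)
words = (('cell',), ('wolf', 'wolves'))
sentences = ['"Wolves," she said, "broke into the cell."', "A wolf howled."]
result = sentences_with_all(sentences, words)
```

Because a `Counter` returns 0 for missing keys, a lemma group with no forms present simply sums to 0 and fails the `all` check.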
answered 2013-12-07T22:10:31.900