
I have a list of phrases. I need to check whether any part of these phrases appears in a large block of text.

For example:

  • Marshmallows are delicious and warm
  • Giant unicorns sign wonderful melodies of the imminent apocalypse
  • The wizards assaulted the fort, but forgot their spell books at home!

The block of text is:

Marshmallows are delicious. I've been snacking on them while the wizards assaulted the fort. The unicorns sign wonderful melodies of those who forgot their spell books at home. [...]


Additional notes:

I can't rely on stop words for splitting, e.g. "and", "or", and punctuation.


Any ideas about libraries and/or strategies?

Thanks :)


2 Answers


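You can generate the "parts" of each phrase (runs of consecutive words) in decreasing order of length, and then look for those parts in the block of text. The first part that matches is the longest one the phrase and the text have in common.

For example: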

>>> text = "Marshmallows are delicious. I've been snacking on them while the wizards assaulted the fort. The unicorns sign wonderful melodies of those who forgot their spell books at home."
>>> phrase='Giant unicorns sign wonderful melodies of the imminent apocalypse'
>>> words = phrase.split()
>>> parts = list()
>>> for length in range(len(words), 3, -1):  # the stop value 3 is exclusive, so a part is at least 4 words
...     for start in range(0, len(words) - length + 1):
...         parts.append(' '.join(words[start:start + length]))
...
>>> # A step of -1 ensures the list is sorted in decreasing order of length.
>>> parts
['Giant unicorns sign wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies of the imminent', 'unicorns sign wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies of the', 'unicorns sign wonderful melodies of the imminent', 'sign wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies of', 'unicorns sign wonderful melodies of the', 'sign wonderful melodies of the imminent', 'wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies', 'unicorns sign wonderful melodies of', 'sign wonderful melodies of the', 'wonderful melodies of the imminent', 'melodies of the imminent apocalypse', 'Giant unicorns sign wonderful', 'unicorns sign wonderful melodies', 'sign wonderful melodies of', 'wonderful melodies of the', 'melodies of the imminent', 'of the imminent apocalypse']
>>> found = None  # stays None if no part occurs in the text
>>> for part in parts:
...     if part.lower() in text.lower():  # compare case-insensitively
...         found = part
...         break
...

>>> found
'unicorns sign wonderful melodies of'
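
The session above can be wrapped in a reusable function. This is only a sketch of the same idea, not a tuned implementation: the names `longest_shared_part`, `min_words` and `TEXT` are made up for illustration, and matching is still a plain case-insensitive substring test, so punctuation attached to a word ("fort," vs. "fort.") prevents a match.

```python
TEXT = ("Marshmallows are delicious. I've been snacking on them while the "
        "wizards assaulted the fort. The unicorns sign wonderful melodies of "
        "those who forgot their spell books at home.")


def longest_shared_part(phrase, text, min_words=4):
    """Return the longest run of consecutive words of `phrase` found in `text`.

    Returns None if no run of at least `min_words` words occurs in the text.
    `min_words` is a hypothetical parameter, not part of the original session.
    """
    words = phrase.split()
    haystack = text.lower()  # lower-case once instead of on every test
    # Try the longest parts first, so the first hit is the longest match.
    for length in range(len(words), min_words - 1, -1):
        for start in range(len(words) - length + 1):
            part = " ".join(words[start:start + length])
            if part.lower() in haystack:
                return part
    return None


phrases = [
    "Marshmallows are delicious and warm",
    "Giant unicorns sign wonderful melodies of the imminent apocalypse",
    "The wizards assaulted the fort, but forgot their spell books at home!",
]
for phrase in phrases:
    print(repr(longest_shared_part(phrase, TEXT, min_words=3)))
```

Because the check is a substring test, a part can also match in the middle of a longer word; splitting the text into words first would avoid that at the cost of having to decide how to handle punctuation.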
answered 2012-10-07T01:18:45.490

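Take a look at Xapian for storing searchable information and retrieving it (results = results!), as well as the Levenshtein distance algorithm, for which several modules are available.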
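
To illustrate what the Levenshtein distance measures, here is a small pure-Python sketch. It is not one of the modules alluded to above, and the function name `levenshtein` is chosen only for this example. It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b.

    Each insertion, deletion or substitution of one character costs 1.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] is the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("fort,", "fort."))
```

A small distance between a phrase fragment and a window of the text (for instance "fort," against "fort.") could then be treated as a match, which tolerates differences in punctuation that an exact substring test rejects.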

answered 2012-10-07T01:21:38.797