How can I test whether a phrase is in a large (650k) list of phrases when that list contains special categories?

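For example, I want to test whether the phrase ["he", "had", "the", "nerve"] is in the list. It is, but only as ["he", "had", "!DETERMINER", "nerve"], where "!DETERMINER" is the name of a word class that contains several alternatives (a, an, the). I have about 350 word classes, some of them quite long, so I don't think it is feasible to enumerate every item in the list that has one (or more) word classes.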

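I would like to use a set of these phrases rather than slowly walking through a list, but I don't know how to handle the variability of the word classes. Speed is very important, since I need to make hundreds of thousands of comparisons each time.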

3 Answers

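Similar to pjwerneck's suggestion, you can use a tree (or, more specifically, a trie) to store the list in parts, but extend it so that it treats categories specially.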

# phrase_trie.py

from collections import defaultdict

CATEGORIES = {"!DETERMINER": set(["a","an","the"]),
              "!VERB": set(["walked","talked","had"])}

def get_category(word):
    for name,words in CATEGORIES.items():
        if word in words:
            return name
    return None

class PhraseTrie(object):
    def __init__(self):
        self.children = defaultdict(PhraseTrie)
        self.categories = defaultdict(PhraseTrie)

    def insert(self, phrase):
        if not phrase: # nothing to insert
            return

        this=phrase[0]
        rest=phrase[1:]

        if this in CATEGORIES: # it's a category name
            self.categories[this].insert(rest)
        else:
            self.children[this].insert(rest)

    def contains(self, phrase):
        if not phrase:
            return True  # the empty phrase is in everything, so any prefix of a stored phrase also matches

        this=phrase[0]
        rest=phrase[1:]

        test = False

        # the `if not test` are because if the phrase satisfies one of the
        # previous tests we don't need to bother searching more

        # allow search for ["!DETERMINER", "cat"]
        if this in self.categories: 
            test = self.categories[this].contains(rest)

        # the word is literally contained
        if not test and this in self.children:
            test = self.children[this].contains(rest)

        if not test:
            # check for the word being in a category class like "a" in
            # "!DETERMINER"
            cat = get_category(this)
            if cat in self.categories:
                test = self.categories[cat].contains(rest)
        return test

    def __str__(self):
        return '(%s,%s)' % (dict(self.children), dict(self.categories))
    def __repr__(self):
        return str(self)

if __name__ == '__main__':
    words = PhraseTrie()
    words.insert(["he", "had", "!DETERMINER", "nerve"])
    words.insert(["he", "had", "the", "evren"])
    words.insert(["she", "!VERB", "the", "nerve"])
    words.insert(["no","categories","here"])

    for phrase in ("he had the nerve",
                   "he had the evren",
                   "she had the nerve",
                   "no categories here",
                   "he didn't have the nerve",
                   "she had the nerve more"):
        print('%25s =>' % phrase, words.contains(phrase.split()))

Running python phrase_trie.py gives:

         he had the nerve => True
         he had the evren => True
        she had the nerve => True
       no categories here => True
 he didn't have the nerve => False
   she had the nerve more => False

A few points about the code:

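  • The purpose of defaultdict is to avoid having to check whether a subtree exists before calling insert; it is created and initialized automatically when needed.
  • If you are going to call get_category a lot, it may be worth building a reverse lookup dictionary for speed. (Or, better yet, memoize the calls to get_category, so that common words are looked up quickly but you don't waste memory storing words you never look up.)
  • The code assumes that each word belongs to only one category. (If not, the only changes are that get_category returns a list and the relevant part of PhraseTrie loops over that list.)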
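
The second point above could be sketched roughly as follows. This is only an illustration, not part of the original answer: WORD_TO_CATEGORY and get_category_cached are names invented here, and the sketch keeps the assumption that each word belongs to at most one category.

```python
from functools import lru_cache

CATEGORIES = {"!DETERMINER": {"a", "an", "the"},
              "!VERB": {"walked", "talked", "had"}}

# Reverse lookup built once: word -> name of the category containing it.
# Assumes, like the code above, that each word is in at most one category.
WORD_TO_CATEGORY = {word: name
                    for name, words in CATEGORIES.items()
                    for word in words}

def get_category(word):
    # A single dict lookup instead of a scan over every category.
    return WORD_TO_CATEGORY.get(word)

# Alternative: keep the original scan, but cache the result per word.
@lru_cache(maxsize=None)
def get_category_cached(word):
    for name, words in CATEGORIES.items():
        if word in words:
            return name
    return None
```

The reverse dictionary stores every categorized word up front, while the cached version only stores words that were actually looked up (including words that turned out to have no category). Passing a finite maxsize to lru_cache would bound that memory.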
answered 2012-04-21T05:29:19.663

First, make two dicts:

partOfSpeech = {'a':'!DETERMINER', 'an':'!DETERMINER', 'the':'!DETERMINER'}
words = {'!DETERMINER': set(['a', 'an', 'the'])}

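This should, at the very least, give you a speedup. Let's see how much of a speedup it gets you; if it's not enough, leave a comment and I'll try to do better (or others in the SO community may improve my solution or offer a better one).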
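
One way these dicts might be combined with the set of phrases the question asks about is sketched below. The phrases set, candidates and in_phrases are hypothetical names and sample data, not something given in the question: each incoming word is tried both literally and as its word class, and every resulting variant is tested against the set.

```python
from itertools import product

partOfSpeech = {'a': '!DETERMINER', 'an': '!DETERMINER', 'the': '!DETERMINER'}

# Hypothetical sample data: the stored phrases kept as a set of tuples.
phrases = {('he', 'had', '!DETERMINER', 'nerve'),
           ('no', 'categories', 'here')}

def candidates(phrase):
    """Return an iterator over every variant of phrase in which each
    categorized word is either kept as-is or replaced by its class name."""
    options = []
    for word in phrase:
        if word in partOfSpeech:
            options.append((word, partOfSpeech[word]))
        else:
            options.append((word,))
    return product(*options)

def in_phrases(phrase):
    # Each membership test is a hash lookup on a tuple.
    return any(candidate in phrases for candidate in candidates(phrase))
```

A phrase with k categorized words produces 2**k candidates, so this is probably only attractive when phrases are short or few of their words belong to a class.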

answered 2012-04-21T02:47:53.427

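If speed is important and you have to deal with word classes, then instead of storing a list of phrases and word classes you should consider storing them as a tree of words, so that the depth of a child is its position in the phrase. That way you can simply query each level, and the search narrows as you go deeper. As soon as a word can't be found, you know the phrase isn't listed.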

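This is a very naive implementation, meant only as an example. You could even implement it as nested dicts like this, but if your data is large and dynamic you should probably use a database: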

tree = {'he':
        {'had':
         {'the': {'nerve': {}, 'power': {}, 'gift': {}},
          'a': {'car': {}, 'bike': {}},
          'an': {'airplane': {}, 'awful': {'smell': {}}}
          }
         }
        }


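def find_phrase(phrase, root):
    # An exhausted phrase means every word matched on the way down.
    if not phrase:
        return True

    try:
        subtree = root[phrase[0]]
    except KeyError:
        # The word is not a child at this depth, so the phrase is not listed.
        return False
    return find_phrase(phrase[1:], subtree)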


assert find_phrase(['he', 'had', 'the', 'nerve'], tree) is True
assert find_phrase(['he', 'had', 'the', 'power'], tree) is True
assert find_phrase(['he', 'had', 'the', 'car'], tree) is False
assert find_phrase(['he', 'had', 'an', 'awful', 'smell'], tree) is True
assert find_phrase(['he', 'had', 'an', 'awful', 'wife'], tree) is False
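
Note that find_phrase as written also returns True for a prefix of a stored phrase, such as ['he', 'had']. If only complete phrases should match, one possible variation is to mark where a phrase ends. The following is a minimal sketch under that assumption; END, build_tree and find_exact are names made up for this illustration.

```python
END = object()  # hypothetical sentinel key meaning "a phrase ends at this node"

def build_tree(phrases):
    """Build the nested-dict tree from an iterable of word lists."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase:
            # Descend into the child for this word, creating it if needed.
            node = node.setdefault(word, {})
        node[END] = True
    return root

def find_exact(phrase, root):
    """Return True only if the whole phrase was stored, not just a prefix."""
    node = root
    for word in phrase:
        if word not in node:
            return False
        node = node[word]
    return END in node
```

For example, after tree = build_tree([['he', 'had', 'the', 'nerve']]), find_exact(['he', 'had', 'the', 'nerve'], tree) is True while find_exact(['he', 'had'], tree) is False. The loop also avoids the list slicing and recursion of the version above.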
answered 2012-04-21T03:04:06.973