python - Optimizing lookup and matching code in Python

Question

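I have code that takes two files as input: (1) a dictionary/lexicon and (2) a text file (one sentence per line).

The first part of my code reads the dictionary as tuples, so it outputs something like: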

('mthy3lkw', 'weakBelief', 'U')

('mthy3lkm', 'firmBelief', 'B')

('mthy3lh', 'notBelief', 'A')

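The second part of the code searches each sentence in the text file for the words at position 0 of these tuples, then prints the sentence, the search word, and its type.

So given the sentence mthy3lkw ana mesh 3arif, the expected output is:

["mthy3lkw ana mesh 3arif", ' mthy3lkw ', 'weakBelief', 'U'] assuming the highlighted word is found in the dictionary.

The second part of my code, the matching part, is too slow. How can I make it faster?

Here is my code: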

findings = [] 
for sentence in data:  # I open the sentences file with .readlines()
    for word in tuples:  # similar to the ones mentioned above
        p1 = re.compile('\\b%s\\b'%word[0])  # get the first word in every tuple
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])

print findings

score 1 · Accepted Answer

Convert your list of tuples into a trie, and use that for searching.
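As a rough illustration of that idea, here is a minimal sketch of a character trie built from nested dicts. It is written for Python 3, the helper names `build_trie` and `trie_lookup` are made up for this example, and it assumes words can be separated with a plain `split()` on whitespace, which may not hold if the sentences contain punctuation.

```python
def build_trie(tuples):
    """Build a character trie from (word, type, code) tuples.

    Each node is a dict mapping a character to a child node. The full
    entry tuple is stored under the key None at the end of a word.
    """
    root = {}
    for entry in tuples:
        node = root
        for ch in entry[0]:
            node = node.setdefault(ch, {})
        node[None] = entry  # None marks the end of a complete dictionary word
    return root


def trie_lookup(root, word):
    """Return the entry stored for word, or None if word is not in the trie."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(None)


# Sample data taken from the question
tuples = [
    ('mthy3lkw', 'weakBelief', 'U'),
    ('mthy3lkm', 'firmBelief', 'B'),
    ('mthy3lh', 'notBelief', 'A'),
]
data = ["mthy3lkw ana mesh 3arif"]

trie = build_trie(tuples)
findings = []
for sentence in data:
    for word in sentence.split():  # assumption: whitespace-separated words
        entry = trie_lookup(trie, word)
        if entry is not None:
            findings.append([sentence, word, entry[1], entry[2]])
```

Each lookup costs time proportional to the length of the word rather than to the number of dictionary entries. For exact whole-word matches a plain dict gives the same benefit with less code; a trie mainly pays off when prefix matching is also needed.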

Answered 2012-10-07T03:20:19.070

score 1 · Accepted Answer

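Build a dict lookup structure so that you can quickly find the right entry among your tuples. Then you can restructure your loop: instead of going through the whole dictionary for every sentence and trying to match each entry, go through each word in the sentence and look it up in the dictionary: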

import re

# Create a lookup structure for words
word_dictionary = dict((entry[0], entry) for entry in tuples)

findings = []
word_re = re.compile(r'\b\S+\b') # only need to create the regexp once
for sentence in data:
    for word in word_re.findall(sentence): # Check every word in the sentence
        if word in word_dictionary: # A match was found
            entry = word_dictionary[word]
            findings.append([sentence, word, entry[1], entry[2]])
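To try the approach above on the sample data from the question, it can be wrapped in a function. The following is a sketch for Python 3; the function name `find_matches` and the optional `belief_type` filter (mirroring the `"firmBelief"` check in the question's loop) are assumptions added for illustration, as is stripping the trailing newline that `.readlines()` leaves on each sentence.

```python
import re

WORD_RE = re.compile(r'\b\S+\b')  # compiled once, reused for every sentence


def find_matches(sentences, tuples, belief_type=None):
    """Return [sentence, word, type, code] for every dictionary word found.

    If belief_type is given, only entries of that type are reported.
    """
    lookup = {entry[0]: entry for entry in tuples}  # word -> full tuple
    findings = []
    for sentence in sentences:
        sentence = sentence.rstrip('\n')  # assumption: lines come from readlines()
        for word in WORD_RE.findall(sentence):
            entry = lookup.get(word)
            if entry is None:
                continue
            if belief_type is None or entry[1] == belief_type:
                findings.append([sentence, word, entry[1], entry[2]])
    return findings


# Sample data taken from the question
tuples = [
    ('mthy3lkw', 'weakBelief', 'U'),
    ('mthy3lkm', 'firmBelief', 'B'),
    ('mthy3lh', 'notBelief', 'A'),
]
data = ["mthy3lkw ana mesh 3arif\n"]

all_findings = find_matches(data, tuples)
firm_only = find_matches(data, tuples, belief_type="firmBelief")
```

With this restructuring, the work per sentence depends on the number of words in the sentence, not on the size of the dictionary, because each dict lookup takes roughly constant time on average.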
