python - Optimizing the search for strings in a phrase. Not sure which Python structures I need to use

Question

I have a file containing several words, each followed by an integer (its weight):

home 10
house 15
village 20
city 50
big 15
small 5
pretty 10
...

and so on.

I need to weight some phrases: each word of a phrase that matches a word in the file above contributes that word's weight.

The phrase "I live in a house in a pretty city" has the weight 0 + 0 + 0 + 0 + 15 + 0 + 0 + 10 + 50 = 75

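This is my first attempt with Python, although I have good experience with C. My difficulty is that I can't reach the performance I need, because I'm not using the right Python structures in the right way. I can weight the phrases correctly, but only by using several "for" loops and a function call, just as I would in C.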

import re
import sys

def weight_word(word, words_file):
    fp = open(words_file)
    weight = 0
    line = fp.readline()
    while line:
        # One method I discovered to parse the line where there's
        # a word, a tab and its weight
        left, tab_char, right = line.partition('\t')
        if re.match(re.escape(word), left, re.I):
            # The previous re.match didn't guarantee an exact match so I need
            # even to control their lengths...
            if len(word) == len(left):
                weight = right
                break
        line = fp.readline()
    fp.close()
    return float(weight)

def main():
    my_dict = {"dont parse me":"500", "phrase":"I live in a house in a small city", "dont parse me again":"560"}
    my_phrase = my_dict["phrase"].split()
    phrase_weight = 0
    for word in iter(my_phrase):
        phrase_weight = phrase_weight + weight_word(word, sys.argv[1])
    print("The weight of phrase is:" + str(phrase_weight))

I have just discovered something that could be useful in my case, but I don't know how to use it properly:

def word_and_weight(fp):
    global words_weight
    words_weight = {}
    for line in fp:
        word, weight = line.split('\t')
        words_weight[word] = int(weight)

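How can I avoid the earlier for loop over each word of my phrase and the call to my function, and how do I use this last kind of "array" indexed by word instead? I'm a bit confused now.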

score 2 · Accepted Answer

Your mapping is a dictionary:

>>> d = {'foo': 32, 'bar': 64}
>>> d['bar']
64

To get the weight of a sentence, you can add up the weights of the individual words:

weight = 0

for word in sentence.split():
    weight += weights[word]

Or use a regular expression to find the words:

for match in re.finditer(r'(\w+)', sentence):
    word = match.group(1)  # finditer yields match objects, not strings
    ...
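As a hedged sketch of the regular-expression variant: splitting on whitespace leaves punctuation attached to words (so "city." would not match "city"), whereas `re.findall(r"\w+", ...)` returns only the runs of word characters. The `weights` dictionary and the `sentence_weight` helper below are illustrative names, and the lowercasing assumes the weights file stores lowercase words.

```python
import re

# Illustrative weights, copied from the sample file in the question.
weights = {"home": 10, "house": 15, "village": 20, "city": 50,
           "big": 15, "small": 5, "pretty": 10}

def sentence_weight(sentence, weights):
    """Sum the weights of every word in the sentence (hypothetical helper)."""
    # findall returns the matched strings directly; lower() makes the
    # lookup case-insensitive under the lowercase-keys assumption.
    words = re.findall(r"\w+", sentence.lower())
    return sum(weights.get(word, 0) for word in words)

total = sentence_weight("I live in a small house, in a big City.", weights)
print(total)  # 5 + 15 + 15 + 50 = 85
```

With a plain `split()`, "house," and "City." would both miss their dictionary entries, so whether the regex is worth it depends on how clean the input phrases are.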

You can make it more concise with sum and a generator expression:

weight = sum(weights[word] for word in sentence.split())

If some words are not in your dictionary, you can use the second argument of dict.get() to return 0 when a word is missing:

weight = sum(weights.get(word, 0) for word in sentence.split())
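To make the lookup concrete, here is a small worked sketch using the weights from the question's sample file. The sentence is chosen for illustration so that its per-word weights reproduce the 0 + 0 + 0 + 0 + 15 + 0 + 0 + 10 + 50 = 75 arithmetic from the question.

```python
# Weights copied from the sample file in the question.
weights = {"home": 10, "house": 15, "village": 20, "city": 50,
           "big": 15, "small": 5, "pretty": 10}

sentence = "I live in a house in a pretty city"  # illustrative sentence

# Words missing from the dictionary contribute 0 instead of raising KeyError.
per_word = [weights.get(word, 0) for word in sentence.split()]
weight = sum(per_word)

print(per_word)  # [0, 0, 0, 0, 15, 0, 0, 10, 50]
print(weight)    # 75
```

Each `weights.get(...)` call is a single hash lookup, so the file is never rescanned per word.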

score 1 · Accepted Answer

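Your first-pass algorithm opens and parses the words file for every word in your phrase, which is obviously bad in any language. Your word_and_weight function is not so bad, but you don't need a global variable. Assuming you have my_dict set up the way it is for some reason, and you don't mind the lack of input guarding on your weights file, I would do it like this: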

import fileinput

def parse_word_weights():
    word_weights = {}
    for line in fileinput.input():
        word, weight = line.strip().split('\t')
        word_weights[word] = int(weight)
    return word_weights

def main():
    word_weights = parse_word_weights()

    my_dict = {"dont parse me":"500", "phrase":"I live in a house in a small city", "dont parse me again":"560"}
    my_phrase = my_dict["phrase"].split()
    phrase_weight = sum((word_weights.get(word, 0) for word in my_phrase))
    print("The weight of phrase is:" + str(phrase_weight))

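This uses the fileinput standard library module to standardize file input. It is far from the only option, but it is very convenient. The sum call runs over a generator expression, which lazily evaluates the word lookup for each word in turn.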
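Since fileinput is only one option, here is a minimal sketch of an alternative that opens a given path directly and adds the input guarding the code above leaves out. The name `load_word_weights` and the policy of silently skipping malformed lines are assumptions made for illustration, not part of the original answer.

```python
import os
import tempfile

def load_word_weights(path):
    """Read 'word<TAB>weight' lines into a dict (hypothetical helper)."""
    word_weights = {}
    with open(path) as fp:  # the with block closes the file, even on errors
        for line in fp:
            word, sep, weight = line.strip().partition("\t")
            if not sep:
                continue  # assumption: skip blank lines and lines without a tab
            try:
                word_weights[word] = int(weight)
            except ValueError:
                continue  # assumption: skip lines whose weight is not an integer
    return word_weights

# Demo on a throwaway file containing two valid and three malformed lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("house\t15\ncity\t50\n\nbroken line\nsmall\tfive\n")
    demo_path = tmp.name

demo_weights = load_word_weights(demo_path)
os.remove(demo_path)
print(demo_weights)
```

Either way, the file is parsed exactly once and every later lookup goes through the dictionary.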

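There is nothing wrong with an explicit for loop that adds up the phrase weight, but the sum call is more idiomatic. If you do stick with a for loop, you don't need the iter call on my_phrase: the output of split can be iterated directly.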
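For comparison, a short sketch of both forms side by side. The small `word_weights` dictionary here is a made-up subset of the sample weights.

```python
# Illustrative subset of the weights from the question's sample file.
word_weights = {"house": 15, "small": 5, "city": 50}
my_phrase = "I live in a house in a small city".split()

# Explicit loop: the list returned by split() is iterated directly, no iter().
loop_weight = 0
for word in my_phrase:
    loop_weight += word_weights.get(word, 0)

# Equivalent sum over a generator expression.
sum_weight = sum(word_weights.get(word, 0) for word in my_phrase)

print(loop_weight)  # 70
print(sum_weight)   # 70
```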
