python - Search a list based on values in another list

Question

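I have a list of names that I am trying to pull out of a list of strings. I keep getting false positives such as partial matches. One more caveat: I would like it to also grab the last name where applicable.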

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']

desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

I have tried this code:

[i for e in names for i in target if i.startswith(e)]

This predictably returns 'Chris Smith', 'Christmas is here' and 'Kimberly'.

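How would I best approach this? Should I use a regex, or can it be done with a list comprehension? Performance may be an issue, since the real list of names is about 880,000 entries long.

(Python 2.7)

EDIT: I have realized that my criteria in this example are unrealistic, because wanting to include Kimberly while excluding 'Christmas is here' is an impossible requirement. To mitigate this, I have found a more complete list of names that includes variations (both Kim and Kimberly are in it).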
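With a name list that already contains the variants, whole-word matching is enough, and a set lookup should stay fast even with around 880,000 names. A minimal sketch under that assumption; the helper name `find_named` and the extra 'Kimberly' entry are illustrative and not from the original post. Words are split on whitespace only, so punctuation attached to a name would need extra handling.

```python
def find_named(names, target):
    """Return targets containing at least one word that is exactly a known name."""
    # Lowercase once so lookups are case-insensitive; set membership is O(1) on average.
    known = set(name.lower() for name in names)
    return [t for t in target
            if any(word in known for word in t.lower().split())]


# Assumption: the fuller name list includes variants, so 'Kimberly' is listed explicitly.
names = ['Chris', 'Jack', 'Kim', 'Kimberly']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

matches = find_named(names, target)
print(matches)  # ['Chris Smith', 'Kimberly', 'CHRIS']
```

The whole target string is returned, so a last name that follows the first name comes along with it.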

score 1 · Accepted Answer

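A complete guess (again), since I can't see how you could exclude 'Christmas is here' under any reasonable criteria:

This will match any target that contains a word starting with one of the names...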

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']

import re
matches = [targ for targ in target if any(re.search(r'\b{}'.format(name), targ, re.I) for name in names)]
print matches
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']

If you change the pattern to r'\b{}\b', then you'll get ['Chris Smith', 'CHRIS'], which means you lose 'Kimberly'...
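The whole-word variant can be sketched with one precompiled pattern instead of one `re.search` call per name. This is only an illustration: the variable names `alternation` and `whole_word` are made up here, and how well a single alternation scales to a very large name list has not been measured.

```python
import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

# re.escape guards against names that contain regex metacharacters (e.g. "St.").
alternation = '|'.join(re.escape(name) for name in names)
# \b on both sides requires the name to be a whole word; re.I ignores case.
whole_word = re.compile(r'\b(?:{})\b'.format(alternation), re.I)

matches = [t for t in target if whole_word.search(t)]
print(matches)  # ['Chris Smith', 'CHRIS']
```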

score 0 · Accepted Answer

Would this work?

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']

res = []
for tof in target:
    for name in names:
        if tof.lower().startswith(name.lower()):
            res.append(tof)
            break
print res
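The same case-insensitive prefix check can be written more compactly, because `str.startswith` accepts a tuple of prefixes. A short sketch with the same sample data; `prefixes` is just an illustrative name:

```python
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

# Passing a tuple to startswith replaces the explicit inner loop over names.
prefixes = tuple(name.lower() for name in names)
res = [t for t in target if t.lower().startswith(prefixes)]
print(res)  # ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
```

Like the loop version, this still matches 'Christmas is here', since it only checks prefixes.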

score 0 · Accepted Answer

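There is no sure way to exclude the match 'Christmas is here', because the system cannot tell whether Christmas is a name or something else. Instead, if you want to speed the process up, you can try this O(n) approach. I have not timed it, but I expect it to be faster than your solution or the other suggested ones.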

from difflib import SequenceMatcher
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']
def foo(names, target):
    # Create a generator to search the names
    def bar(names, target):
        # For each target...
        for t in target:
            # ...take the first matching block, a triple (i, j, n)
            # meaning that a[i:i+n] == b[j:j+n]
            match = SequenceMatcher(None, names, t).get_matching_blocks()[0]
            # match.size == 0 means there is no match, and
            # match.b > 0 means the match does not start at the beginning of the target
            if match.size > 0 and match.b == 0:
                # Yield the matching target
                yield t
    # Join the names to create a single string
    names = ','.join(names)
    # Call the generator and return its results as a list
    return list(bar(names, target))

>>> foo(names, target)
['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
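Speed claims like the one above are easiest to settle by measuring. A rough sketch using the standard `timeit` module; the two functions `by_startswith` and `by_nested_loops` are hypothetical stand-ins for whichever candidates you want to compare, and the numbers depend heavily on the machine and on the real data size.

```python
import timeit

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']


def by_startswith(names, target):
    # Prefix test against all names at once via a tuple of prefixes.
    prefixes = tuple(n.lower() for n in names)
    return [t for t in target if t.lower().startswith(prefixes)]


def by_nested_loops(names, target):
    # Explicit loops, stopping at the first name that matches a target.
    res = []
    for t in target:
        for n in names:
            if t.lower().startswith(n.lower()):
                res.append(t)
                break
    return res


for func in (by_startswith, by_nested_loops):
    # timeit calls the lambda 1000 times and returns the total elapsed seconds.
    seconds = timeit.timeit(lambda: func(names, target), number=1000)
    print('{}: {:.4f}s for 1000 runs'.format(func.__name__, seconds))
```

With only five targets the difference is negligible; a meaningful comparison needs a sample close to the real 880,000-name list.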

score 0 · Accepted Answer

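From your description, the rules I gather are:

Ignore case;
The target word must start with the keyword.
If the target word is not exactly the keyword, then the target word must be the only word in the sentence.

Try this: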

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']
desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

actual_output = []
for key in names:
    for words in target:
        for word in words.split():
            if key.lower() == word.lower():
                actual_output.append(words)
            elif key.lower() == word.lower()[:len(key)] and len(words.split()) == 1:
                actual_output.append(words)
print(actual_output)

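This outputs the same items as your desired output, though in a different order (by the way, are you sure this is really what you want?). Don't be put off by the three nested loops. If you have N names and M sentences, and the number of words in each sentence is bounded, then this code's O(MN) complexity is about as good as it gets for this approach.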
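The three rules can also be applied in a single pass over the targets, which keeps the original order and never appends the same sentence twice. A sketch only; `match_rules` is a hypothetical name, and the set of keys is used for the exact-word rule while the tuple is used for the prefix rule.

```python
def match_rules(names, target):
    keys = set(name.lower() for name in names)
    prefixes = tuple(keys)
    result = []
    for sentence in target:
        words = sentence.lower().split()
        # Rule: some word is exactly a keyword (case-insensitive).
        exact = any(word in keys for word in words)
        # Rule: a prefix match only counts when the sentence is a single word.
        lone_prefix = len(words) == 1 and words[0].startswith(prefixes)
        if exact or lone_prefix:
            result.append(sentence)
    return result


names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

actual_output = match_rules(names, target)
print(actual_output)  # ['Chris Smith', 'Kimberly', 'CHRIS']
```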
