python-3.x - Occurrences of a list of words in a string in O(n)

Question

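I have already seen this answer to a similar question: https://stackoverflow.com/a/44311921/5881884

There, the ahocorasick algorithm is used to show whether each word in a list exists in a string, in O(n) time. But I want to get the frequency of each word of the list in the string.

For example, if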

my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

I want the result:

[2, 3, 1, 0]

I did not find an exact example in the documentation. Any idea how to do this?

Other O(n) solutions that do not use ahocorasick would also be appreciated.

score 1 · Accepted Answer

Implementation:

Here is an Aho-Corasick frequency counter:

import ahocorasick

def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies

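(For your example, you would call ac_frequency(my_list, my_string) to get the list of counts.)

For medium to large inputs, this will be much faster than the other approaches.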

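Notes:

On real data, this approach may produce different results than the other solutions posted, because Aho-Corasick finds all occurrences of the target words, including occurrences inside longer words (substrings).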
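
To make the difference concrete, here is a small stdlib-only sketch. It does not use ahocorasick; count_overlapping is a made-up helper that mimics "count every occurrence, substrings included", and the sample sentence is invented for illustration.

```python
def count_overlapping(needle, haystack):
    """Count every occurrence of needle in haystack, overlaps included."""
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are counted too
        start = haystack.find(needle, start + 1)
    return count

sample = "some text in context, yes yesterday"
words = ["some", "text", "yes", "not"]

# Substring semantics: "text" also matches inside "context", "yes" inside "yesterday"
substring_counts = [count_overlapping(w, sample) for w in words]

# Whole-word semantics (whitespace split only, so "context," keeps its comma)
tokens = sample.split()
whole_word_counts = [tokens.count(w) for w in words]

print(substring_counts)   # [1, 2, 2, 0]
print(whole_word_counts)  # [1, 1, 1, 0]
```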

If you only want to find whole words, you can call searcher.add_word with versions of the original strings padded with whitespace/punctuation:

    ...
    padding_start = [" ", "\n", "\t"]
    padding_end = [" ", ".", ";", ",", "-", "–", "—", "?", "!", "\n"]
    for i, needle in enumerate(needles):
        for s, e in [(s,e) for s in padding_start for e in padding_end]:
            searcher.add_word(s + needle + e, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(" " + haystack + " "):
    ...
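
If you would rather not enumerate every padding combination, a different technique is a single regular expression with word boundaries. This is only a sketch: it swaps Aho-Corasick for the stdlib re module, whole_word_frequency is a hypothetical helper name, it assumes the needles are plain single words (\b behaves differently for needles that start or end with punctuation), and I make no O(n) claim for it, since Python's regex engine may try each alternative at each position.

```python
import re
from collections import Counter

def whole_word_frequency(needles, haystack):
    """Count whole-word occurrences of each needle (hypothetical helper)."""
    # One alternation of all escaped needles, wrapped in word boundaries
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, needles)) + r")\b")
    # The group is non-capturing, so findall returns the full matched words
    found = Counter(pattern.findall(haystack))
    return [found[needle] for needle in needles]

counts = whole_word_frequency(
    ["some", "text", "yes", "not"],
    "some text, yes; context text. Some",
)
print(counts)  # [1, 2, 1, 0]
```

Matching is case-sensitive here, so "Some" is not counted, and "context" does not count as "text".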

score 1 · Accepted Answer

Counter from the collections module may be useful to you:

from collections import Counter

my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

counter = Counter(my_string.split(' '))
[counter.get(item, 0) for item in my_list]

# out: [2, 3, 1, 0]
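
Note that split(' ') leaves punctuation attached to words ("text," is not "text"). If your real input has punctuation, one possible variation is to tokenize with a regex first. This is a sketch under that assumption; word_counts is a hypothetical helper name and the sample sentence is made up.

```python
import re
from collections import Counter

def word_counts(words, text):
    """Count whole-word occurrences of each word (hypothetical helper)."""
    # \w+ grabs runs of letters, digits and underscores, dropping punctuation
    counter = Counter(re.findall(r"\w+", text))
    # A Counter returns 0 for missing keys, so no .get() is needed
    return [counter[word] for word in words]

result = word_counts(["some", "text", "yes", "not"], "some text, yes: text text. some")
print(result)  # [2, 3, 1, 0]
```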

score 0 · Accepted Answer

You can use a list comprehension to count how many times each item of the list occurs in my_string:

[my_string.split().count(i) for i in my_list]
[2, 3, 1, 0]

score 0 · Accepted Answer

You can use a dictionary to count the occurrences of the words you care about:

counts = dict.fromkeys(my_list, 0) # initialize the counting dict with all counts at zero

for word in my_string.split():
    if word in counts:     # this test filters out any unwanted words
        counts[word] += 1  # increment the count

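The counts dict will hold the count for each word. If you really need a list of counts in the same order as the original keyword list (and the dict won't do), you can add a final step after the loop finishes: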

results = [counts[word] for word in my_list]
