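Here is a possible solution. I'm using a regex because it makes it easy to get rid of punctuation.
I'm also using collections.Counter, which may improve efficiency if your string has a lot of repeated words, since each distinct word is checked only once.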
tag_list = ["art", "paint"]
s = "This is such an nice artwork, very nice artwork. This is the best painting I've ever seen"

from collections import Counter
import re

# \w+ matches runs of word characters, so punctuation is dropped
words = re.findall(r'\w+', s)
# Map each distinct word to its number of occurrences
dicto = Counter(words)

def found(s, tag):
    return s.startswith(tag)

words_found = []
for tag in tag_list:
    for k, v in dicto.items():  # use .iteritems() on Python 2
        if found(k, tag):
            words_found.append((k, v))
The last part can be done with a list comprehension:
words_found = [(k, v) for tag in tag_list for k, v in dicto.items() if found(k, tag)]
Result:
>>> words_found
[('artwork', 2), ('painting', 1)]
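For reference, the steps above could be folded into one self-contained function. This is only a sketch for Python 3.7+: the name count_tagged_words is hypothetical, and it assumes you are happy to pass all tags to str.startswith at once as a tuple, which means a word matching several tags is reported only once and results come back in order of first appearance in the text rather than in tag order.

```python
import re
from collections import Counter


def count_tagged_words(text, tags):
    """Return (word, count) pairs for words starting with any of the tags.

    Hypothetical helper name; not part of any library.
    """
    # \w+ keeps only word characters, so punctuation is discarded
    # (note that "I've" is split into "I" and "ve").
    counts = Counter(re.findall(r"\w+", text))
    # str.startswith accepts a tuple of prefixes and checks them all.
    prefixes = tuple(tags)
    return [(word, n) for word, n in counts.items() if word.startswith(prefixes)]


s = "This is such an nice artwork, very nice artwork. This is the best painting I've ever seen"
result = count_tagged_words(s, ["art", "paint"])
print(result)
```

Matching here is case-sensitive, as in the original code; lowercasing both the text and the tags first would be one way to relax that.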