我想检查一组句子,看看句子中是否出现了一些种子词。但我想避免使用for seed in line
,因为那样会说种子词ring
会出现在带有这个词的文档中bring
。
我还想检查word with spaces
文档中是否出现多字表达式(MWE)。
我已经尝试过了,但这太慢了,有没有更快的方法呢?
seed = ['words with spaces', 'words', 'foo', 'bar',
'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
'another sentence at the foo bar is here',
'then a bar bar black sheep,
'but i dont want this sentence because there is just nothing that matches my list',
'i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too']
docs_seed = []
for d in docs:
toAdd = False
for s in seeds:
if " " in s:
if s in d:
toAdd = True
if s in d.split(" "):
toAdd = True
if toAdd == True:
docs_seed.append((s,d))
break
print docs_seed
所需的输出应该是这样的:
[('words with spaces','these are words with spaces but the drinks are the bar is also good')
('foo','another sentence at the foo bar is here'),
('bar', 'then a bar bar black sheep')]