python - Substring search for multi-word strings - Python

Question

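I want to check a set of sentences to see whether any of a list of seed words occurs in them. But I want to avoid using `for seed in line`, because that would report the seed word `ring` as occurring in a document that contains the word `bring`.

I also want to check whether multi-word expressions (MWEs) such as `word with spaces` occur in a document.

I have tried the following, but it is too slow. Is there a faster way?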

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

docs_seed = []
for d in docs:
  toAdd = False
  for s in seed:
    # Multi-word seeds are tested as plain substrings.
    if " " in s:
      if s in d:
        toAdd = True
    # Single-word seeds must equal one of the space-separated tokens.
    if s in d.split(" "):
      toAdd = True
    if toAdd == True:
      docs_seed.append((s,d))
      break
print(docs_seed)

The desired output should look like this:

[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]

score 3 · Accepted Answer

Consider using a regular expression:

import re

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)

`\b` matches at the beginning or end of a "word" (a sequence of word characters).
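To make the effect of `\b` concrete, here is a minimal sketch using a single-seed pattern. The name `ring` and the second test sentence are made up for illustration; the first sentence is taken from the question's `docs`.

```python
import re

# Hypothetical one-seed pattern, used only to illustrate \b.
ring = re.compile(r'\bring\b')

# "ring" sits inside "bring", so there is no word boundary before the "r".
print(bool(ring.search('i forgot to bring my telephone')))  # False

# Here "ring" is a whole word, delimited by spaces on both sides.
print(bool(ring.search('the ring is on the table')))  # True
```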

Example:

>>> for line in docs:
...     print(pattern.findall(line))
... 
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
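If the goal is the list of `(seed, doc)` pairs from the question rather than every match, one possible sketch is to keep only the first match per document. Note the assumption: `pattern.search` returns the leftmost match in the document, which is not necessarily the first seed in list order as in the original loop. For the sample data the two happen to agree.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

# One alternation over all escaped seeds, anchored on word boundaries.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')

docs_seed = []
for d in docs:
    m = pattern.search(d)  # leftmost match in d, or None if no seed occurs
    if m:
        docs_seed.append((m.group(0), d))

print(docs_seed)
```

Because alternatives are tried in list order at each position, `'bar'` wins over `'bar bar'` here. Sorting `seed` by length, longest first, before joining would prefer the longer expression if that matters.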

score 0 · Accepted Answer

This should work and be faster than your current approach:

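docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)  # only the first occurrence of s is examined
        end = pos + len(s)
        # Test the string edges before indexing, so that a hit at the very
        # start or end of d neither wraps around nor raises IndexError.
        if (pos != -1 and (pos == 0 or d[pos - 1] == " ")
                and (end == len(d) or d[end] == " ")):
            docs_seed.append((s, d))
            break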

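`find` gives us the position of the seed value in the doc (or -1 if it is not found), and we then check whether the characters before and after the value are spaces (or whether the string ends right after the substring). This also fixes a bug in your original code, namely that multi-word expressions were not required to start or end on a word boundary: your original code would match the seed "words with spaces" against input such as "swords with spaces".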
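One limitation of a single `find` call is that it inspects only the first occurrence, so a doc like "swords and words" would be rejected for the seed "words". A possible variant, sketched below with a hypothetical helper named `contains_whole`, keeps searching from the next position until a space-delimited occurrence is found. Like the code above, it treats only spaces and the string edges as delimiters, so punctuation next to a word is not handled the way `\b` would handle it.

```python
def contains_whole(seed_text, doc):
    """Return True if seed_text occurs in doc delimited by spaces or string edges."""
    pos = doc.find(seed_text)
    while pos != -1:
        end = pos + len(seed_text)
        starts_ok = pos == 0 or doc[pos - 1] == " "
        ends_ok = end == len(doc) or doc[end] == " "
        if starts_ok and ends_ok:
            return True
        # Try the next occurrence, if any.
        pos = doc.find(seed_text, pos + 1)
    return False


print(contains_whole("words", "swords and words"))                # True
print(contains_whole("ring", "i forgot to bring my telephone"))   # False
print(contains_whole("bar", "bar"))                               # True
```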
