python - Python: check whether any word in a list of words matches any pattern in a list of regular expression patterns

Question

I have a long list of words and regular expression patterns in a .txt file, which I read in like this:

with open(fileName, "r") as f1:
    pattern_list = f1.read().split('\n')

To illustrate, the first seven look like this:

print pattern_list[:7] 
# ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

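I want to know whenever a word in an input string matches any of the words/patterns in pattern_list. The code below kind of works, but I see two problems:

First, it seems pretty inefficient to re.compile() every item in my pattern_list each time I check a new string_input. But when I tried storing the re.compile(raw_str) objects in a list (so that I could reuse the already-compiled regexes for something like if w in regex_compile_list:), it didn't work properly.
Second, it sometimes doesn't work the way I expect. Notice how
- abuse* matches abusive
- abusi* matches abused and abuse
- ache* matches aching

What am I doing wrong, and how can I be more efficient? Thanks in advance for your patience with a noob, and thanks for any insights!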

string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."
for raw_str in pattern_list:
    pat = re.compile(raw_str)
    for w in string_input.split():
        if pat.match(w):
            print "matched:", raw_str, "with:", w
#matched: abandon* with: abandoned
#matched: abandon* with: abandon
#matched: abuse* with: abused
#matched: abuse* with: abusive,
#matched: abuse* with: abuse
#matched: abusi* with: abused
#matched: abusi* with: abusive,
#matched: abusi* with: abuse
#matched: ache* with: aching
#matched: aching with: aching
#matched: advers* with: adversarial,
#matched: afraid with: afraid
#matched: aggress* with: aggressive
#matched: aggress* with: aggression.

score 10 · Accepted Answer

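For matching shell-style wildcards, you could (ab)use the fnmatch module.

Since fnmatch is primarily designed for filename comparison, whether the test is case-sensitive depends on your operating system. So you have to normalize both the text and the patterns (here, I use lower() for that purpose).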

>>> import fnmatch

>>> pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']
>>> string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."


>>> for pattern in pattern_list:
...     l = fnmatch.filter(string_input.lower().split(), pattern.lower())
...     if l:
...         print pattern, "match", l

Producing:

abandon* match ['abandoned', 'abandon']
abuse* match ['abused', 'abuse']
abusi* match ['abusive,']
aching match ['aching']
advers* match ['adversarial,']
afraid match ['afraid']
aggress* match ['aggressive', 'aggression.']
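
To avoid converting each pattern again for every new input, the shell-style patterns could be translated once with fnmatch.translate() and compiled with re.compile(). This is a minimal sketch, assuming Python 3. The helper names compile_patterns and find_matches are made up for illustration, and stripping surrounding punctuation is an extra step that is not part of the answer above:

```python
import fnmatch
import re
import string

def compile_patterns(pattern_list):
    """Translate each shell-style pattern to a regex and compile it once."""
    # fnmatch.translate() only converts the syntax; it does no case folding,
    # so the patterns are lower-cased here. Empty lines are skipped.
    return [(p, re.compile(fnmatch.translate(p.lower()))) for p in pattern_list if p]

def find_matches(compiled, text):
    """Return (pattern, word) pairs for every word that matches a pattern."""
    # Lower-case each word and strip surrounding punctuation such as ',' or '.'
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    return [(p, w) for p, rx in compiled for w in words if rx.match(w)]

pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']
compiled = compile_patterns(pattern_list)  # reuse this for every new input
for pattern, word in find_matches(compiled, "They abused it; abusive, aching."):
    print(pattern, "match", word)
```

Because the regex produced by fnmatch.translate() is anchored at the end, rx.match() only succeeds when the whole word fits the pattern.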

score 2 · Accepted Answer

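abandon* will match abandonnnnnnnnnnnnnnnnnnnnnnn, not abandonasfdsafdasf: in regex syntax, * repeats the preceding character (here, the n) zero or more times. You want

abandon.*

instead.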
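
The unexpected matches in the question follow from this: pat.match() anchors only at the start of the word, and e* or i* may match zero characters, so abusi* succeeds on abused by matching just abus. Below is a small sketch of that behavior and of one possible fix, assuming Python 3.4+ (for re.fullmatch()) and assuming every * in the list is meant as a trailing wildcard. wildcard_to_regex is a hypothetical helper, not something from the question:

```python
import re

# As a regex, 'abusi*' means "abus" followed by zero or more "i".
print(re.match('abusi*', 'abused').group())  # only the matched prefix
print(re.fullmatch('abusi*', 'abused'))      # the whole word does not match

def wildcard_to_regex(pattern):
    """Turn a trailing-'*' wildcard into a compiled regex (hypothetical helper)."""
    if pattern.endswith('*'):
        # Escape the literal stem, then allow any further word characters.
        return re.compile(re.escape(pattern[:-1]) + r'\w*')
    return re.compile(re.escape(pattern))

# Compile once, then reuse the list for every new input string.
regex_list = [wildcard_to_regex(p) for p in ['abuse*', 'abusi*', 'aching']]
words = ['abused', 'abusive', 'abuse', 'aching']
hits = [(rx.pattern, w) for rx in regex_list for w in words if rx.fullmatch(w)]
print(hits)
```

Note that w in regex_list tests equality against the compiled objects rather than running a match, which is probably why the membership test in the question did not work; any(rx.fullmatch(w) for rx in regex_list) would be the matching equivalent.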

score 2 · Accepted Answer

If the *s are all at the end of the strings, you might want to do something like this:

for pat in pattern_list:
    for w in words:
        # A trailing '*' means "prefix match"; otherwise require an exact match.
        if (pat.endswith('*') and w.startswith(pat[:-1])) or w == pat:
            pass  # Do stuff
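
Since str.startswith() also accepts a tuple of prefixes, the same test can be done without looping over the patterns for each word. This is a sketch under the same assumption (a * appears only at the end of a pattern); the names prefixes, exact and matches_any are illustrative:

```python
pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

# Split the patterns once: stems for the wildcard patterns, a set for exact words.
prefixes = tuple(p[:-1] for p in pattern_list if p.endswith('*'))
exact = {p for p in pattern_list if p and not p.endswith('*')}

def matches_any(word):
    """True if the word equals an exact pattern or starts with a wildcard stem."""
    return word in exact or word.startswith(prefixes)

words = "They are aching to abandon the abuse".lower().split()
print([w for w in words if matches_any(w)])
```

This version only reports whether a word matched, not which pattern it matched, so it fits best when a yes/no answer per word is enough.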

score 1 · Accepted Answer

If the patterns use regex syntax:

import re

# Join all patterns into one alternation bounded by word boundaries.
m = re.search(r"\b({})\b".format("|".join(patterns)), input_string)
if m:
    pass  # found match

Use (?:\s+|^) and (?:\s+|$) instead of \b if the words are whitespace-separated.
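
Combining everything into one alternation also means a single re.compile() call, and finditer() reports every matching word rather than just the first. This is a sketch assuming the patterns are already valid regexes (for example abandon\w* rather than abandon*); the sample patterns and the re.IGNORECASE flag are assumptions added for illustration:

```python
import re

patterns = [r'abandon\w*', r'abus[ei]\w*', r'aching', r'afraid']

# One compiled regex for the whole list; (?:...) groups the alternatives
# so that \b applies to each of them.
combined = re.compile(r"\b(?:{})\b".format("|".join(patterns)), re.IGNORECASE)

input_string = "Abandoned or abused people are afraid of abusive behavior."
found = [m.group(0) for m in combined.finditer(input_string)]
print(found)
```

The compiled object can be kept around and reused for each new input string.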
