
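I'm getting false positives in Python for the following example. I'm trying to find out whether a keyword exists in a string. The problem is that the words in the string are usually joined by underscores or hyphens, so I only want a positive result when the keyword is not part of a longer word. A match counts as true if the keyword is surrounded by hyphens, underscores, or anything else that is not a letter (or sits at the start or end of the string). Typically it will be surrounded by underscores or hyphens. The match is also case-insensitive.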

test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']

The result should be this list of True/False values:

[True, True, True, True, True, True, False, False, False, False, True]

Implementation:

key_words = ['uat','dr','test','qa','dev']
for name in test_list:
    if any(x in name.lower() for x in key_words):
        print('True')
    else:
        print('False')

Result:

True
True
True
True
True
True
True
True
True
True
True

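Is there a better way to do this in Python?

If not, how would I do this with a regular expression in Python?

Keep in mind that this runs in a loop over a large dataset, so performance really matters.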

5 Answers

2

Given:

>>> import re
>>> test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']
>>> key_words = ['uat','dr','test','qa','dev']

You can use re.split together with any:

>>> [any(word.lower() in key_words for word in re.split(r'[^a-zA-Z]', s))
...     for s in test_list]
[True, True, True, True, True, True, False, False, False, False, True]

This matches your target output:

>>> tgt=[True, True, True, True, True, True, False, False, False, False, True]
>>> [any(word.lower() in key_words for word in re.split(r'[^a-zA-Z]', s))
...     for s in test_list]==tgt
True
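
Since the question mentions performance, here is a minimal sketch of the same split-based idea with the pattern compiled once and the keywords held in a frozenset, so each lookup avoids a scan of the list. The names NON_LETTERS, KEY_WORDS and has_keyword are made up for illustration, and I have not benchmarked this against the one-liner above.

```python
import re

# Hypothetical names for illustration; they do not appear in the answer above.
KEY_WORDS = frozenset(['uat', 'dr', 'test', 'qa', 'dev'])
NON_LETTERS = re.compile(r'[^a-z]+')  # one or more characters that are not a-z


def has_keyword(name):
    """Return True if any letters-only token of `name` is a keyword."""
    # Lowercase first so the [^a-z] class treats 'DR' like 'dr'.
    tokens = NON_LETTERS.split(name.lower())
    # isdisjoint is False as soon as one token is found in the keyword set.
    return not KEY_WORDS.isdisjoint(tokens)


test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

results = [has_keyword(s) for s in test_list]
print(results)
```

Leading or trailing separators produce empty-string tokens, which are harmless here because the empty string is not a keyword.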
answered 2015-11-19T02:34:31.953
1

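Use a regex based on negative lookarounds (a negative lookbehind before the keyword and a negative lookahead after it).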

>>> import re
>>> test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']
>>> key_words = ['uat','dr','test','qa','dev']
>>> [True if re.search(r'(?i)(?<![a-z])(?:' + '|'.join(key_words) + ')(?![a-z])', i) else False for i in test_list]
[True, True, True, True, True, True, False, False, False, False, True]
>>> 
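
For a large dataset it may help to build and compile the pattern once instead of on every iteration. A sketch under that assumption follows; KEYWORD_RE and matches_keyword are illustrative names, and re.escape is only there in case a keyword ever contains a regex metacharacter.

```python
import re

key_words = ['uat', 'dr', 'test', 'qa', 'dev']

# Escape each keyword so characters such as '.' or '+' are matched literally.
alternation = '|'.join(re.escape(k) for k in key_words)

# (?<![a-z]) : the keyword must not be preceded by a letter
# (?![a-z])  : the keyword must not be followed by a letter
# re.IGNORECASE makes both the keywords and the [a-z] classes case-insensitive.
KEYWORD_RE = re.compile(r'(?<![a-z])(?:' + alternation + r')(?![a-z])',
                        re.IGNORECASE)


def matches_keyword(name):
    """Return True if `name` contains a keyword that is not part of a longer word."""
    return KEYWORD_RE.search(name) is not None


test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

results = [matches_keyword(s) for s in test_list]
print(results)
```

Because the lookarounds only reject letters, a digit next to a keyword (for example 'server_dr2') still counts as a match, which is what the question asks for.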
answered 2015-11-19T01:59:32.907
0
import re

key_words = ['uat','dr','test','qa','dev']
test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 
             'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']



def check(word):
    parts = re.split('[^a-z]', word.lower())
    return any(part in key_words for part in parts)

print([check(item) for item in test_list])
answered 2015-11-19T02:21:27.193
0

I think this pattern is easy to understand and modify:

import re

pattern = r'.*(^|[^a-z])({names})([^a-z]|$).*'.format(names='|'.join(key_words))

# .*(^|[^a-z])(uat|dr|test|qa|dev)([^a-z]|$).*

for name in test_list:
    print(bool(re.search(pattern, name, re.IGNORECASE)))
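
One observation on the pattern above: re.search already scans the whole string, so the leading and trailing .* should not be needed for a plain yes/no answer. The sketch below compiles both variants once and compares them on the question's data; with_wildcards and without_wildcards are names I made up for this comparison.

```python
import re

key_words = ['uat', 'dr', 'test', 'qa', 'dev']
test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

names = '|'.join(key_words)

# The pattern from the answer, compiled once.
with_wildcards = re.compile(
    r'.*(^|[^a-z])({})([^a-z]|$).*'.format(names), re.IGNORECASE)

# Same boundary logic without the surrounding .* and with non-capturing groups.
without_wildcards = re.compile(
    r'(?:^|[^a-z])(?:{})(?:[^a-z]|$)'.format(names), re.IGNORECASE)

a = [bool(with_wildcards.search(n)) for n in test_list]
b = [bool(without_wildcards.search(n)) for n in test_list]
print(a == b)
```

Both variants give the same True/False list for these inputs; which one is faster on real data would need to be measured.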
answered 2015-11-19T02:05:44.983
0

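Another approach is to use \b to detect word boundaries. Unfortunately, _ is considered a word character, so we need to match either \b or _.

Not as concise or efficient as Avinash's solution, but perhaps more readable.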

import re

test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

key_words = ['uat','dr','test','qa','dev']

for name in test_list:
    for kw in key_words:
        regex = r'(\b|_)'+kw+r'(\b|_)'
        if re.search(regex, name, re.IGNORECASE):
            print('True')
            break  # exit "for kw" loop
    else:  # only executed if "for kw" loop exits via exhaustion, not via break
        print('False')
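
If the per-keyword loop turns out to be too slow, the same \b-or-underscore idea can be folded into one precompiled pattern. This is only a sketch: BOUNDARY_RE is an illustrative name, and re.escape is an extra precaution that the loop above does not take.

```python
import re

key_words = ['uat', 'dr', 'test', 'qa', 'dev']
test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

# One alternation of all keywords, each escaped so it is matched literally.
alternation = '|'.join(re.escape(kw) for kw in key_words)

# (?:\b|_) accepts either a word boundary or a literal underscore on each side.
BOUNDARY_RE = re.compile(r'(?:\b|_)(?:' + alternation + r')(?:\b|_)',
                         re.IGNORECASE)

results = [BOUNDARY_RE.search(name) is not None for name in test_list]
print(results)
```

One caveat with \b: digits are word characters too, so a name like 'server2dr' does not match here, whereas the lookaround-based pattern would accept it.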
answered 2015-11-19T02:05:23.537