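While building a lightweight tool that detects censored profanity, I noticed that matching special characters at the end of a word boundary is surprisingly difficult.
Using a tuple of strings, I build an OR'd word-boundary regex: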
import re
PHRASES = (
'sh\\*t', # easy
'sh\\*\\*', # difficult
'f\\*\\*k', # easy
'f\\*\\*\\*', # difficult
)
MATCHER = re.compile(
r"\b(%s)\b" % "|".join(PHRASES),
flags=re.IGNORECASE | re.UNICODE)
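The problem is that a * sitting next to the word boundary \b is never detected. \b only matches between a word character and a non-word character, and * is a non-word character, so there is no boundary between a trailing * and the space, punctuation, or end of string that follows it.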
print(MATCHER.search('Well f*** you!')) # Fail - Does not find f***
print(MATCHER.search('Well f***!')) # Fail - Does not find f***
print(MATCHER.search('f***')) # Fail - Does not find f***
print(MATCHER.search('f*** this!')) # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***')) # Pass - Should not match
print(MATCHER.search('f**k this!')) # Pass - Should find
Any ideas for a convenient way to set this up so that it supports phrases ending in special characters?
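One direction that might work, as a minimal sketch rather than a definitive fix: replace both \b anchors with lookarounds that only require that no word character sits directly before or after the phrase. The name LOOKAROUND_MATCHER is hypothetical, and the sketch assumes that "not adjacent to a word character" is the intended meaning of the boundary.

```python
import re

PHRASES = (
    r'sh\*t',
    r'sh\*\*',
    r'f\*\*k',
    r'f\*\*\*',
)

# Assumption: a phrase counts as a whole "word" when it is neither preceded
# nor followed by a word character. Unlike \b, these lookarounds do not need
# the phrase itself to start or end with a word character.
LOOKAROUND_MATCHER = re.compile(
    r"(?<!\w)(?:%s)(?!\w)" % "|".join(PHRASES),
    flags=re.IGNORECASE | re.UNICODE)

print(LOOKAROUND_MATCHER.search('Well f*** you!'))          # finds f***
print(LOOKAROUND_MATCHER.search('f***'))                    # finds f***
print(LOOKAROUND_MATCHER.search('secret code is 123f***'))  # None
print(LOOKAROUND_MATCHER.search('f**k this!'))              # finds f**k
```

With this pattern the phrases still have to be escaped by hand as above; if they were stored as plain text instead, re.escape could build the escaped alternatives.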