python - Convert command-line arguments to a regular expression

Question

Say, for example, I want to know whether the pattern "\section" occurs in the text "abcd\sectiondefghi". Of course, I can do this:

import re

motif = r"\\section"
txt = r"abcd\sectiondefghi"
pattern = re.compile(motif)
print(pattern.findall(txt))

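That gives me what I want. However, every time I want to find a new pattern in a new text, I have to change the code, which is painful. So I want to write something more flexible, like this (test.py):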

import re
import sys

motif = sys.argv[1]
txt = sys.argv[2]
pattern = re.compile(motif)
print(pattern.findall(txt))

Then I want to run it from the terminal like this:

python test.py \\section abcd\sectiondefghi

However, this does not work (and I hate having to use \\\\section).

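So, is there any way to convert my user input (from the terminal or from a file) into a Python raw string? Or is there a better way to compile a regular expression pattern from user input?

Thank you very much.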

score 27 · Accepted Answer

Use re.escape() to make sure the input text is treated as literal text in a regular expression:

pattern = re.compile(re.escape(motif))

Demo:

>>> import re
>>> motif = r"\section"
>>> txt = r"abcd\sectiondefghi"
>>> pattern = re.compile(re.escape(motif))
>>> print(pattern.findall(txt))
['\\section']

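re.escape() escapes all non-alphanumeric characters by adding a backslash in front of each one (since Python 3.7 it escapes only characters that can have special meaning in a regular expression, so the '!' in the second example below would be left unescaped there):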

>>> re.escape(motif)
'\\\\section'
>>> re.escape('\n [hello world!]')
'\\\n\\ \\[hello\\ world\\!\\]'
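Combined with the script from the question, the approach above can be sketched roughly as follows. The helper name find_literal is made up for illustration and is not part of the re module; the usage line in the comment assumes a Unix-style shell with single-quoted arguments.

```python
import re
import sys


def find_literal(motif, txt):
    """Return every occurrence of the literal string `motif` in `txt`."""
    # re.escape() neutralizes regex metacharacters such as the backslash,
    # so the user's input is matched character for character.
    pattern = re.compile(re.escape(motif))
    return pattern.findall(txt)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: python test.py '\section' 'abcd\sectiondefghi'
    print(find_literal(sys.argv[1], sys.argv[2]))
```

With this version the user types the pattern exactly as it should appear in the text, with no doubled backslashes, as long as the shell passes the argument through unchanged.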

score 2 · Answer

One way is to use an argument parser such as optparse or argparse.

Your code would look like this:

import re
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-s", "--string", dest="string",
                  help="The string to parse")
parser.add_option("-r", "--regexp", dest="regexp",
                  help="The regular expression")
parser.add_option("-a", "--action", dest="action", default='findall',
                  help="The action to perform with the regexp")

(options, args) = parser.parse_args()

print(getattr(re, options.action)(re.escape(options.regexp), options.string))

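An example of using it (note that this output requires passing options.regexp to re without the re.escape() call, since an escaped pattern would be matched as literal text):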

> code.py -s "this is a string" -r "this is a (\S+)"
['string']

Using your example:

> code.py -s "abcd\sectiondefghi" -r "\section"
['\\section'] 
# remember, this is a python list containing a string, the extra \ is okay.
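Since the answer also mentions argparse, here is a rough equivalent using that module. It is only a sketch: the -l/--literal flag, the restricted list of actions, and the function names build_parser and run are assumptions added for illustration, not something the original code had.

```python
import argparse
import re


def build_parser():
    parser = argparse.ArgumentParser(description="Search a string with a regexp")
    parser.add_argument("-s", "--string", required=True,
                        help="The string to parse")
    parser.add_argument("-r", "--regexp", required=True,
                        help="The regular expression")
    # Assumption: restrict the action to a few known re functions
    # instead of passing arbitrary attribute names to getattr().
    parser.add_argument("-a", "--action", default="findall",
                        choices=["findall", "search", "match"],
                        help="The action to perform with the regexp")
    # Assumption: a hypothetical flag that decides whether the pattern
    # is escaped and therefore matched as literal text.
    parser.add_argument("-l", "--literal", action="store_true",
                        help="Treat the regexp as literal text")
    return parser


def run(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and apply the chosen re function."""
    options = build_parser().parse_args(argv)
    regexp = re.escape(options.regexp) if options.literal else options.regexp
    return getattr(re, options.action)(regexp, options.string)
```

Calling print(run()) at the bottom of the script would make it behave like the optparse version, with the explicit flag making it clear whether the pattern is a regular expression or literal text.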

score 2 · Answer

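So, just to be clear: is the thing you are searching for ("\section" in your example) supposed to be a regular expression or a literal string? If it is the latter, the re module is not really the right tool for the task; given a search string needle and a target string haystack, you can do: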

# is it in there
needle in haystack

# how many copies are there
n = haystack.count(needle)
# where is it
ix = haystack.find(needle)

All of these are more efficient than the regex-based versions.
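As a concrete illustration, here are the three operations applied to the strings from the question. The variable names found, count, and index are chosen here for the example only.

```python
needle = r"\section"
haystack = r"abcd\sectiondefghi"

found = needle in haystack        # membership test
count = haystack.count(needle)    # number of non-overlapping occurrences
index = haystack.find(needle)     # index of the first occurrence, or -1

print(found, count, index)
```

For these values the three results are True, 1, and 4, and no escaping of the backslash is needed because nothing is interpreted as a pattern.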

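re.escape is still useful if you need to insert a literal fragment into a larger regular expression at runtime, but if you end up writing re.compile(re.escape(needle)), in most cases there are better tools for the job.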
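A small sketch of that first case, where a literal fragment is embedded in a larger pattern. The function name section_titles and the LaTeX-like sample text are assumptions made up for this example.

```python
import re


def section_titles(command, text):
    """Find the brace-delimited argument that follows a literal command."""
    # Only the user-supplied fragment is escaped; the rest of the
    # pattern keeps its regex meaning.
    pattern = re.compile(re.escape(command) + r"\{([^}]*)\}")
    return pattern.findall(text)


text = r"\section{Intro} some text \section{Methods}"
print(section_titles(r"\section", text))
```

Here plain string methods would not be enough, because the part after the literal command really is a pattern.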

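EDIT: I am beginning to suspect that the real issue here is the shell's escaping rules, which have nothing to do with Python or raw strings. That is, if you type: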

python test.py \\section abcd\sectiondefghi

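in a Unix-style shell, the "\\section" part is converted to "\section" by the shell before Python ever sees it. The simplest way around this is to tell the shell to skip the unescaping, which you can do by putting the arguments inside single quotes: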

python test.py '\\section' 'abcd\sectiondefghi'

Compare and contrast:

$ python -c "import sys; print(','.join(sys.argv))" test.py \\section abcd\sectiondefghi
-c,test.py,\section,abcdsectiondefghi

$ python -c "import sys; print(','.join(sys.argv))" test.py '\\section' 'abcd\sectiondefghi'
-c,test.py,\\section,abcd\sectiondefghi

(Explicitly using print on the joined string here to avoid having repr add even more confusion...)
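The same effect can be reproduced without a terminal by using the standard library's shlex module, which tokenizes a command line with POSIX-like quoting rules. This is only an approximate model of a real shell, and the variable names are chosen for this sketch.

```python
import shlex

# What the user types after "python test.py", as raw text.
unquoted = r"\\section abcd\sectiondefghi"
quoted = r"'\\section' 'abcd\sectiondefghi'"

# shlex.split() applies POSIX-like quoting rules to a command line:
# an unquoted backslash escapes the next character, while single
# quotes keep everything between them unchanged.
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(",".join(unquoted_args))
print(",".join(quoted_args))
```

The first line printed is \section,abcdsectiondefghi and the second is \\section,abcd\sectiondefghi, matching the two shell sessions shown above.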
