python - Matching a string that is arbitrarily split across multiple lines

Question

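Is there a way, using regular expressions, to match a string that has been arbitrarily split across multiple lines? Say we have the following format in a file: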

msgid "This is "
"an example string"
msgstr "..."

msgid "This is an example string"
msgstr "..."

msgid ""
"This is an " 
"example" 
" string"
msgstr "..."

msgid "This is " 
"an unmatching string" 
msgstr "..."

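We would like a pattern that matches all of the example strings, i.e. one that matches the string regardless of how it is split across lines. Note that we are after the specific string shown in the samples, not just any string. So in this case we want to match the string "This is an example string".

Of course, we could simply concatenate the strings and then apply the match, but it made me wonder whether this is possible with a regex alone. I am asking about Python regular expressions, but a general answer is fine too.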
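For comparison, the concatenation approach mentioned above could look roughly like the sketch below. The names `MSGID_RE`, `FRAGMENT_RE` and `msgids` are made up for illustration, and the sketch assumes the quoted fragments contain no escaped quotes.

```python
import re

# Sample text mirroring the entries in the question.
sample = '''msgid "This is "
"an example string"
msgstr "..."

msgid ""
"This is an "
"example"
" string"
msgstr "..."

msgid "This is "
"an unmatching string"
msgstr "..."
'''

# A msgid entry: the keyword followed by one or more quoted fragments,
# separated only by whitespace (which includes newlines).
# Assumption: fragments never contain an escaped quote (\").
MSGID_RE = re.compile(r'msgid((?:\s*"[^"]*")+)')
FRAGMENT_RE = re.compile(r'"([^"]*)"')

def msgids(text):
    """Yield each msgid with its quoted fragments joined together."""
    for block in MSGID_RE.finditer(text):
        yield "".join(FRAGMENT_RE.findall(block.group(1)))

target = "This is an example string"
joined = list(msgids(sample))
matches = [m for m in joined if m == target]
print(len(matches), "of", len(joined), "msgids match")
```

This sidesteps the line-splitting problem entirely, at the cost of a parsing step before the comparison.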

score 4 · Accepted Answer

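Do you want to match a sequence of words? If so, you can look for the words with only whitespace (\s) between them, since \s matches newlines as well as spaces.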

import re

search_for = "This is an example string"
# Join the words with \s+ so any run of whitespace (including newlines) is accepted.
search_for_re = r"\b" + r"\s+".join(search_for.split()) + r"\b"
pattern = re.compile(search_for_re)

def match(s):
    # pattern.match() only matches at the start of s.
    return pattern.match(s) is not None

s = "This is an example string"
print(match(s), ":", repr(s))

s = "This is an \n example string"
print(match(s), ":", repr(s))

s = "This is \n an unmatching string"
print(match(s), ":", repr(s))

Prints:

True : 'This is an example string'
True : 'This is an \n example string'
False : 'This is \n an unmatching string'
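One caveat with building the pattern this way: if the phrase contains regex metacharacters such as `(` or `?`, they would be interpreted as regex syntax. A possible variation, offered only as a sketch, escapes each word with `re.escape` and swaps `\b` for lookarounds so that a phrase may begin or end with punctuation. The helper name `build_pattern` is hypothetical.

```python
import re

def build_pattern(phrase):
    """Compile a pattern matching `phrase` with any whitespace between words."""
    # Escape each word so characters like ( ) ? are matched literally.
    words = [re.escape(word) for word in phrase.split()]
    # (?<!\w) and (?!\w) behave like \b for ordinary words, but also work
    # when the phrase starts or ends with a non-word character.
    return re.compile(r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)")

pattern = build_pattern("Is this (a) test?")
print(pattern.search("Is this\n(a)   test?") is not None)
print(pattern.search("Is this a test") is not None)
```

Using `search` rather than `match` also lets the phrase appear anywhere in a longer text.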

score 0 · Accepted Answer

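This is a bit tricky, because every line needs its own quotes and empty lines are allowed. Here is a regex that correctly matches the file you posted: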

'(""\n)*"This(( "\n(""\n)*")|("\n(""\n)*" )| )is(( "\n(""\n)*")|("\n(""\n)*" )| )an(( "\n(""\n)*")|("\n(""\n)*" )| )example(( "\n(""\n)*")|("\n(""\n)*" )| )string"'

It looks a bit confusing, but it is just the string you want to match, prefixed with:

(""\n)*"

and with the space between each pair of words replaced by:

(( "\n(""\n)*")|("\n(""\n)*" )| )

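After each word, it checks for three different possibilities: either "space, quote, newline, (any number of empty strings), quote", or the same sequence with the space at the end instead of the start, or just a single space.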
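To see the three alternatives in isolation, the separator can be compiled on its own and tried against small inputs. This is only an illustrative check; the name `SEP` and the test cases are not part of the original answer.

```python
import re

# The separator from the answer: what may stand between two words.
SEP = re.compile(r'(( "\n(""\n)*")|("\n(""\n)*" )| )')

cases = {
    "space before the line break": ' "\n"',
    "space after the line break": '"\n" ',
    "plain space on one line": ' ',
    "empty string line in between": ' "\n""\n"',
    "line break with no space at all": '"\n"',
}

for label, text in cases.items():
    # fullmatch requires the whole input to be consumed by the separator.
    print(label, "->", SEP.fullmatch(text) is not None)
```

The last case is rejected because every alternative requires exactly one space somewhere between the two words.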

An easier way to make this work is to write a small function that takes the string you are trying to match and returns the regex that will match it:

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

So, if you have the file you posted in a string called "filestring", you get the matches like this:

import re

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

matcher = re.compile(getregex("This is an example string"))

for i in matcher.finditer(filestring):
    print(i.group(0), "\n")

>>> "This is "
    "an example string"

    "This is an example string"

    ""
    "This is an "
    "example"
    " string"

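This regex does not account for the space after "example" in the third msgid, but I assume the file is machine-generated and that this space is a mistake.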
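If stray blanks after a closing quote do need to be tolerated, one possible adjustment is to allow optional spaces or tabs before each newline. The following is only a sketch under that assumption; `getregex_tolerant`, `EOL` and `EMPTY` are hypothetical names, and the words are additionally passed through `re.escape`.

```python
import re

EOL = r'"[ \t]*\n'           # closing quote, optional trailing blanks, newline
EMPTY = r'(?:""[ \t]*\n)*'   # any number of empty "" lines

def getregex_tolerant(string):
    """Like getregex, but allows blanks between a closing quote and the newline."""
    sep = ('(?: ' + EOL + EMPTY + '"'   # space before the line break
           '|' + EOL + EMPTY + '" '     # space after the line break
           '| )')                       # plain space on the same line
    words = [re.escape(word) for word in string.split(" ")]
    return EMPTY + '"' + sep.join(words) + '"'

# Third msgid from the question, with a blank after two of the closing quotes.
text = 'msgid ""\n"This is an " \n"example" \n" string"\nmsgstr "..."\n'
found = re.search(getregex_tolerant("This is an example string"), text)
print(found is not None)
```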
