python - How can I capture all tags inside a specific tag with a regular expression?

Question

For example, given markup like this:

<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>

What I want is to turn it into:

<tag1 blablablah>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext</tag1>

I am using this regular expression for the search (it works both in Notepad++ and with Python's re.compile function):

(<tag1[^>]*>.*?)(<[^>]*>.*?)(.*?</tag1>)

and this for the replacement (it also works with re.sub):

\1<XXX>\2</XXX>\3

But it only finds and changes the first occurrence, not all of them:

<tag1 blablablah>sometext<XXX><i></XXX>sometext</i>sometext<i>sometext</i>sometext</tag1>
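
The behavior can be reproduced in Python with a minimal sketch (the variable names are illustrative and not part of the original post). Because the pattern spans from the opening `<tag1 ...>` to the closing `</tag1>`, one match consumes the whole element, so there is nothing left for a second match:

```python
import re

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
pattern = re.compile(r'(<tag1[^>]*>.*?)(<[^>]*>.*?)(.*?</tag1>)')

# subn returns the new string together with the number of substitutions made
result, count = pattern.subn(r'\1<XXX>\2</XXX>\3', s)
print(count)   # number of matches that were replaced
print(result)
```

Checking `count` makes it easy to tell "the pattern matched only once" apart from "the replacement was applied only once".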

Can anyone help me with this?

score 2 · Accepted Answer

Try this:

<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>

Explanation

"
<              # Match the character “<” literally
(              # Match the regular expression below and capture its match into backreference number 1
   (?:            # Match the regular expression below
      [a-z]          # Match a single character in the range between “a” and “z”
         +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      :              # Match the character “:” literally
   )?             # Between zero and one times, as many times as possible, giving back as needed (greedy)
   [a-z]          # Match a single character in the range between “a” and “z”
   \w             # Match a single character that is a “word character” (letters, digits, and underscores)
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b             # Assert position at a word boundary
[^<>]          # Match a single character NOT present in the list “<>”
   +?             # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
>              # Match the character “>” literally
(              # Match the regular expression below and capture its match into backreference number 2
   .              # Match any single character that is not a line break character
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
</             # Match the characters “</” literally
\1             # Match the same text as most recently matched by capturing group number 1
>              # Match the character “>” literally
"
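
One possible way to use this pattern from Python is sketched below. The answer gives only the regex; the helper name `wrap_inner_tags`, the second pattern `INNER_TAG`, and the two-step approach (capture the element's content as group 2, then wrap every tag inside it) are assumptions made for illustration:

```python
import re

# Pattern from the answer: group 1 is the tag name, group 2 is the element's content.
ELEMENT = re.compile(r'<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>')
# Assumed helper pattern: any single tag inside the captured content.
INNER_TAG = re.compile(r'<[^>]*>')

def wrap_inner_tags(text):
    """Wrap every tag inside a matched element in <XXX>...</XXX> (hypothetical helper)."""
    def repl(m):
        src = m.string
        opening = src[m.start():m.start(2)]   # everything before the content
        closing = src[m.end(2):m.end()]       # everything after the content
        inner = INNER_TAG.sub(lambda t: '<XXX>' + t.group(0) + '</XXX>', m.group(2))
        return opening + inner + closing
    return ELEMENT.sub(repl, text)

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
print(wrap_inner_tags(s))
```

Note that, as written, the pattern requires at least one character after the tag name (`[^<>]+?`) and a tag name of at least two characters, so an attribute-less element such as `<i>...</i>` is not matched as an outer element.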

score 0 · Accepted Answer


Try changing your pattern like this:

(<tag1[^>]*>).*?(<[^>]+>).*?(</tag1>)
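
A quick way to see what the three groups capture on the sample string from the question (a sketch; `s` and `m` are illustrative names, not part of the answer). The text matched by the two `.*?` parts lies outside the groups, so a replacement built only from `\1`, `\2` and `\3` would not include it:

```python
import re

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
m = re.search(r'(<tag1[^>]*>).*?(<[^>]+>).*?(</tag1>)', s)

# Each group holds exactly one tag; the text in between is matched but not captured.
print(m.groups())
```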

answered 2012-05-29T16:50:59.277

score 0 · Accepted Answer

The problem is avoiding the first and last tags. If you split them off, it becomes simple:

import re

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
# Split into opening tag, inner content, and closing tag
start, end = s.find('>') + 1, s.rfind('<')
s_list = [s[:start], s[start:end], s[end:]]
# Wrap every tag found in the inner content only
s_list[1] = re.sub(r'(<[^>]*>)', r'<XXX>\1</XXX>', s_list[1])
print(''.join(s_list))

It's not a one-liner, though.

Alternatively, you could do this:

print(re.sub(r'([^(^<)])(<[^>]*>(?!$))', r'\1<XXX>\2</XXX>', s))

Note that this only works if your outermost tags are at the very start and end of the string.
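
If the outer element can appear anywhere in a longer string, one option is to match each `tag1` element and rewrite only its content in a replacement function. This is a minimal sketch, not part of the original answer: the names `TAG1`, `ANY_TAG` and `wrap_tags_in_tag1` are hypothetical, and it assumes that `tag1` elements are not nested and that no other tag name begins with `tag1`:

```python
import re

# Opening tag, content (lazy, may span lines), closing tag.
TAG1 = re.compile(r'(<tag1[^>]*>)(.*?)(</tag1>)', re.DOTALL)
ANY_TAG = re.compile(r'<[^>]*>')

def wrap_tags_in_tag1(text):
    """Wrap each tag inside every <tag1>...</tag1> element in <XXX>...</XXX>."""
    def repl(m):
        opening, body, closing = m.groups()
        # \g<0> refers to the whole inner tag that was matched
        return opening + ANY_TAG.sub(r'<XXX>\g<0></XXX>', body) + closing
    return TAG1.sub(repl, text)

sample = 'before <tag1 a>x<i>y</i>z</tag1> between <b>kept</b> <tag1>p<i>q</i></tag1>'
print(wrap_tags_in_tag1(sample))
```

Tags outside any `tag1` element, such as `<b>` in the sample, are left untouched because the inner substitution only ever sees the captured content.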
