python - Python grouping backreferences

Question

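I'm cleaning up some HTML that probably came from WYSIWYG output. For sanity's sake, I'd like to get rid of a bunch of empty formatting tags.

For example: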

<em></em> Here's some text <strong>   </strong> and here's more <em> <span></span></em>

Thanks to Regular-Expressions.info, I have a neat regular expression with a backreference that unwraps one layer at a time:

import re

# Returns a string minus one level of empty formatting tags
def remove_empty_html_tags(input_string):
    # (?P<tag>...) is also group 1, so the whitespace to keep is group 2
    return re.sub(r'<(?P<tag>strong|span|em)\b[^>]*>(\s*)</(?P=tag)>', r'\2', input_string)

However, I'd like to be able to unwrap all the layers of <em> <span></span></em> at once, and there could be 5 or more levels of nested empty tags.

Is there a way to group the backref a la (?:<?P<tagBackRef>strong|span|em)\b[^>]>(\s)*)+ (or something) and use it later with (</(?P=tagBackRef>)+ to remove multiple nested but matching empty HTML tags?

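For posterity:

This was probably an XY problem, where the tool I was hoping to use for the result I wanted is not the one anyone else would have chosen. Henry's answer answers the question, but he and everyone else would point you to an HTML parser rather than a regex for parsing HTML. =)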

score 4 · Accepted Answer

This is much easier to do with an HTML parser such as BeautifulSoup, for example:

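from bs4 import BeautifulSoup

# The <html><body> wrapper in the output below is what the lxml parser adds,
# so it is named explicitly here (this requires lxml to be installed).
soup = BeautifulSoup("""
<body>
    <em></em> Here's some <span><strong>text</strong></span> <strong>   </strong> and here's more <em> <span></span></em>
</body>
""", "lxml")

# Drop strong/span/em tags that have no child tags and no non-whitespace text
for element in soup.find_all(name=['strong', 'span', 'em']):
    if element.find(True) is None and (not element.string or not element.string.strip()):
        element.extract()

print(soup)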

Prints:

<html><body>
 Here's some <span><strong>text</strong></span>  and here's more <em> </em>
</body></html>

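As you can see, the span, strong and em tags whose content was empty (or only whitespace) were removed. The outer <em> </em> is still there because it still contained the <span> when the loop visited it.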
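Because the loop visits tags in document order, an outer tag that only wraps another empty tag (the <em> <span></span></em> case from the question) is checked before its child has been removed, which is why <em> </em> survives in the printed output. Below is a minimal sketch of one way around that, assuming it is acceptable to walk the matches in reverse so that children are handled before their parents. The function name strip_empty_tags and the use of the built-in "html.parser" are illustrative choices, not part of the answer above.

```python
from bs4 import BeautifulSoup


def strip_empty_tags(html, names=("strong", "span", "em")):
    """Remove the named tags when they hold no child tags and no visible text.

    Hypothetical helper for illustration; "html.parser" is the stdlib-backed
    parser, so no <html>/<body> wrapper is added to fragments.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns matches in document order, and a descendant always
    # comes after its ancestors, so reversing visits the innermost tags first.
    for element in reversed(soup.find_all(list(names))):
        # By the time a parent is checked, its empty children are already gone.
        if element.find(True) is None and not element.get_text().strip():
            element.extract()
    return str(soup)


sample = ("<em></em> Here's some <span><strong>text</strong></span> "
          "<strong>   </strong> and here's more <em> <span></span></em>")
print(strip_empty_tags(sample))
```

Unlike the regex in the question, this drops the whitespace that was inside a removed tag rather than keeping it.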

See also:

Remove/delete/extract empty tags

score 1 · Accepted Answer

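If you really don't want to use an HTML parser, and you aren't too concerned with speed (I assume you aren't, or you wouldn't be using regexes to clean up your HTML), you could just modify the code you've already written. Put your substitution in a loop (or recursion; your preference) and return once a pass no longer changes anything.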

import re

# Returns a string minus all levels of empty formatting tags
def remove_empty_html_tags(input_string):
    # Group 1 is the tag name (named groups are numbered too); group 2 is the whitespace to keep
    matcher = r'<(?P<tag>strong|span|em)\b[^>]*>(\s*)</(?P=tag)>'
    old_string = input_string
    new_string = re.sub(matcher, r'\2', old_string)
    # Repeat until a pass changes nothing, i.e. no empty tags are left
    while new_string != old_string:
        old_string = new_string
        new_string = re.sub(matcher, r'\2', new_string)
    return new_string
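For what it's worth, the same repeat-until-stable loop can be written a little more compactly with re.subn, which returns the new string together with the number of substitutions made. This is only a sketch: the names EMPTY_TAG and remove_empty_html_tags_subn are made up for illustration. Since named groups in Python's re are numbered as well, the tag name is group 1 and the inner whitespace is group 2, so the sketch names the whitespace group (ws) to avoid counting groups at all.

```python
import re

# Hypothetical names for illustration. "tag" captures the element name and
# "ws" captures the whitespace between the opening and closing tag.
EMPTY_TAG = re.compile(r'<(?P<tag>strong|span|em)\b[^>]*>(?P<ws>\s*)</(?P=tag)>')


def remove_empty_html_tags_subn(input_string):
    """Strip empty strong/span/em tags at every nesting level, keeping whitespace."""
    count = 1
    while count:
        # subn() returns (new_string, number_of_replacements); each pass peels
        # off one level, and a pass with zero replacements ends the loop.
        input_string, count = EMPTY_TAG.subn(r'\g<ws>', input_string)
    return input_string


sample = ("<em></em> Here's some text <strong>   </strong> "
          "and here's more <em> <span></span></em>")
print(remove_empty_html_tags_subn(sample))
```

Each pass only removes the innermost empty tags, so n levels of nesting take n passes plus one final pass that finds nothing.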
