python - Remove all div tags from an HTML string

Question

I am trying to strip out all divs.

Input:

<p>111</p>

<div class="1334">bla</div>

<p>333</p>

<p>333</p>

<div some unkown stuff>bla2</div>

Expected output:

    <p>111</p>

    <p>333</p>

    <p>333</p>

I tried this, but it doesn't work:

release_content = re.sub("/<div>.*<\/div>/s", "", release_content)

score 8 · Accepted Answer

Don't use a regular expression for this problem. Use an HTML parser. Here is a solution in Python using BeautifulSoup:

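from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

# Read the HTML document to clean
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

soup = BeautifulSoup(content, 'html.parser')
# Remove every <div> together with its contents
for div in soup.find_all('div'):
    div.extract()

# Write the cleaned markup to a new file
with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))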
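
If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser. This is only a rough sketch: DivStripper and strip_divs are hypothetical names made up for illustration, and it assumes the input contains only tags, text and entities (comments and doctypes are simply dropped).

```python
from html.parser import HTMLParser


class DivStripper(HTMLParser):
    """Re-emit the input while skipping every <div> and its contents."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # number of <div> elements we are currently inside
        self.parts = []  # pieces of markup to keep

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing div is dropped.
        if tag != "div" and self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_divs(html):
    """Return html with all <div> elements (nested ones too) removed."""
    parser = DivStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p>111</p><div class="1334">bla<div>inner</div>more</div><p>333</p>'
print(strip_divs(sample))
```

Because the parser tracks nesting depth, nested divs and divs with arbitrary attributes are handled without any special-casing.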

score 2 · Answer

Regular expression patterns in Python don't need any delimiters:

release_content = re.sub(r"<div>.*</div>", "", release_content)

Are you sure the divs don't have any attributes? And what happens with nested divs?
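
A small sketch of both points, using a made-up html string for illustration: the slashes and the trailing s from the original attempt are treated as literal characters, and a bare <div> pattern does not match a div that carries attributes. If you want the effect of the /s modifier, Python takes it through the flags argument (re.DOTALL).

```python
import re

html = '<p>111</p><div>bla</div><p>333</p>'

# PHP/Perl-style delimiters are taken literally, so this searches for the
# characters "/<div>" ... "</div>/s" and leaves the string untouched.
with_delims = re.sub(r"/<div>.*</div>/s", "", html)

# Without delimiters; "dot matches newline" is requested via flags instead.
without_delims = re.sub(r"<div>.*</div>", "", html, flags=re.DOTALL)

# A div with attributes is not matched by the bare "<div>" pattern.
with_attrs = '<div class="1334">bla</div>'
unchanged = re.sub(r"<div>.*</div>", "", with_attrs, flags=re.DOTALL)

print(with_delims == html)   # nothing was removed
print(without_delims)
print(unchanged == with_attrs)
```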

score 2 · Answer

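You are using the greedy quantifier, *. It tries to match as much as possible before stopping. You can try the non-greedy version, *?, instead. As long as you don't have nested <div> tags, that will work.

release_content = re.sub(r"(?s)<div>.*?</div>", "", release_content)

If you can have nested <div> tags, then you will want to use an HTML library such as BeautifulSoup.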
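
To make the difference concrete, here is a short sketch on two made-up strings (html and nested are just illustrative names). The greedy pattern swallows everything between the first <div> and the last </div>, the non-greedy one stops at the nearest </div>, and that same behavior is what breaks on nested divs.

```python
import re

html = '<div>a</div><p>keep</p><div>b</div>'

# Greedy: .* runs to the LAST </div>, taking the <p> with it.
greedy = re.sub(r"(?s)<div>.*</div>", "", html)

# Non-greedy: .*? stops at the FIRST </div> after each <div>.
lazy = re.sub(r"(?s)<div>.*?</div>", "", html)

# Nested divs: the match ends at the inner </div>, leaving debris behind.
nested = '<div>outer<div>inner</div>tail</div>'
lazy_nested = re.sub(r"(?s)<div>.*?</div>", "", nested)

print(repr(greedy))       # ''
print(repr(lazy))         # '<p>keep</p>'
print(repr(lazy_nested))  # 'tail</div>'
```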

Based on your edit, to account for attributes, you can simply modify the leading part of the <div> pattern:

release_content = re.sub(r"(?s)<div(?: [^>]*)?>.*?</div>", "", release_content)
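
As a quick check, here is that pattern run against the input from the question. The blank-line cleanup at the end is my own addition, not part of the answer; it only tidies the empty lines left where the divs used to be. Keep in mind this still assumes no nested divs and lowercase tag names with a space before any attributes.

```python
import re

release_content = """<p>111</p>
<div class="1334">bla</div>
<p>333</p>
<p>333</p>
<div some unkown stuff>bla2</div>
"""

# <div, optionally followed by a space and attributes, then the shortest
# possible body up to the next </div>. (?s) lets . match newlines.
pattern = re.compile(r"(?s)<div(?: [^>]*)?>.*?</div>")
stripped = pattern.sub("", release_content)

# Removing a div leaves its line empty; drop those blank lines.
cleaned = "\n".join(line for line in stripped.splitlines() if line.strip())

print(cleaned)
```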
