python - How to remove a specific part of an HTML file in Python

Question

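I am working with an HTML file that contains Item 1, Item 2, and Item 3. I want to delete all the text after the last Item 2. The file may contain more than one Item 2. I am using the following, but it does not work: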

text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""

>>> a=re.search ('(?<=<B>)Item&nbsp;2.',text)
>>> b= a.group(0)
>>> newText= text.partition(b)[0]
>>> newText
'<A href="#106">'

It removes the text after the first Item 2 rather than after the second one.

score 1 · Accepted Answer

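I would use BeautifulSoup to parse the HTML and modify it. You will probably want the decompose() or extract() method.

BeautifulSoup is a good fit because it copes very well with malformed HTML.

For your specific example: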

>>> import bs4
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text)
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body><a href="#106">Item 2. <b>Item 2. Properties</b></a></body></html>
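When the file contains several Item 2 headings, the same idea can be applied to the last one. The following is only a minimal sketch under stated assumptions: the headings are <b> elements whose text starts with "Item 2.", the text to remove consists of siblings that follow that element, and the sample HTML is made up for illustration. The "html.parser" backend is chosen here only so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two "Item 2" headings (not from the original file).
html = (
    "<p>Item&nbsp;1. Business</p>"
    "<b>Item&nbsp;2. Properties</b> first part "
    "<b>Item&nbsp;2. Properties</b> trailing text <i>more</i>"
)
soup = BeautifulSoup(html, "html.parser")

# &nbsp; is parsed as U+00A0, so normalise it before comparing text.
headings = [
    b for b in soup.find_all("b")
    if b.get_text().replace("\xa0", " ").startswith("Item 2.")
]

if headings:
    last = headings[-1]
    # Copy the siblings into a list first, because extract() mutates the tree.
    for node in list(last.next_siblings):
        node.extract()

result = str(soup)
print(result)
```

If the last heading is nested more deeply than the content that follows it, removing its siblings is not enough, and the parent elements would need the same treatment.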

If you really want to use a regular expression, a non-greedy regex will work for your example:

>>> import re
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> m = re.match(r".*?Item&nbsp;2\.", text)
>>> m.group(0)
'<A href="#106">Item&nbsp;2.'
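Note that the non-greedy match stops at the first "Item&nbsp;2.". If the goal is to keep everything up to the last occurrence, as the question describes, one possible approach is str.rpartition or a greedy pattern. This is a sketch on the same sample string, not a general HTML solution; the variable names are illustrative only.

```python
import re

text = ('<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> '
        'this is an example this is an example')
marker = "Item&nbsp;2."

# Option 1: rpartition splits at the LAST occurrence of the marker.
head, sep, _tail = text.rpartition(marker)
kept_partition = head + sep if sep else text  # leave text alone if no marker

# Option 2: a greedy ".*" backtracks from the end, so it also ends at the
# last occurrence. DOTALL lets "." cross line breaks in a multi-line file.
m = re.match(r".*Item&nbsp;2\.", text, flags=re.DOTALL)
kept_regex = m.group(0) if m else text

print(kept_partition)
```

Both variants cut immediately after the marker itself, so the rest of the heading (" Properties</B>" here) is dropped as well; extend the marker or the pattern if that part should be kept.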
