python - How do I remove newlines only when they appear inside HTML tags?

Question

Sorry, another Python newbie question. I have a string:

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"

I want:

my_string = "<p>this is some fun</p>And this is \n some more fun!"

In other words, how can I get rid of the newline only when it appears inside an HTML tag?

I have:

my_string = re.sub('<(.*?)>(.*?)\n(.*?)</(.*?)>', 'replace with what???', my_string)

This obviously doesn't work, but I'm stuck.

score 5 · Accepted Answer

Regular expressions are not a good fit for HTML. Don't do it. See RegEx match open tags except XHTML self-contained tags.

Instead, use an HTML parser. Python ships with html.parser, or you can use Beautiful Soup or html5lib. Then all you have to do is walk the tree and remove the newlines.
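As a rough illustration of the parser approach using only the standard library, the sketch below subclasses html.parser.HTMLParser, tracks how many tags are currently open, and rewrites text only while that count is above zero. The names NewlineStripper, strip_newlines_in_tags and VOID_TAGS are made up for this example. It also assumes that a newline together with its surrounding whitespace should collapse to a single space, which matches the output asked for in the question but may not suit every case.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NewlineStripper(HTMLParser):
    """Re-emit the input, collapsing newlines in text nested inside a tag."""

    def __init__(self):
        # convert_charrefs=False routes entities such as &amp; to the
        # handle_entityref / handle_charref handlers instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # how many tags are currently open
        self.parts = []  # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, depth is unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth > 0:
            # Assumption: a newline plus surrounding whitespace becomes one space.
            data = re.sub(r"\s*\n\s*", " ", data)
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)


def strip_newlines_in_tags(html):
    """Return html with newlines collapsed only inside tags (hypothetical helper)."""
    parser = NewlineStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
result = strip_newlines_in_tags(my_string)
print(repr(result))  # '<p>this is some fun</p>And this is \n some more fun!'
```

This sketch does not special-case elements such as pre or script, where newlines are significant, and end tags are re-emitted in lowercase. For messy real-world markup, a tree-building parser such as Beautiful Soup is likely the more forgiving choice.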

score 2 · Accepted Answer

You should try BeautifulSoup (bs4), which lets you parse XML tags and pages.

>>> import bs4
>>> my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
>>> soup = bs4.BeautifulSoup(my_string, 'html.parser')  # name the parser explicitly
>>> p = soup.p.contents[0].replace('\n ', '')
>>> print(p)

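This pulls the newline out of the text in the p tag. If the content has more than one tag, you can loop over every tag with find_all(True) and rewrite each tag's direct text children. Note that str.replace returns a new string, so the result has to be written back into the tree with replace_with.

For example:

>>> for tag in soup.find_all(True):
...     # only the text nodes that are direct children of this tag
...     for text in tag.find_all(string=True, recursive=False):
...         text.replace_with(text.replace('\n ', ''))
...
>>> print(soup)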

That said, this may not work exactly the way you want (since web pages vary), but the code can be adapted to your needs.
