python - How can I turn <br> and <p> into line breaks?

Question

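Suppose I have HTML that contains <p> and <br> tags. Afterwards, I am going to strip the HTML to clean up the tags. How can I turn those tags into line breaks?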

I am using Python's BeautifulSoup library, if that helps.

score 14 · Accepted Answer

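Without more specifics, it is hard to be sure this does exactly what you want, but it should give you the idea. It assumes your br tags are contained within p elements.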

from BeautifulSoup import BeautifulSoup
import six

def replace_with_newlines(element):
    # Walk every descendant: keep stripped text nodes, turn each <br> into '\n'.
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, six.string_types):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text

page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""

soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line

Running this results in...

(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt

Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$

score 10 · Accepted Answer

get_text seems to do what you need:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
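
One caveat, as far as I can tell: separator is inserted between every text node, not only at <br> and <p>, so inline tags such as <b> also produce line breaks. Below is a minimal sketch of one way around that, assuming bs4 with the built-in "html.parser" backend. The sample doc and the helper name html_to_text are made up for illustration, and the helper keeps only text that sits inside <p> elements.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input: one <br> and one inline <b> tag.
doc = "<p>Now is the<br>time</p><p>Second <b>bold</b> one</p>"

# separator="\n" goes between ALL text nodes, so the inline <b>
# splits the second paragraph across three lines.
naive = BeautifulSoup(doc, "html.parser").get_text(separator="\n")


def html_to_text(html):
    """Turn each <br> into a newline and put each <p> on its own line."""
    soup = BeautifulSoup(html, "html.parser")
    for br in soup.find_all("br"):
        br.replace_with("\n")  # swap the tag for a literal newline
    # Assumption: only text inside <p> elements matters.
    return "\n".join(p.get_text() for p in soup.find_all("p"))


careful = html_to_text(doc)
print(repr(naive))
print(repr(careful))
```

Here naive breaks the line around "bold", while careful keeps "Second bold one" together and only breaks at the <br> and between the paragraphs.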

score 5 · Accepted Answer

This is a Python 3 version of @Mike Pennington's answer (which was really helpful), with a little refactoring.

from bs4 import BeautifulSoup


def replace_with_newlines(element):
    # Concatenate the stripped text nodes under element, emitting '\n' per <br>.
    text = ''
    for elem in element.descendants:
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text


def get_plain_text(soup):
    # Convert each <p> separately, then join with '\n' so that
    # consecutive paragraphs do not run together.
    body = soup.find("body")
    paragraphs = [replace_with_newlines(p) for p in body.find_all('p')]
    return '\n'.join(paragraphs)

To use it, just pass the BeautifulSoup object to the get_plain_text function.

soup = BeautifulSoup(page, "html.parser")
plain_text = get_plain_text(soup)

score 0 · Accepted Answer

I use the following small library to do this:

https://github.com/TeamHG-Memex/html-text

pip install html-text

It is simple to use:

>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'

score -6 · Accepted Answer

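I'm not entirely sure what you are trying to accomplish, but if you just want to remove the HTML elements, I would use a program like Notepad2 and its Replace All feature. I think you can insert new lines with Replace All as well. Make sure that if you replace an element, you also remove its closing tag. Also, just as an FYI, the proper HTML5 form of the tag is slightly different, but that doesn't really matter. Python wouldn't be my first choice for this, so it is a bit outside my area of knowledge. Sorry I can't help more.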
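
For what it's worth, the Replace All idea can be sketched in Python with the standard re module. This is only a rough illustration, not a robust HTML parser: regular expressions break on comments, scripts, and attributes that contain ">". The function name tags_to_newlines is made up for this example.

```python
import re
from html import unescape


def tags_to_newlines(html):
    """Replace <br> and </p> with newlines, then drop all other tags."""
    # <br>, <br/> and <br /> (any letter case) become a single newline.
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # A closing </p> ends a paragraph, so it becomes a newline as well.
    text = re.sub(r"</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Remove every remaining tag (opening <p>, <b>, </b>, ...).
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and trim leading/trailing whitespace.
    return unescape(text).strip()


print(tags_to_newlines("<p>Now is the<br>time</p><p>pile on <b>debt</b></p>"))
```

Newlines that already exist in the HTML source are kept as they are, so a line break directly after a <br> in the source would show up as an extra blank line.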
