python - Converting HTML to text in Python while mimicking the formatting

Question

I'm learning BeautifulSoup and have found plenty of "html2text" solutions, but the one I'm looking for should mimic the formatting:

<ul>
<li>One</li>
<li>Two</li>
</ul>

would become

* One
* Two

and

Some text
<blockquote>
More magnificent text here
</blockquote>
Final text

to

Some text

    More magnificent text here

Final text

I've been reading the docs, but I don't see anything straightforward. Any help? I'm open to using something other than BeautifulSoup.

score 13 · Accepted Answer

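Take a look at Aaron Swartz's html2text script (it can be installed with pip install html2text). Note that the output is valid Markdown. If for some reason that doesn't fully suit you, some fairly trivial tweaks should get you the exact output from the question: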

In [1]: import html2text

In [2]: h1 = """<ul>
   ...: <li>One</li>
   ...: <li>Two</li>
   ...: </ul>"""

In [3]: print(html2text.html2text(h1))
  * One
  * Two

In [4]: h2 = """<p>Some text
   ...: <blockquote>
   ...: More magnificent text here
   ...: </blockquote>
   ...: Final text</p>"""

In [5]: print(html2text.html2text(h2))
Some text

> More magnificent text here

Final text
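One possible reading of the "trivial tweaks" mentioned above is a small post-processing step on the Markdown text. This is only a sketch: the helper name tweak_markdown and its two rewrite rules are assumptions made for illustration and are not part of html2text.

```python
import re


def tweak_markdown(md):
    """Post-process Markdown text into the plain layout asked for in the question."""
    lines = []
    for line in md.splitlines():
        # "  * One" -> "* One": drop the indentation before a list bullet.
        line = re.sub(r"^\s+(\* )", r"\1", line)
        # "> quoted" -> "    quoted": swap the blockquote marker for four spaces.
        line = re.sub(r"^> ?", "    ", line)
        lines.append(line.rstrip())
    # Trim blank lines at the start and end, but keep leading indentation.
    return "\n".join(lines).strip("\n")
```

It would be applied to the converter's result, for example tweak_markdown(html2text.html2text(h2)). Note that this flattens nested lists, so it only fits simple input like the examples in the question.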

score 5 · Accepted Answer

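I have code for a simpler task: removing the HTML tags and inserting newlines at the appropriate places. Maybe this can be a starting point for you.

Python's textwrap module might be helpful for creating indented blocks of text.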

http://docs.python.org/2/library/textwrap.html
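As a rough illustration of the textwrap suggestion, the sketch below re-wraps a piece of text and indents every line, which is the layout the question wants for a blockquote. The helper name indent_block, the width of 60 columns, and the four-space prefix are assumptions chosen for the example.

```python
import textwrap


def indent_block(text, width=60, prefix="    "):
    """Re-wrap text to `width` columns and indent every line with `prefix`."""
    return textwrap.fill(
        " ".join(text.split()),      # collapse runs of whitespace first
        width=width,                 # the width includes the indentation
        initial_indent=prefix,
        subsequent_indent=prefix,
    )


print(indent_block("More magnificent text here"))
```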

import re
from html import unescape  # Python 3 replacement for HTMLParser.HTMLParser().unescape


class HtmlTool(object):
    """
    Algorithms to process HTML.
    """
    # Regular expressions to recognize different parts of HTML.
    # Internal style sheets or JavaScript
    script_sheet = re.compile(r"<(script|style).*?>.*?(</\1>)",
                              re.IGNORECASE | re.DOTALL)
    # HTML comments - can contain ">"
    comment = re.compile(r"<!--(.*?)-->", re.DOTALL)
    # HTML tags: <any-text>
    tag = re.compile(r"<.*?>", re.DOTALL)
    # Consecutive whitespace characters
    nwhites = re.compile(r"[\s]+")
    # <p>, <div>, <br> tags and associated closing tags
    p_div = re.compile(r"</?(p|div|br).*?>",
                       re.IGNORECASE | re.DOTALL)
    # Consecutive whitespace, but no newlines
    nspace = re.compile(r"[^\S\n]+", re.UNICODE)
    # At least two consecutive newlines
    n2ret = re.compile(r"\n\n+")
    # A newline followed by a space
    retspace = re.compile(r"(\n )")

    @staticmethod
    def to_nice_text(html):
        """Remove all HTML tags, but produce a nicely formatted text."""
        if html is None:
            return ""
        text = str(html)
        text = HtmlTool.script_sheet.sub("", text)
        text = HtmlTool.comment.sub("", text)
        text = HtmlTool.nwhites.sub(" ", text)
        text = HtmlTool.p_div.sub("\n", text)  # convert <p>, <div>, <br> to "\n"
        text = HtmlTool.tag.sub("", text)      # remove all tags
        text = unescape(text)                  # convert HTML entities to Unicode
        # Get whitespace right
        text = HtmlTool.nspace.sub(" ", text)
        text = HtmlTool.retspace.sub("\n", text)
        text = HtmlTool.n2ret.sub("\n\n", text)
        text = text.strip()
        return text

There may be some superfluous regular expressions left in the code.

score 4 · Accepted Answer

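Python's built-in html.parser module (HTMLParser in earlier versions) can easily be extended to create a simple translator that you can tailor to your exact needs. It lets you hook into certain events as the parser walks through the HTML.

Because of its simple nature, you can't navigate the HTML tree the way you can with Beautiful Soup (siblings, children, parents, and so on), but for a simple case like yours it should be enough.

html.parser home page

In your case, you can use it by adding the appropriate formatting whenever a start tag or end tag of a particular type is encountered: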

from html.parser import HTMLParser
from os import linesep

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()  # the strict argument was removed in Python 3.5
    def feed(self, in_html):
        self.output = ""
        super(MyHTMLParser, self).feed(in_html)
        return self.output
    def handle_data(self, data):
        self.output += data.strip()
    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote' :
            self.output += linesep + linesep + '\t'
    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep

parser = MyHTMLParser()
content = "<ul><li>One</li><li>Two</li></ul>"
print(linesep + "Example 1:")
print(parser.feed(content))
content = "Some text<blockquote>More magnificent text here</blockquote>Final text"
print(linesep + "Example 2:")
print(parser.feed(content))
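A variation on the same idea, as a sketch only: it collects the pieces in a list, uses "\n" rather than os.linesep, and indents blockquotes with four spaces so the result matches the layout in the question. The class name PlainTextFormatter and the convert method are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class PlainTextFormatter(HTMLParser):  # hypothetical name
    """Turn <li> into '* ' bullets and <blockquote> into an indented block."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n* ")
        elif tag == "blockquote":
            self.parts.append("\n\n    ")

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data.strip())

    def convert(self, html):
        self.reset()       # discard any state left over from a previous call
        self.parts = []
        self.feed(html)
        self.close()       # flush text still buffered by the parser
        return "".join(self.parts).strip("\n")


formatter = PlainTextFormatter()
print(formatter.convert("<ul><li>One</li><li>Two</li></ul>"))
```

Like the parser above, this strips whitespace around every text node, so inline markup such as <b> inside a sentence would lose its surrounding spaces.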

score 0 · Accepted Answer

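When using samaspin's solution, if the input contains non-English Unicode characters, the parser stops working and just returns an empty string. Initializing the parser on every loop iteration ensures that, even if a parser object gets corrupted, subsequent parses do not return empty strings. As an addition to samaspin's solution, the <br> tag is handled as well. To handle further tags instead of simply stripping them, add them, together with their expected output, to the handle_starttag function.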

from html.parser import HTMLParser
from os import linesep


class MyHTMLParser(HTMLParser):
    """
    This class is used to strip the HTML tags while ensuring the format is
    maintained. Whitespace, newlines, line breaks, etc. are converted from
    HTML tags to their respective counterparts in Python.
    """

    def __init__(self):
        HTMLParser.__init__(self)

    def feed(self, in_html):
        self.output = ""
        super(MyHTMLParser, self).feed(in_html)
        return self.output

    def handle_data(self, data):
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote':
            self.output += linesep + linesep + '\t'
        elif tag == 'br':
            self.output += linesep + '\n'

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep


# Create a fresh parser for each document that is parsed.
parser = MyHTMLParser()
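The "initialize the parser on every loop iteration" advice could look roughly like the sketch below. The names TextCollector and html_to_text are hypothetical, and the choice of a single "\n" per <br> is an assumption made for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical name
    """Collect text nodes and turn <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data.strip())


def html_to_text(fragment):
    parser = TextCollector()   # fresh instance, so no state leaks between inputs
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


fragments = ["Grüße<br>世界", "<p>plain ASCII</p>"]
texts = [html_to_text(fragment) for fragment in fragments]
```

Because each call builds its own parser, a failure on one input cannot affect the results for the inputs that follow.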
