65

I am trying to convert a block of HTML to text using Python.

Input:

<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>

Desired output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa

Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

I tried the html2text module, without much success:

#!/usr/bin/env python

import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

txt = soup.find('div', {'class' : 'body'})

print(html2text.html2text(txt))

The txt object produces the HTML block above. I'd like to convert it to text and print it on the screen.

4

13 Answers

109

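soup.get_text() outputs what you want:

from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not have to guess one (and warn about it).
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

Output: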

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

To keep newlines:

print(soup.get_text('\n'))

To match your example exactly, you can replace each newline with two newlines:

soup.get_text().replace('\n','\n\n')
answered 2013-02-04T20:06:25.120
25

You can use html.parser from the Python standard library:

from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ""
    def handle_data(self, data):
        self.text += data

f = HTMLFilter()
f.feed(data)
print(f.text)
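
The filter above concatenates the text but does not put a blank line between paragraphs, which the desired output in the question has. A minimal sketch of one way to get that with the same standard-library parser, assuming that a closing </p> is what ends a paragraph; the names ParagraphFilter and html_to_paragraphs are made up for this illustration:

```python
from html.parser import HTMLParser


class ParagraphFilter(HTMLParser):
    """Collect text and start a new paragraph at every closing </p> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = []

    def handle_data(self, data):
        self._current.append(data)

    def handle_endtag(self, tag):
        # HTMLParser already lowercases tag names.
        if tag == "p":
            self._flush()

    def _flush(self):
        # Collapse runs of whitespace and skip paragraphs that are empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []


def html_to_paragraphs(html):
    parser = ParagraphFilter()
    parser.feed(html)
    parser.close()    # flush any text the parser is still buffering
    parser._flush()   # keep text that follows the last </p>
    return "\n\n".join(parser.paragraphs)


sample = (
    '<div class="body"><p><strong></strong></p>\n'
    '<p>First <a href="http://example.com/">Some Link</a> para</p>\n'
    '<p>Second para</p></div>'
)
print(html_to_paragraphs(sample))
```

The empty first paragraph is dropped and the link text stays inline, so the two remaining paragraphs come out separated by one blank line.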
answered 2019-04-24T08:03:53.837
6

You can use a regular expression, but it's not recommended. The following code removes all the HTML tags in your data, giving you the text:

import re

data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""

data = re.sub(r'<.*?>', '', data)

print(data)

Output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
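
To illustrate why the regular expression is not recommended, here is a small sketch of inputs where the same pattern gives surprising results. The helper name strip_tags and the three sample strings are made up for this illustration; they are not from the question:

```python
import re

TAG_RE = re.compile(r"<.*?>")


def strip_tags(markup):
    """Apply the same tag-stripping pattern as above (hypothetical helper)."""
    return TAG_RE.sub("", markup)


# A ">" inside an attribute value ends the match too early.
attr_case = strip_tags('<a title="a > b">link</a>')
print(repr(attr_case))       # ' b">link'

# Script contents are not tags, so they stay in the output.
script_case = strip_tags("<script>var x = 1;</script>text")
print(repr(script_case))     # 'var x = 1;text'

# "." does not match a newline by default, so a tag split across lines survives.
multiline_case = strip_tags('<a\nhref="#">link</a>')
print(repr(multiline_case))  # '<a\nhref="#">link'
```

For well-behaved snippets like the one in the question the pattern is fine; the cases above are the kind of thing an HTML parser handles for you.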
answered 2013-02-04T20:02:51.347
4

The '\n' argument places a newline between the paragraphs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")
print(soup.get_text('\n'))
answered 2013-02-04T20:11:08.527
3

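I liked @FrBrGeorge's no-dependency answer so much that I expanded it to extract only the body tag and added a convenience method, so that HTML to text is a single line: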

from abc import ABC
from html.parser import HTMLParser


class HTMLFilter(HTMLParser, ABC):
    """
    A simple no dependency HTML -> TEXT converter.
    Usage:
          str_output = HTMLFilter.convert_html_to_text(html_input)
    """
    def __init__(self, *args, **kwargs):
        self.text = ''
        self.in_body = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag: str, attrs):
        if tag.lower() == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag.lower() == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        f = cls()
        f.feed(html)
        return f.text.strip()           

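See the docstring for usage.

This converts all the text inside the body, which in theory could include style and script tags. Further filtering could be achieved by extending the pattern shown for body, i.e. setting instance variables in_style or in_script.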
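
A rough sketch of that extension, assuming a single counter is enough to cover both style and script (instead of two separate flags). The class name BodyTextFilter, the skip_depth attribute and the sample document are made up for this illustration:

```python
from html.parser import HTMLParser


class BodyTextFilter(HTMLParser):
    """Collect text inside <body>, ignoring <script> and <style> contents (hypothetical variant)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text = ""
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.text += data


sample = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Hello </p><script>var x = 1;</script><style>b{}</style><p>world</p>"
    "</body></html>"
)
parser = BodyTextFilter()
parser.feed(sample)
parser.close()
body_text = parser.text.strip()
print(body_text)  # Hello world
```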

answered 2020-06-03T18:45:35.420
3

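The main problem is how to keep some basic formatting. Here is my own minimal approach to keep new lines and bullets. I'm sure it's not the solution to everything you want to keep, but it's a starting point: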

from bs4 import BeautifulSoup

def parse_html(html):
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name in ['br',  'p', 'h1', 'h2', 'h3', 'h4','tr', 'th']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text


The above adds a new line for the elements 'br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th', and a new line followed by '- ' for li elements.

answered 2021-03-18T11:57:55.650
2

There are some nice things here, and I might as well throw in my solution:

from html.parser import HTMLParser
def _handle_data(self, data):
    self.text += data + '\n'

HTMLParser.handle_data = _handle_data

def get_html_text(html: str):
    parser = HTMLParser()
    parser.text = ''
    parser.feed(html)

    return parser.text.strip()
answered 2020-09-15T09:50:36.083
1

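I needed a way to do this on a client's system without downloading additional libraries. I never found a good solution, so I created my own. Feel free to use it if you like.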

import urllib 

def html2text(strText):
    str1 = strText
    int2 = str1.lower().find("<body")
    if int2>0:
       str1 = str1[int2:]
    int2 = str1.lower().find("</body>")
    if int2>0:
       str1 = str1[:int2]
    list1 = ['<br>',  '<tr',  '<td', '</p>', 'span>', 'li>', '</h', 'div>' ]
    list2 = [chr(13), chr(13), chr(9), chr(13), chr(13),  chr(13), chr(13), chr(13)]
    bolFlag1 = True
    bolFlag2 = True
    strReturn = ""
    for int1 in range(len(str1)):
      str2 = str1[int1]
      for int2 in range(len(list1)):
        if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
           strReturn = strReturn + list2[int2]
      if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
         bolFlag1 = False
      if str1[int1:int1+6].lower() == '<style':
         bolFlag1 = False
      if str1[int1:int1+7].lower() == '</style':
         bolFlag1 = True
      if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
         bolFlag1 = True
      if str2 == '<':
         bolFlag2 = False
      if bolFlag1 and bolFlag2 and (ord(str2) != 10) :
        strReturn = strReturn + str2
      if str2 == '>':
         bolFlag2 = True
      if bolFlag1 and bolFlag2:
        strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
    strReturn = strReturn.replace(chr(13), '\n')
    return strReturn


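from urllib.request import urlopen  # Python 3 location of urlopen
url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
# Decode the response bytes before parsing; UTF-8 is assumed here.
html = urlopen(url).read().decode("utf-8", errors="replace")
print(html2text(html))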
answered 2014-09-25T20:47:56.077
1

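BeautifulSoup can be used to remove unwanted scripts and the like, though you may need to experiment with a few different sites to make sure you have covered the different kinds of content you want to exclude. Try this: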

from requests import get
from bs4 import BeautifulSoup as BS
response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")
for child in soup.body.children:
   if child.name == 'script':
       child.decompose() 
print(soup.body.get_text())
answered 2017-12-12T22:58:43.577
1

gazpacho might be a good choice for this!

Input:

from gazpacho import Soup

html = """\
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
"""

Output:

text = Soup(html).strip(whitespace=False) # to keep "\n" characters intact
print(text)
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
answered 2020-10-09T20:38:16.607
0

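A two-step, lxml-based approach that sanitizes the markup before converting it to plain text.

The script accepts either a path to an HTML file or piped stdin.

It removes script blocks and any other text that is probably unwanted. You can configure the lxml Cleaner instance to suit your needs.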

#!/usr/bin/env python3

import sys
from lxml import html
from lxml.html import tostring
from lxml.html.clean import Cleaner


def sanitize(dirty_html):
    cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )

    return cleaner.clean_html(dirty_html)


if len(sys.argv) > 1:
  fin = open(sys.argv[1], encoding='utf-8')
else:
  fin = sys.stdin

source = fin.read()
source = sanitize(source)
source = source.replace('<br>', '\n')

tree = html.fromstring(source)
plain = tostring(tree, method='text', encoding='utf-8')

print(plain.decode('utf-8'))
answered 2021-10-25T13:48:46.503
0

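I personally like emehex's Gazpacho solution, but it only uses a regular expression to filter out the tags. There is no more magic to it. This means that the solution keeps the text inside <style> and <script>.

So I would rather implement a simple solution based on regular expressions and use the standard Python 3.4 library to unescape HTML entities: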

import re
from html import unescape

def html_to_text(html):

    # remove scripts and styles, including their contents (non-greedy match)
    text = re.sub("<script.*?</script>", "", html, flags=re.DOTALL)
    text = re.sub("<style.*?</style>", "", text, flags=re.DOTALL)

    # remove other tags
    text = re.sub("<[^>]+>", " ", text)

    # strip whitespace
    text = " ".join(text.split())

    # unescape html entities
    text = unescape(text)

    return text

Of course, this is not as error-proof as BeautifulSoup or other parser-based solutions. But you don't need any third-party package.
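
One concrete gap: the script and style patterns are case-sensitive, so an upper-case <SCRIPT> block slips through. A small variant sketch, assuming case-insensitive matching is wanted; the function name html_to_text_ci and the sample string are made up for this illustration:

```python
import re
from html import unescape


def html_to_text_ci(html):
    """Same steps as html_to_text, but matching tag names case-insensitively (hypothetical variant)."""
    flags = re.DOTALL | re.IGNORECASE

    # remove scripts and styles together with their contents
    text = re.sub(r"<script.*?</script>", "", html, flags=flags)
    text = re.sub(r"<style.*?</style>", "", text, flags=flags)

    # replace the remaining tags with a space so adjacent words do not merge
    text = re.sub(r"<[^>]+>", " ", text)

    # collapse whitespace, then turn entities such as &amp; into characters
    return unescape(" ".join(text.split()))


sample = "<P>Fish &amp; chips</P><SCRIPT>alert(1)</SCRIPT><p>caf&eacute;</p>"
result = html_to_text_ci(sample)
print(result)  # Fish & chips café
```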

answered 2021-10-29T11:39:35.390
0
from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ''
    def handle_data(self, data):
        self.text += f'{data}\n'

def html2text(html):
    filter = HTMLFilter()
    filter.feed(html)

    return filter.text

content = html2text(content_temp)
answered 2022-01-18T08:02:13.380