python - Python strategy for extracting text from malformed html pages

Question

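I'm trying to extract text from arbitrary html pages. Some of the pages (which I have no control over) have malformed html or scripts, which makes this difficult. Also, I'm on a shared hosting environment, so I can install any Python library, but I can't install arbitrary software on the server.

pyparsing and html2text.py also did not seem to work for malformed html pages.

An example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html

My current implementation is roughly the following: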

# Try using BeautifulSoup 3.0.7a (Python 2); s holds the raw html
import BeautifulSoup
from BeautifulSoup import Comment

soup = BeautifulSoup.BeautifulSoup(s)
# Remove HTML comments
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
# Remove <script> elements
for script in soup.findAll('script'):
    script.extract()
# Collect the remaining text nodes under <body>
body = soup.body(text=True)
text = ''.join(body)
# If BeautifulSoup can't handle it, alter the html: find the first instance
# of "<body" and replace everything before it with "<html><head></head>",
# then try BeautifulSoup again with the new html.
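
The fallback described in the comments above could be sketched roughly as follows. This is a minimal illustration in Python 3, not the code I actually run; the function name trim_to_body is hypothetical, and the case-insensitive search for "<body" is an assumption about how the first instance should be located.

```python
import re


def trim_to_body(html):
    """Replace everything before the first "<body" with a minimal header.

    Returns the input unchanged when no "<body" is found.
    """
    match = re.search(r"<body", html, re.IGNORECASE)
    if match is None:
        return html
    # Keep the original <body ...> tag and everything after it.
    return "<html><head></head>" + html[match.start():]
```

The result would then be handed to the parser a second time, for example trim_to_body(s) in place of s.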

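If BeautifulSoup still doesn't work, I fall back to a heuristic: I look at the first and last characters of each line to see whether it looks like a line of code (# < ;), then take a sample of the line and check whether the tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.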
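
A rough Python 3 sketch of that heuristic is below. Several details are assumptions made for illustration: the name looks_like_code, the sample size, the 0.5 threshold, and above all the stand-in for "English word", which here just means a purely alphabetic token (optionally with an apostrophe) rather than a real dictionary lookup.

```python
import re

# Characters that suggest a code line when they start or end it.
CODE_EDGE_CHARS = set("#<;")

# Stand-in for "English word or number": letters only (with an optional
# apostrophe suffix such as "don't"), or digits with . or , separators.
WORD_OR_NUMBER = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:[.,]\d+)*")


def looks_like_code(line, sample_size=20, min_ratio=0.5):
    """Guess whether a line is code rather than readable text."""
    stripped = line.strip()
    if not stripped:
        return False
    # Step 1: check the first and last characters.
    if stripped[0] in CODE_EDGE_CHARS or stripped[-1] in CODE_EDGE_CHARS:
        return True
    # Step 2: sample some tokens and trim surrounding punctuation.
    tokens = [t.strip(".,;:!?\"'()") for t in stripped.split()[:sample_size]]
    tokens = [t for t in tokens if t]
    if not tokens:
        return True
    # Step 3: too few word-like or number-like tokens means "code".
    good = sum(1 for t in tokens if WORD_OR_NUMBER.fullmatch(t))
    return good / len(tokens) < min_ratio
```

For instance, "var x = 1;" is flagged because it ends in a semicolon, while an ordinary sentence of words and numbers is kept.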

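I could use machine learning to inspect each line, but that seems a little expensive, and I would probably have to train it (I don't know much about unsupervised learning) and, of course, write it as well.

Any advice, tools, or strategies would be most welcome. I also realize that the latter part is rather messy: if a line is determined to contain code, I currently throw away the entire line, even if it contains a small amount of actual English text.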

score 5 · Accepted Answer

Try not to laugh, but:

from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        # Pipe the UTF-8 encoded html to lynx and read the rendered text back
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                      stdin=PIPE,
                      stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')

I hope you've got lynx!
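
Whether lynx is available can be checked before relying on it. The following Python 3 sketch uses the standard library's shutil.which; the helper name find_lynx and the idea of trying a tuple of candidate names are assumptions for illustration only.

```python
import shutil


def find_lynx(candidates=("lynx",)):
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If this returns a path, it could be passed as the lynx argument of TextFormatter; if it returns None, another strategy is needed.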

score 0 · Accepted Answer

Well, it depends on how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did

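# Imports assumed here: BeautifulSoup 3 and html2text.py (Python 2)
from BeautifulSoup import BeautifulSoup
from html2text import html2text

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join( unicode( tag ) for tag in BeautifulSoup( oldhtml ).body.contents ))
# use html2text to turn it into text
text = html2text( newhtml )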

and it worked out, but of course the documents could be so bad that even BS can't salvage much.

score 0 · Accepted Answer

BeautifulSoup will do badly with malformed HTML. What about some regex-fu?

>>> import re
>>> 
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>> 
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'

Then you can assemble a list of valid tags from which you want to extract information.
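
One way the tag list could be put to use is sketched below in Python 3. It is only an illustration: the function name extract_tag_text and the default tag list are made up, and it swaps the lookbehind pattern above for a non-greedy capture group so that several elements in one page are matched separately. Like any regex approach, it will miss elements whose closing tag is absent.

```python
import re


def extract_tag_text(html, tags=("p", "h1", "h2", "li")):
    """Return the text inside each listed tag, in document order."""
    alternation = "|".join(re.escape(tag) for tag in tags)
    # <tag ...> inner </tag>, non-greedy so each element matches on its own.
    pattern = re.compile(
        r"<(%s)\b[^>]*>(.*?)</\1\s*>" % alternation,
        re.DOTALL | re.IGNORECASE,
    )
    chunks = []
    for match in pattern.finditer(html):
        # Drop any nested tags, then collapse runs of whitespace.
        inner = re.sub(r"<[^>]+>", "", match.group(2))
        chunks.append(" ".join(inner.split()))
    return chunks
```

Content in tags that are not listed, such as script, is simply never matched.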
