python - Extracting news article content from stored .html pages

Question

I am reading text from html files and doing some analysis. These .html files are news articles.

Code:

 html = open(filepath,'r').read()
 raw = nltk.clean_html(html)  
 raw.unidecode(item.decode('utf8'))

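Now I want only the article content and not the rest of the text such as advertisements and headings. How can I do this relatively accurately in python?

I know of some tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.

I am looking for something exactly like http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf in python.

EDIT: To make this clearer, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general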

score 16 · Accepted Answer

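Newspaper is becoming increasingly popular. I have only used it superficially, but it looks good. It is Python 3 only.

The quickstart only shows loading from a URL, but you can load from an HTML string with: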

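import newspaper

# Load the HTML into a string from a file
# (`filepath` is the stored .html file from the question; UTF-8 is an assumption)
with open(filepath, encoding='utf-8') as f:
    html = f.read()

article = newspaper.Article('')  # a string is required as the `url` argument but is not used
article.set_html(html)
article.parse()      # run the extraction on the supplied HTML
text = article.text  # main body text of the article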

score 12 · Accepted Answer

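There are libraries for this in Python too :)

Since you mentioned Java, there is a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe

If you want to use pure Python libraries, there are 2 options:

https://github.com/buriy/python-readability

and

https://github.com/grangier/python-goose

Of the two, I prefer Goose. Be aware, however, that its recent versions sometimes fail to extract the text for some reason (my recommendation is to use version 1.0.22 for now).

EDIT: here is sample code using Goose: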

from goose import Goose
from requests import get

response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text

score 1 · Accepted Answer

Try something like this, accessing the page directly:

## Import modules
from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

## Grab the page
url = 'http://www.example.com'
with urlopen(url) as page:
    content = page.read()

## Prepare
soup = BeautifulSoup(content, 'html.parser')

## Parse (a table, for example)
for table in soup.find_all('table', {'class': 'myClass'}):
    # ...do something with each matching table; printing its text is just a placeholder
    print(table.get_text())

If you want to load a file instead, just replace the part that grabs the page with reading the file. Find out more here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
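For the stored-file case, something along these lines might work. It is only a sketch: the function names are made up for this example, UTF-8 is an assumed encoding, and collecting every <p> element is a crude heuristic that will still pick up some non-article paragraphs on many sites.

```python
from bs4 import BeautifulSoup


def paragraphs_from_html(html):
    """Return the text of every <p> element in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Scripts and stylesheets never hold article text, so drop them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]


def paragraphs_from_file(path, encoding="utf-8"):
    """Read a stored .html file (encoding is an assumption) and extract its paragraphs."""
    with open(path, encoding=encoding, errors="replace") as f:
        return paragraphs_from_html(f.read())
```

Calling paragraphs_from_file(filepath) takes the place of the "Grab the page" step; the rest of the parsing stays the same.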

score 1 · Accepted Answer

There are many ways to organize HTML scraping in Python. As other answers say, the #1 tool is BeautifulSoup, but there are others:

Here are some useful resources:

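There is no universal way to find the content of an article. HTML5 has the article tag, which hints at the main text, and it may be possible to tune the scraping for pages that come from a specific publishing system, but there is no general way to accurately guess where the text is. (In theory, a machine could infer the page structure by looking at several articles that share the same structure but differ in content, but that is probably out of scope here.)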
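To illustrate the article tag hint: a minimal sketch, assuming the page actually wraps its body in an HTML5 <article> element (many pages do not). It uses only the Python 3 standard library, and the names ArticleTagExtractor and extract_article_text are invented for this example.

```python
from html.parser import HTMLParser


class ArticleTagExtractor(HTMLParser):
    """Collect text that appears inside <article> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while inside an <article> element
        self.skip_depth = 0     # > 0 while inside <script> or <style>
        self.found = False      # True once any <article> tag has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
            self.found = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.article_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_article_text(html):
    """Return the text inside <article>, or None if the page has no such tag."""
    parser = ArticleTagExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks) if parser.found else None
```

Returning None when no <article> tag exists lets the caller fall back to another method, which fits the point that no single approach works for every source.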

Web scraping with Python may also be relevant.

Pyquery example for the NYT:

from pyquery import PyQuery as pq
url = 'http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general'
d = pq(url=url)
text = d('.story-content').text()

score 0 · Accepted Answer

You can use htmllib (Python 2 only) or HTMLParser (html.parser in Python 3) to parse your html files.

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Code sample taken from the HTMLParser documentation page.
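The handlers above only print what the parser sees. If the goal is the text itself, a sketch along these lines gathers the data chunks and skips the contents of script and style elements. It assumes Python 3 and the standard library only; TextCollector and html_to_text are names invented here. Note that it strips markup from the whole page, so ads and navigation text are still included.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather text chunks, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.hidden = 0  # nesting depth inside <script>/<style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden:
            self.hidden -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.hidden:
            self.parts.append(text)


def html_to_text(html):
    """Return the page text with all markup removed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)
```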
