html - How to remove unwanted classes and tags from a newspaper3k object?

Question

I want to extract the content of news articles, and I am currently using the newspaper3k library:

from newspaper import Article

# url is the address of the news article to extract
a = Article(url, memoize_articles=False, language='en')
a.download()
a.parse()
content = a.text

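For some websites, however, the result contains unwanted elements such as advertisements and text that belongs to images. I would like to remove those elements and their text. Is there a way to strip everything inside specific tags and classes?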

score 1 · Accepted Answer

If you want to do this for one specific website, you can take a.top_node, work out an XPath or CSS selector that matches the ads, and remove the matching elements.

ads = a.top_node.xpath("./foo")  # find a proper selector
for ad in ads:
    ad.getparent().remove(ad)

# and now convert top_node to text again somehow, probably using
# OutputFormatter

See https://github.com/codelucas/newspaper/blob/56de65af9efbfea6293c82c0b1821e2ca9fbddaa/newspaper/article.py#L281
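The removal step above can be sketched a little more completely. This is a minimal sketch under stated assumptions, not newspaper3k's own API: `class_xpath`, `drop_nodes`, and `extract_clean_text` are hypothetical helper names, and the class names passed in are placeholders for whatever the target site actually uses. It assumes `top_node` is an lxml element and that `OutputFormatter.get_formatted()` returns a `(text, html)` pair, which is how `parse()` appears to use it in the linked source. One lxml detail is worth handling: removing an element also discards its tail text (the text that follows it inside the parent), so the sketch reattaches the tail before removing the node.

```python
def class_xpath(class_names):
    """Build a relative XPath matching descendants that carry any given class."""
    tests = [
        "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(name)
        for name in class_names
    ]
    return ".//*[" + " or ".join(tests) + "]"


def drop_nodes(top_node, xpath):
    """Remove every node matched by xpath under top_node; return how many."""
    removed = 0
    for node in top_node.xpath(xpath):
        parent = node.getparent()
        if parent is None:
            # Only a root element has no parent; it cannot be removed this way.
            continue
        if node.tail:
            # lxml drops the tail together with the element, so move it to the
            # previous sibling (or to the parent's text) before removing.
            previous = node.getprevious()
            if previous is not None:
                previous.tail = (previous.tail or "") + node.tail
            else:
                parent.text = (parent.text or "") + node.tail
        parent.remove(node)
        removed += 1
    return removed


def extract_clean_text(url, unwanted_classes):
    """Download an article, strip unwanted nodes, and render the text again."""
    # Imported lazily so the two helpers above work without newspaper3k.
    from newspaper import Article
    from newspaper.outputformatters import OutputFormatter

    article = Article(url, language="en")
    article.download()
    article.parse()
    if article.top_node is None:
        # No main content node was found; fall back to the parsed text.
        return article.text
    drop_nodes(article.top_node, class_xpath(unwanted_classes))
    # Assumption: get_formatted() returns (text, html), as in the linked parse().
    text, _html = OutputFormatter(article.config).get_formatted(article.top_node)
    return text
```

A call such as `extract_clean_text(url, ["ad", "caption"])` would then return the article text with elements of those (placeholder) classes stripped. The class test uses the `concat(' ', normalize-space(@class), ' ')` idiom so that `ad` does not also match a class like `header`.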

You can also implement a custom DocumentCleaner and put this logic there.

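In general this is a hard problem, possibly the hardest one in article extraction, if you want to solve it in a generic and robust way without writing and maintaining rules for each website. Open source libraries can usually find the main content with reasonable quality, but they are quite bad at excluding extra content from the article body. See https://github.com/scrapinghub/article-extraction-benchmark and the report at https://github.com/scrapinghub/article-extraction-benchmark/releases/download/v1.0.0/paper-v1.0.0.pdf.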

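Commercial tools such as Scrapinghub's AutoExtract (I work there) solve this problem; they use computer vision and machine learning, because it is hard to solve reliably otherwise.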
