python - Shortcomings of Newspaper3k: how to scrape only the article HTML?

Question

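Hello, and thank you very much for your help,

I have been using Python and Newspaper3k to scrape websites, but I have noticed that some of its functions are... well... non-functional. In particular, I can only get the article HTML for about 1 in 10 sites, or even fewer. Here is my code: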

from newspaper import Article
url = 'pageurl.com'  # placeholder; use the full article URL here
article = Article(url, keep_article_html=True, language='en')
article.download()
article.parse()
print(article.title + "\n" + article.article_html)

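What happens is that, in my experience, the article title is scraped 100% of the time, but the article HTML is almost never scraped successfully and nothing is returned. I know Newspaper3k is based on BeautifulSoup, so I don't expect that to work either, and I'm a bit stuck. Any ideas?

Edit: most of the sites I am trying to scrape are in Spanish.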

score 1 · Accepted Answer

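I found that scraping wellness-spain.com with BeautifulSoup was not much of a problem. The site does not use much JavaScript. Heavy JavaScript can cause problems for HTML parsers such as BeautifulSoup, so before scraping a site it is worth turning JavaScript off in your browser to see what output you actually get.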
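
As a rough way to run the same check from code, the sketch below tests whether a phrase you can see in the browser is also present in the static HTML once scripts are ignored. The function name phrase_in_static_html and both sample pages are made up for illustration; with a real site you would pass in the text of the downloaded response instead.

```python
from bs4 import BeautifulSoup


def phrase_in_static_html(html, phrase):
    """Return True if phrase appears in the page text once <script>/<style> are removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so only markup-level text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return phrase.lower() in soup.get_text(" ").lower()


# Hypothetical pages: one serves the article in the markup, one builds it with JavaScript.
static_page = "<html><body><article><p>El estrés produce acidez.</p></article></body></html>"
js_page = "<html><body><div id='root'></div><script>render('El estrés produce acidez.')</script></body></html>"

found_in_static = phrase_in_static_html(static_page, "produce acidez")
found_in_js = phrase_in_static_html(js_page, "produce acidez")
print(found_in_static, found_in_js)
```

If the phrase is missing from the static HTML, the content is probably rendered by JavaScript, and neither BeautifulSoup nor Newspaper3k will see it in the downloaded page.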

You did not specify which data you need from the site, so I made an educated guess.

Code example

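import requests
from bs4 import BeautifulSoup

url = 'http://www.wellness-spain.com/-/estres-produce-acidez-en-el-organismo-principal-causa-de-enfermedades#:~:text=Con%20respecto%20al%20factor%20emocional,produce%20acidez%20en%20el%20organismo'
# Fetch the page and parse the decoded response body.
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# Headline: the <span> directly inside <h1 class="header-title">.
title = soup.select_one('h1.header-title > span').get_text().strip()
# Subtitle: the <h2> directly inside <div class="journal-content-article">.
sub_title = soup.select_one('div.journal-content-article > h2').get_text()
# Author: the text after the colon in the <p> directly inside <div class="author">.
author = soup.select_one('div.author > p').get_text().split(':')[1].strip()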

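Explanation of the code

We use the get method of requests to obtain the HTTP response. BeautifulSoup needs that response's .text. You will often see html.content used instead, but that is the raw bytes of the response, so we pass the decoded string here. 'html.parser' simply names the parser BeautifulSoup uses to parse the HTML.

We then use CSS selectors to pick out the data you want. For the title variable we use select_one, which returns only the first matching element, because a CSS selector can sometimes match a whole list of HTML tags. If you are not familiar with CSS selectors, it is worth reading up on them.

Essentially, in the title variable we specify the HTML tag, and the . denotes a class name, so h1.header-title selects the h1 tag with the class header-title. The > restricts the match to direct children of that h1; in this case we want the span element that is a child of the h1.

Also in the title variable, the get_text() method extracts the text from the HTML tag. We then use the string strip method to remove surrounding whitespace.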
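
To see the selector rules in isolation, here is a small sketch that runs against an inline HTML snippet instead of the live site. The markup is invented to mimic the structure the selectors expect; it is not copied from wellness-spain.com.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the selectors expect.
sample_html = """
<h1 class="header-title">
  <span>  Example headline  </span>
</h1>
<div class="journal-content-article">
  <h2>Example subtitle</h2>
  <div><h2>Nested heading, not a direct child</h2></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "h1.header-title > span": a span that is a direct child of <h1 class="header-title">.
title = soup.select_one("h1.header-title > span").get_text().strip()

# select() returns every match; select_one() returns only the first one (or None).
direct_h2 = soup.select("div.journal-content-article > h2")
all_h2 = soup.select("div.journal-content-article h2")

print(title)
print(len(direct_h2), len(all_h2))
```

The > combinator matches only the first h2, while the plain descendant selector (a space) also picks up the nested one. Note that select_one returns None when nothing matches, so calling get_text() on the result raises an AttributeError if the page layout differs from what you expected.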

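The sub_title variable works the same way: we select the div element with the class journal-content-article, take its direct child h2 tag and get its text.

For the author variable, we select the div with the class author and take its direct child p tag. We grab the text, but the underlying text is autor: NAME. The string split method turns it into a list of two elements, autor and NAME; I take the second element and then call the string strip method to remove any whitespace.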
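
The split step can be tried on its own with a plain string. The sketch below also shows a slightly more defensive variant, under the assumption that some pages might omit the autor: label; the helper name extract_author is hypothetical.

```python
def extract_author(raw_text):
    """Return the part after the first colon, or the whole text if there is no colon."""
    label, separator, name = raw_text.partition(":")
    return name.strip() if separator else raw_text.strip()


# The approach from the answer: split on ":" and take the second element.
original_style = "autor: NAME".split(":")[1].strip()

# The defensive variant gives the same result, and does not fail without a colon.
safer_style = extract_author("autor: NAME")
no_colon = extract_author("  NAME  ")

print(original_style, safer_style, no_colon)
```

With split(':')[1], a string that contains no colon raises an IndexError, which is why partition can be more forgiving when scraping many pages.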

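If you run into trouble scraping a particular website, it is best to ask a new question showing the code you have tried and what specific data you need, being as explicit as possible. Including the URL helps us guide you in getting your scraper working.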

score 0 · Accepted Answer

You need to use the Config() class to extract the article HTML. Here is the complete code to do this.

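import lxml.html  # importing the submodule explicitly; a bare "import lxml" does not guarantee lxml.html is loaded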
from newspaper import Article, Config


def extract_article_html(url):
    config = Config()
    config.fetch_images = True
    config.request_timeout = 30
    config.keep_article_html = True
    article = Article(url, config=config)

    article.download()
    article.parse()

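    article_html = article.article_html
    if not article_html:
        # Newspaper3k found no article body; lxml raises on an empty string.
        return ''

    # Parse the extracted fragment and drop every class attribute.
    html = lxml.html.fromstring(article_html)
    for tag in html.xpath('//*[@class]'):
        tag.attrib.pop('class')

    return lxml.html.tostring(html).decode('utf-8')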


url = 'https://www.stackoverflow.com'
print(url, extract_article_html(url))
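
The class-stripping step can be checked without downloading anything. The following is a minimal sketch on an inline fragment; the helper name strip_class_attributes and the sample markup are made up for illustration, and the empty-input guard is an added assumption about how you might want to handle pages where no article HTML was extracted.

```python
import lxml.html


def strip_class_attributes(fragment):
    """Return the HTML fragment with all class attributes removed ('' for empty input)."""
    if not fragment or not fragment.strip():
        # lxml.html.fromstring raises on an empty document, so bail out early.
        return ""
    root = lxml.html.fromstring(fragment)
    for element in root.xpath("//*[@class]"):
        del element.attrib["class"]
    # encoding="unicode" makes tostring return str instead of bytes.
    return lxml.html.tostring(root, encoding="unicode")


# Hypothetical stand-in for article.article_html.
sample = '<div class="article"><p class="lead">Hola</p><p>Mundo</p></div>'
cleaned = strip_class_attributes(sample)
print(cleaned)
```

Only the class attributes are removed; the tags and their text are left as they were.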
