python - Which articles does the Python newspaper package return?

Question

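My basic question is: how does the newspaper package in Python determine which URLs/articles it returns? One would expect it to simply return every article link found at the URL you supply, but it does not seem to work that way. For example, if you use "http://www.cnn.com" and "https://www.cnn.com/politics", you get exactly the same articles back. I would expect the latter to return only the articles on the politics page, but that does not appear to be the case.

So what is it actually doing? Is it just fetching all the articles from the home page?

Here is the example I used to test this (with Python 3.6.2):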

import newspaper

#Build newspaper on cnn homepage
url = "http://www.cnn.com"
paper = newspaper.build(url, memoize_articles=False)
article_list = []
for article in paper.articles:
    article_list.append(article.url)

#Build newspaper on cnn politics page
url = "https://www.cnn.com/politics"
paper = newspaper.build(url, memoize_articles=False)
article_list_2 = []
for article in paper.articles:
    article_list_2.append(article.url)

# Print the total number of URLs returned for each build
print(len(article_list))
print(len(article_list_2))
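
Equal counts alone do not show that the two lists hold the same articles. A minimal sketch of a stricter check, assuming two lists of URL strings such as article_list and article_list_2 above; the helper name compare_url_lists and the sample URLs are hypothetical and only stand in for real results:

```python
def compare_url_lists(first, second):
    """Return (shared, only_first, only_second) as sorted lists of URLs."""
    first_set, second_set = set(first), set(second)
    shared = sorted(first_set & second_set)
    only_first = sorted(first_set - second_set)
    only_second = sorted(second_set - first_set)
    return shared, only_first, only_second


# Hypothetical sample data; in practice pass article_list and article_list_2.
sample_home = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
sample_politics = [
    "https://example.com/b",
    "https://example.com/c",
    "https://example.com/d",
]

shared, only_home, only_politics = compare_url_lists(sample_home, sample_politics)
print(len(shared), len(only_home), len(only_politics))
```

If both "only" lists come back empty for the real data, the two builds returned the same set of URLs.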

score 2 · Accepted Answer

The Python newspaper package, which is used for article scraping and curation, returns only the articles from the home page.

import newspaper
news_paper = newspaper.build('http://nypost.com', memoize_articles=False)
print(news_paper.size())
for article in news_paper.articles:
    print(article.url)

It prints the URLs of all the articles on the home page. I also tested it with CNN, 'https://edition.cnn.com'.
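
If only one section's articles are wanted, one possible workaround is to filter the returned URLs yourself. The sketch below assumes the section name appears as a segment of the article URL path, which may not hold for every site; the helper name filter_by_section and the sample URLs are hypothetical, and the standard-library urllib.parse module does the parsing:

```python
from urllib.parse import urlparse


def filter_by_section(urls, section):
    """Keep URLs whose path has `section` as one of its segments."""
    wanted = section.strip("/").lower()
    kept = []
    for url in urls:
        # Split the path into non-empty, lower-cased segments.
        segments = [s.lower() for s in urlparse(url).path.split("/") if s]
        if wanted in segments:
            kept.append(url)
    return kept


# Hypothetical sample data; in practice pass [a.url for a in news_paper.articles].
sample_urls = [
    "https://example.com/2017/10/01/politics/story-one/index.html",
    "https://example.com/2017/10/01/sport/story-two/index.html",
    "https://example.com/politics",
]

politics_urls = filter_by_section(sample_urls, "politics")
print(politics_urls)
```

Matching whole path segments rather than substrings avoids false hits such as a path containing "geopolitics".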
