timestamp - Python：看到报纸3k 提供的文章的时间戳？

Question

当我做

import newspaper
cnn_paper = newspaper.build(news_source_url, memoize_articles=False)
for article in cnn_paper.articles:
    print(article.url)
exit()

我获得了可以使用该软件包从news_source_url（例如，）下载的文章的 URL 列表。有没有办法获取各种文章的时间戳？'http://cnn.com'newspaper3k

特别是对于 CNN，日期似乎编码在许多文章的 URL 中，但我想获取任何新闻来源的文章时间戳。如果可能的话，我想同时获得日期和时间。

score 1 · Accepted Answer

您可以使用以下代码使用Newspaper提取文章的发布日期。我重新格式化了日期输出，因为它们有 00:00:00 时间戳。

import newspaper
from datetime import datetime

cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for item in cnn_paper.articles:
  article = newspaper.Article(item.url)
  article.download()
  article.parse()
  if article.url and article.publish_date is not None:
    print(article.url)
    publish_date = datetime.strptime(str(article.publish_date), '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d')
    print(publish_date)

如果您需要文章的确切发布日期和时间戳，那么您需要从文章的 URL 中获取这些日期。查看Newspaper的代码后，我发现了一个元标记提取器。

import newspaper

cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for item in cnn_paper.articles:
   article = newspaper.Article(item.url)
   article.download()
   article.parse()
   if article.url and article.publish_date is not None:
     article_meta_data = article.meta_data
     article_published_date = sorted({value for (key, value) in article_meta_data.items() if key == 'pubdate'})
     if article_published_date:
        print(article_published_date)
     else:
        print('no published date provided')

timestamp - Python：看到报纸3k 提供的文章的时间戳？

1 回答 1

Related

Reference