
I'm working with this URL: https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555

My goal is to get the complete list of news articles.

The page's HTML contains the news URLs, for example:

 <div class="cnn-search__result-thumbnail">
 <a href="https://www.cnn.com/2018/03/27/asia/north-korea-kim-jong-un-china-visit/index.html">
   <img src="./Search CNN - Videos, Pictures, and News - CNN.com_files/180328104116china-xi-kim-story-body.jpg">
   </a>
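
For reference, here is a minimal sketch of pulling the article links out of markup shaped like the fragment above, using only the standard library. It assumes the HTML you have in hand (for example a saved copy of the page) actually contains the `cnn-search__result-thumbnail` blocks; the live page may fill them in with JavaScript, in which case a plain HTTP fetch would not see them. The names `ThumbnailLinkParser` and `extract_result_links` are made up for this example.

```python
from html.parser import HTMLParser


class ThumbnailLinkParser(HTMLParser):
    """Collect hrefs of <a> tags found inside cnn-search__result-thumbnail divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside a thumbnail block (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter a thumbnail block, or track a div nested inside one.
            if self.depth or "cnn-search__result-thumbnail" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_result_links(html):
    """Return the article URLs from every thumbnail block in `html`."""
    parser = ThumbnailLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_result_links(saved_html)` on a saved copy of a results page should return one URL per thumbnail block, in document order.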

But I can't get the list of news URLs. The links returned for

https://edition.cnn.com/search/?q=%20news&size=10&from=5550&page=556

and

https://edition.cnn.com/search/?q=%20news&size=10&from=5560&page=557

are the same.

My source code:

import sys
import multiprocessing

import newspaper


def freeze_support():
    '''
    Check whether this is a fake forked process in a frozen executable.
    If so then run code specified by commandline and exit.
    '''
    if sys.platform == 'win32' and getattr(sys, 'frozen', False):
        # multiprocessing.forking no longer exists in Python 3
        multiprocessing.freeze_support()


if __name__ == '__main__':
    freeze_support()
    url_list = []  # every article URL collected so far
    for x in range(1, 6000):
        url = "https://edition.cnn.com/search/?q=%20news&size=10&from=" + str(x * 10) + "&page=" + str(x + 1)
        cnn_paper = newspaper.build(url, memoize_articles=False)  # ~15 seconds
        print(len(cnn_paper.articles))
        for article in cnn_paper.articles:
            if article.url not in url_list:
                url_list.append(article.url)
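
To make the paging arithmetic and the duplicate check easier to verify separately from the download step, here is a small sketch. It assumes the search URL keeps the `q`, `size`, `from` and `page` parameters seen in the links above, with `from = (page - 1) * size`. The helper names `search_page_url` and `add_new_urls` are hypothetical, not part of newspaper.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://edition.cnn.com/search/"


def search_page_url(page, query=" news", size=10):
    """Build the search URL for a 1-based page number."""
    params = {
        "q": query,
        "size": size,
        "from": (page - 1) * size,  # offset of the first result on this page
        "page": page,
    }
    # quote_via=quote encodes the space as %20 (the default would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def add_new_urls(seen, urls):
    """Add unseen URLs to the set `seen` and return them in order."""
    new_urls = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls
```

With `seen = set()` kept across the loop, a page for which `add_new_urls` returns an empty list produced nothing but links that were already collected, which is the symptom described for pages 556 and 557.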