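Here is some proof-of-concept code to get you going. It is only meant to show that BeautifulSoup4 is really powerful and definitely sufficient for your first phase of scraping.
You also need to read CNN's terms of service to check whether scraping is allowed. Every detail of the code below is explained in the BS4 documentation, or you can start your career on Stack Overflow and learn each detail from the community, like I did :) Good luck and enjoy it!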

    # Python 2 code: urllib2 and the print statement do not exist in Python 3.
    from bs4 import BeautifulSoup, SoupStrainer
    import urllib2
    import re

    def main():
        # Send a browser-like User-agent header with the request.
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        url = 'http://www.cnn.com/2013/10/29/us/florida-shooting-cell-phone-blocks-bullet/index.html?hpt=ju_c2'
        soup = BeautifulSoup(opener.open(url))
        #1) Link to the website
        #2) Date article published
        date = soup.find("div", {"class":"cnn_strytmstmp"}).text.encode('utf-8')
        #3) title of article
        title = soup.find("div", {"id":"cnnContentContainer"}).find('h1').text.encode('utf-8')
        #4) Text of the article
        paragraphs = soup.find('div', {"class":"cnn_strycntntlft"}).find_all('p')
        text = " ".join([ paragraph.text.encode('utf-8') for paragraph in paragraphs])
        print url
        print date
        print title
        print text

    if __name__ == '__main__':
        main()

The output looks like this:

    http://www.cnn.com/2013/10/29/us/florida-shooting-cell-phone-blocks-bullet/index.html?hpt=ju_c2
    updated 7:34 AM EDT, Tue October 29, 2013
    Cell phone stops bullet aimed at Florida gas station clerk
    (CNN) -- A gas station clerk's smartphone may... the TV station reported.

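If you are on Python 3, the same idea can be sketched roughly as follows. This is only a sketch under some assumptions: the names `fetch`, `text_of`, `extract_article` and `SAMPLE_HTML` are made up for illustration, the class and id names are copied from the code above and may no longer match CNN's markup, and the sample HTML is a hand-written stand-in rather than a real CNN page. Unlike the original, it returns `None` for a missing element instead of raising.

```python
import urllib.request

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page with a browser-like User-Agent header (not called below)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def text_of(node):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return node.get_text().strip() if node is not None else None


def extract_article(html):
    """Extract date, title and body text using the selectors from the code above."""
    soup = BeautifulSoup(html, "html.parser")

    # 2) Date the article was published
    date = text_of(soup.find("div", {"class": "cnn_strytmstmp"}))

    # 3) Title of the article
    container = soup.find("div", {"id": "cnnContentContainer"})
    title = text_of(container.find("h1")) if container is not None else None

    # 4) Text of the article: all <p> tags joined with single spaces
    body = soup.find("div", {"class": "cnn_strycntntlft"})
    paragraphs = body.find_all("p") if body is not None else []
    text = " ".join(p.get_text().strip() for p in paragraphs)

    return {"date": date, "title": title, "text": text}


# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<div id="cnnContentContainer">
  <h1>Sample headline</h1>
  <div class="cnn_strytmstmp">updated 7:34 AM EDT, Tue October 29, 2013</div>
  <div class="cnn_strycntntlft">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

article = extract_article(SAMPLE_HTML)
print(article["date"])
print(article["title"])
print(article["text"])
```

Against a live page you would call `extract_article(fetch(url))` instead of passing `SAMPLE_HTML`. Keeping the download separate from the parsing makes it easy to try out the selectors on a saved copy of the page.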
I have also written up a bit of philosophy on how we should locate elements: link here.
There are also Selenium and Scrapy, which you may run into later.