I'm trying to learn Python and Beautiful Soup by using ScraperWiki. I want a list of all the Kickstarter projects in Edmonton.
I have successfully scraped the page I'm looking for and pulled out the data I want. I just can't get that data formatted and exported to a database.
Console output:
Line 42 - url = link["href"]
/usr/local/lib/python2.7/dist-packages/bs4/element.py:879 -- __getitem__((self=<h2 class="bbcard_nam...more
KeyError: 'href'
Code:
import scraperwiki
from bs4 import BeautifulSoup

search_page = "http://www.kickstarter.com/projects/search?term=edmonton"

html = scraperwiki.scrape(search_page)
soup = BeautifulSoup(html)

max = soup.find("p", {"class": "blurb"}).get_text()
num = int(max.split(" ")[0])

if num % 12 != 0:
    last_page = int(num/12) + 1
else:
    last_page = int(num/12)

for n in range(1, last_page + 1):
    html = scraperwiki.scrape(search_page + "&page=" + str(n))
    soup = BeautifulSoup(html)
    projects = soup.find_all("h2", {"class": "bbcard_name"})
    counter = (n-1)*12 + 1
    print projects
    for link in projects:
        url = link["href"]
        data = {"URL": url, "id": counter}
        # save into the data store, giving the unique parameter
        scraperwiki.sqlite.save(["URL"], data)
        counter += 1
The anchor with the href is nested inside each <h2> element in projects. How do I get the URL out of each element in the for loop?
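My best guess so far is that the href lives on the <a> tag inside each <h2>, not on the <h2> itself, so the inner loop would need something like the sketch below (untested; the bbcard_name markup is just what I saw on the page I scraped):

    for link in projects:
        # guess: reach into the <h2> for its nested <a> and read the href there
        anchor = link.find("a")
        if anchor is not None and anchor.has_attr("href"):
            url = anchor["href"]
            data = {"URL": url, "id": counter}
            scraperwiki.sqlite.save(["URL"], data)
            counter += 1

Is that the right way to do it, or is there a cleaner way to pull the URL out of each element?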