我正在尝试使用 scrapy 抓取 craigslist 并已成功获取 url 但现在我想从 url 的页面内提取数据。以下是代码:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem
class craigslist_spider(BaseSpider):
name = "craigslist_unique"
allowed_domains = ["craiglist.org"]
start_urls = [
"http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
"http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
"http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select("//span[@class='pl']")
items = []
for site in sites:
item = CraigslistItem()
item['title'] = site.select('a/text()').extract()
item['link'] = site.select('a/@href').extract()
#item['desc'] = site.select('text()').extract()
items.append(item)
hxs = HtmlXPathSelector(response)
#print title, link
return items
我是scrapy的新手,无法弄清楚如何实际点击url(href)并在该url的页面中获取数据并为所有url执行此操作。