Yes, every time I grab a link I have to use the urlparse.urljoin method:
    import urlparse
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  # only grab urls with "content" in the href
        for i in urls:
            yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)
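In case it helps, here is a minimal sketch of what urljoin actually does with a relative href; the URLs are made-up examples. Note that i[1:] drops the first character of each href, so a site-root-relative "/content/..." becomes page-relative "content/...", which changes how it resolves:

    # Python 2 (the equivalent on Python 3 is urllib.parse.urljoin).
    import urlparse

    page = 'http://www.example.com/listing/page1.html'
    print urlparse.urljoin(page, '/content/123')  # -> http://www.example.com/content/123
    print urlparse.urljoin(page, 'content/123')   # -> http://www.example.com/listing/content/123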
I take it you are trying to grab the whole url so you can parse it? If that is the case, a simple two-method system works on a BaseSpider: the parse method finds the links and sends them to the parse_url method, which returns what you extracted to the pipeline.
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  # only grab urls with "content" in the href
        for i in urls:
            yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        item = ZipgrabberItem()
        item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()  # this grabs it
        return item
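For completeness, here is a sketch of how the two methods might sit together on a BaseSpider (old Scrapy 0.x API, matching HtmlXPathSelector above). The spider name, allowed domains, start url, and the single-field item definition are assumptions for illustration, not from your project:

    import urlparse

    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider


    class ZipgrabberItem(Item):
        # hypothetical item with the one field used above
        zip = Field()


    class ZipgrabberSpider(BaseSpider):
        name = 'zipgrabber'                       # assumed name
        allowed_domains = ['example.com']         # assumed domain
        start_urls = ['http://www.example.com/']  # assumed start page

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
            for i in urls:
                yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector(response)
            item = ZipgrabberItem()
            item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
            return item

Every item parse_url returns gets handed to whatever item pipelines you have enabled, so the export/storage logic stays out of the spider.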