I have written a scraper whose crawl ends with the following two functions.
    def parse_summary(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        soup = BeautifulSoup(hxs.select("//div[@class='PrimaryContent']").extract()[0])
        text = soup.get_text()
        item['main_summary'] = text
        summary_links = hxs.select("//ul[@class='module_leftnav']/li/a/@href").extract()
        chap_summary_links = [urljoin(response.url, link) for link in summary_links]
        for link in chap_summary_links:
            print 'yielding request to chapter summary.'
            yield Request(link, callback=self.parse_chap_summary_link, meta={'item': item})

    def parse_chap_summary_link(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['chapter_summaries'] = hxs.select("//h1/text()").extract()
        soup = BeautifulSoup(hxs.select("//div[@class='PrimaryContent']").extract()[0])
        text = soup.get_text()
        item['chapter_summaries'] += [text]
        yield item
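At the bottom of parse_summary, I yield one request per chapter link to parse_chap_summary_link, which extracts the data from each chapter summary page. This works, but the output I get is: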
{item 1, [chapter 1 summary]}
{item 1, [chapter 2 summary]}
But what I want is:
{item 1, [Chapter 1 summary, Chapter 2 Summary]}
{item 2, [Chapter 1 summary, Chapter 2 Summary, Chapter 3 etc etc]}
How can I collect all of the chapter summaries into a single item, instead of getting a new item for each chapter summary?
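For reference, here is a minimal sketch of one pattern that might produce the desired shape: instead of yielding every chapter request at once, keep the list of pending links in meta next to the item, request one chapter at a time, and only yield the item when the list is empty. To keep the sketch runnable on its own it does not import Scrapy at all. `Request`, `Response`, `PAGES`, `next_step` and `crawl` are hypothetical stand-ins written for this illustration, and the page data is made up. In a real spider the same idea would presumably use Scrapy's own Request with the meta dict.

```python
from collections import deque, namedtuple

# Hypothetical stand-ins for Scrapy's Request/Response (assumption: only
# the fields needed to show the chaining pattern are modelled here).
Request = namedtuple("Request", "url callback meta")
Response = namedtuple("Response", "url meta body")

# Made-up "site": url -> page text, purely for illustration.
PAGES = {
    "book1/ch1": "Chapter 1 summary",
    "book1/ch2": "Chapter 2 summary",
}


def next_step(item, pending):
    """Return a request for the next pending chapter, or the finished item."""
    if pending:
        url = pending.pop(0)
        meta = {"item": item, "pending": pending}
        return Request(url, parse_chap_summary_link, meta)
    return item


def parse_summary(item, chapter_links):
    """Start the chain: initialise the list once, then request chapter 1."""
    item["chapter_summaries"] = []
    return next_step(item, list(chapter_links))


def parse_chap_summary_link(response):
    """Append this chapter to the shared item, then continue the chain."""
    item = response.meta["item"]
    item["chapter_summaries"].append(response.body)
    yield next_step(item, response.meta["pending"])


def crawl(start):
    """Tiny scheduler: runs requests in order and collects yielded items."""
    queue = deque([start])
    items = []
    while queue:
        obj = queue.popleft()
        if isinstance(obj, Request):
            response = Response(obj.url, obj.meta, PAGES[obj.url])
            queue.extend(obj.callback(response))
        else:
            items.append(obj)
    return items


items = crawl(parse_summary({"title": "item 1"}, ["book1/ch1", "book1/ch2"]))
print(items)
```

The key differences from the code in the question are that the item is emitted exactly once, at the end of the chain, and that chapter_summaries is appended to rather than reassigned in every callback.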