I am trying to scrape an XML file with the following format.
Sample XML file:
<rss version="2.0">
  <channel>
    <item>
      <title>SENIOR BUDGET ANALYST (new)</title>
      <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
      <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
      <category>All Open Jobs</category>
    </item>
    <item>
      <title>BUDGET ANALYST (healthcare)</title>
      <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
      <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
      <category>All category</category>
    </item>
  </channel>
</rss>
Below is my spider.py code:
class TestSpider(XMLFeedSpider):
    name = "testproject"
    allowed_domains = {"www.example.com"}
    start_urls = [
        "https://www.example.com/hrapp/rss/careers_jo_rss.xml"
    ]
    iterator = 'iternodes'
    itertag = 'channel'

    def parse_node(self, response, node):
        title = node.select('item/title/text()').extract()
        link = node.select('item/link/text()').extract()
        pubdate = node.select('item/pubDate/text()').extract()
        category = node.select('item/category/text()').extract()
        item = TestprojectItem()
        item['title'] = title
        item['link'] = link
        item['pubdate'] = pubdate
        item['category'] = category
        return item
Result:
2012-07-25 13:24:14+0530 [testproject] DEBUG: Scraped from <200 https://hr.templehealth.org/hrapp/rss/careers_jo_rss.xml>
{'title': [u'SENIOR BUDGET ANALYST (hospital/healthcare)',
u'BUDGET ANALYST'],
'link': [u'https://hr.example.org/psp/hrapp&SeqId=1',
u'https://hr.example.org/psp/hrapp&SeqId=2']
'pubdate': [u'Wed, 18 Jul 2012 04:00:00 GMT',
u'Wed, 18 Jul 2012 04:00:00 GMT']
'category': [u'All Open Jobs',
u'All category']
}
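As you can see from the result above, the values from each tag are all merged into a single list across every item. What I want instead is one record per item tag, as shown below, the same way we would do it when scraping HTML.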
{'title': u'SENIOR BUDGET ANALYST (hospital/healthcare)'
'link': u'https://hr.example.org/psp/hrapp&SeqId=1'
'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT'
'category': u'All Open Jobs'
}
{'title': u'BUDGET ANALYST'
'link': u'https://hr.example.org/psp/hrapp&SeqId=2'
'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT'
'category': u'All category'
}
How can we scrape the XML data separately for each parent tag, such as the item tag above?
Thanks in advance.
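To make the desired grouping concrete, here is a minimal sketch of per-item extraction using Python's standard xml.etree.ElementTree rather than Scrapy. The function name extract_items and the SAMPLE_XML constant are made up for illustration, and the "&" in the links is written as "&amp;" because ElementTree only parses well-formed XML. The idea is to iterate over each item node and query relative to it, instead of querying from channel. In the spider itself, the analogous change would presumably be setting itertag = 'item' and using relative paths such as 'title/text()' in parse_node, though that is not tested against this feed.

```python
import xml.etree.ElementTree as ET

# Copy of the sample feed from the question. "&" is escaped as "&amp;"
# here (an assumption) because ElementTree requires well-formed XML.
SAMPLE_XML = """<rss version="2.0">
<channel>
<item>
<title>SENIOR BUDGET ANALYST (new)</title>
<link>https://hr.example.org/psp/hrapp&amp;SeqId=1</link>
<pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
<category>All Open Jobs</category>
</item>
<item>
<title>BUDGET ANALYST (healthcare)</title>
<link>https://hr.example.org/psp/hrapp&amp;SeqId=2</link>
<pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
<category>All category</category>
</item>
</channel>
</rss>"""


def extract_items(xml_text):
    """Return one dict per <item> instead of one dict of merged lists."""
    root = ET.fromstring(xml_text)
    records = []
    # Loop over each <item> and look up its children relative to that node,
    # so values from different items never share a list.
    for node in root.iter("item"):
        records.append({
            "title": node.findtext("title"),
            "link": node.findtext("link"),
            "pubdate": node.findtext("pubDate"),
            "category": node.findtext("category"),
        })
    return records


records = extract_items(SAMPLE_XML)
for record in records:
    print(record)
```

Each printed dict corresponds to exactly one item element, which is the shape shown in the desired output above.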