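As general advice, it's usually easier to craft XPath expressions by going step by step down the tree, rather than using //typeiwant to jump straight to the target and then adding predicates for things earlier in the tree (using parent or ancestor).
Let's see how you can address your use case with Scrapy selectors: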
>>> import scrapy
>>> t = '''<table class="wh_preview_detail" border="1">
... <tr>
... <th colspan="3">
... <span class="wh_preview_detail_heading">Names</span>
... </th>
... </tr>
... <tr>
... <th>Role</th>
... <th>Name No</th>
... <th>Name</th>
... </tr>
... <tr>
... <td>Requestor</td>
... <td>589528</td>
... <td>John</td>
... </tr>
... <tr>
... <td>Helper</td>
... <td>589528</td>
... <td>Mary</td>
... </tr>
... </table>'''
>>> selector = scrapy.Selector(text=t, type="html")
>>>
>>> # what you want comes inside a <table>,
>>> # after a <tr> that has a child `<th>` with text "Role" inside
>>> selector.xpath('//table/tr[th[1]="Role"]')
[<Selector xpath='//table/tr[th[1]="Role"]' data=u'<tr>\n <th>Role</th>\n <th>Name '>]
>>>
>>> # check with .extract() if that's the one...
>>> selector.xpath('//table/tr[th[1]="Role"]').extract()
[u'<tr>\n <th>Role</th>\n <th>Name No</th>\n <th>Name</th>\n </tr>']
>>>
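Then, the rows you are interested in are the <tr> elements at the same tree level as the row containing "Role". In XPath terms, these <tr> elements are along the following-sibling axis: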
>>> for row in selector.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
... print('------')
... print(row.extract())
...
------
<tr>
<td>Requestor</td>
<td>589528</td>
<td>John</td>
</tr>
------
<tr>
<td>Helper</td>
<td>589528</td>
<td>Mary</td>
</tr>
>>>
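To make concrete what the following-sibling axis selects, here is a minimal sketch that emulates it in plain Python with the standard library's xml.etree.ElementTree (its limited XPath subset has no following-sibling axis, so the sibling walk is done by hand). The names TABLE and following_sibling_rows are made up for this illustration, and it assumes the fragment is well-formed XML, which holds for this table but not for HTML in general.

```python
import xml.etree.ElementTree as ET

# Same table as above, condensed; well-formed, so it parses as XML.
TABLE = """<table class="wh_preview_detail" border="1">
  <tr><th colspan="3"><span class="wh_preview_detail_heading">Names</span></th></tr>
  <tr><th>Role</th><th>Name No</th><th>Name</th></tr>
  <tr><td>Requestor</td><td>589528</td><td>John</td></tr>
  <tr><td>Helper</td><td>589528</td><td>Mary</td></tr>
</table>"""


def following_sibling_rows(table, header_text):
    """Return the <tr> children of `table` that come after the header row.

    The header row is the first <tr> whose first <th> child has the
    (whitespace-trimmed) text `header_text`. Trimming is slightly more
    lenient than the exact comparison in th[1]="Role".
    """
    rows = table.findall("tr")  # direct <tr> children, in document order
    for index, row in enumerate(rows):
        first_th = row.find("th")  # first <th> child, or None
        if first_th is None:
            continue
        # itertext() gathers nested text too, like an XPath string value
        if "".join(first_th.itertext()).strip() == header_text:
            return rows[index + 1:]  # siblings after the header row
    return []


table = ET.fromstring(TABLE)
data_rows = following_sibling_rows(table, "Role")
cells = [[td.text for td in row.findall("td")] for row in data_rows]
print(cells)
```

The point is only that following-sibling::tr means "the <tr> elements with the same parent that come later in document order"; with Scrapy the one-line XPath above does the same job.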
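So you have each row, and each row has 3 cells that map to 3 fields (normalize-space() trims leading and trailing whitespace and collapses runs of whitespace inside each cell):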
>>> for row in selector.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
... print({
... "role": row.xpath('normalize-space(./td[1])').extract_first(),
... "number": row.xpath('normalize-space(./td[2])').extract_first(),
... "name": row.xpath('normalize-space(./td[3])').extract_first(),
... })
...
{'role': u'Requestor', 'number': u'589528', 'name': u'John'}
{'role': u'Helper', 'number': u'589528', 'name': u'Mary'}
>>>