xml - XPath parent and descendants in Scrapy

Question

I am using the code

response.xpath("//*[contains(text(), 'Role')]/parent/parent/descendant::td//text()").extract()

to select the text() content of every td in the rows of the table where the word "Role" is found, in the following HTML:

<table class="wh_preview_detail" border="1">
   <tr>
      <th colspan="3">
         <span class="wh_preview_detail_heading">Names</span>
      </th>
   </tr>
   <tr>
      <th>Role</th>
      <th>Name No</th>
      <th>Name</th>
   </tr>
   <tr>
      <td>Requestor</td>
      <td>589528</td>
      <td>John</td>
   </tr>
   <tr>
      <td>Helper</td>
      <td>589528</td>
      <td>Mary</td>
   </tr>
</table>

The "Role" keyword only serves as an identifier for the table.

In this case, I expect the result:

['Requestor', '589528', 'John', ...]

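However, when I run it in Scrapy, I get an empty array.

My goal is to eventually group the elements back into records. I have spent hours trying other people's examples and experimenting in the terminal and in Chrome, but for now anything beyond "simple" XPath is over my head. I want to understand XPath, so ideally I would like a generalized answer with an explanation, so that I can learn and share. Thank you very much.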

score 4 · Accepted Answer

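As a general tip, it is usually easier to build an XPath expression by walking down the tree step by step, rather than selecting //typeiwant all the way down and then adding predicates about what comes earlier in the tree (using parent or ancestor).

Let's see how to solve your use case with Scrapy selectors: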

>>> import scrapy
>>> t = '''<table class="wh_preview_detail" border="1">
...    <tr>
...       <th colspan="3">
...          <span class="wh_preview_detail_heading">Names</span>
...       </th>
...    </tr>
...    <tr>
...       <th>Role</th>
...       <th>Name No</th>
...       <th>Name</th>
...    </tr>
...    <tr>
...       <td>Requestor</td>
...       <td>589528</td>
...       <td>John</td>
...    </tr>
...    <tr>
...       <td>Helper</td>
...       <td>589528</td>
...       <td>Mary</td>
...    </tr>
... </table>'''
>>> selector = scrapy.Selector(text=t, type="html")
>>>
>>> # what you want comes inside a <table>,
>>> # after a <tr> that has a child `<th>` with text "Role" inside
>>> selector.xpath('//table/tr[th[1]="Role"]')
[<Selector xpath='//table/tr[th[1]="Role"]' data=u'<tr>\n      <th>Role</th>\n      <th>Name '>]
>>>
>>> # check with .extract() that it's the one...
>>> selector.xpath('//table/tr[th[1]="Role"]').extract()
[u'<tr>\n      <th>Role</th>\n      <th>Name No</th>\n      <th>Name</th>\n   </tr>']
>>>

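Then, the <tr> rows you are interested in sit at the same tree level as the row containing "Role". In XPath terms, these <tr> elements are on the following-sibling axis: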

>>> for row in selector.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
...     print('------')
...     print(row.extract())
... 
------
<tr>
      <td>Requestor</td>
      <td>589528</td>
      <td>John</td>
   </tr>
------
<tr>
      <td>Helper</td>
      <td>589528</td>
      <td>Mary</td>
   </tr>
>>>

So you have each row, and each row has 3 cells that map to 3 fields:

>>> for row in selector.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
...     print({
...         "role": row.xpath('normalize-space(./td[1])').extract_first(),
...         "number": row.xpath('normalize-space(./td[2])').extract_first(),
...         "name": row.xpath('normalize-space(./td[3])').extract_first(),
...     })
... 
{'role': u'Requestor', 'number': u'589528', 'name': u'John'}
{'role': u'Helper', 'number': u'589528', 'name': u'Mary'}
>>>
