ruby - Nokogiri next_element with a filter

Question

Suppose I have a malformed HTML page:

<table>
 <thead>
  <th class="what_I_need">Super sweet text<th>
 </thead>
 <tr>
  <td>
    I also need this
  </td>
  <td>
    and this (all td's in this and subsequent tr's)
  </td>
 </tr>
 <tr>
   ...all td's here too
 </tr>
 <tr>
   ...all td's here too
 </tr>
</table>

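With BeautifulSoup, we can get the <th> and then call findNext("td"). Nokogiri has a next_element call, but it may not return what I want (in this case it would return the tr element).

Is there a way to filter Nokogiri's next_element call? For example, next_element("td")?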

EDIT

To clarify, I'll be looking at many sites, most of which are malformed in different ways.

For example, the next site might be:

<table>
 <th class="what_I_need">Super sweet text<th>
 <tr>
  <td>
    I also need this
  </td>
  <td>
    and this (all td's in this and subsequent tr's)
  </td>
 </tr>
 <tr>
   ...all td's here too
 </tr>
 <tr>
   ...all td's here too
 </tr>
</table>

I can't assume any structure, except that there will be trs below the item that has the class what_I_need.

score 2 · Accepted Answer

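First, note that your closing th tag is malformed: <th>. It should be </th>. Fixing that helps.

One way is to use XPath to navigate to the td once you've found the th node: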

require 'nokogiri'

html = '
<table>
<thead>
  <th class="what_I_need">Super sweet text</th>
</thead>
<tr>
  <td>
    I also need this
  </td>
</tr>
</table>
'

doc = Nokogiri::HTML(html)

th = doc.at('th.what_I_need')
th.text # => "Super sweet text"
td = th.at('../../tr/td')
td.text # => "\n    I also need this\n  "

This takes advantage of Nokogiri's ability to use either CSS accessors or XPath, and to switch between them fairly transparently.

Once you have the <th> node, you can also navigate using some of Node's methods:

th.parent.next_element.at('td').text # => "\n    I also need this\n  "

Another way is to start at the top of the table and look down:

table = doc.at('table')
th = table.at('th')
th.text # => "Super sweet text"
td = table.at('td')
td.text # => "\n    I also need this\n  "

If you need to access all the <td> tags in the table, you can easily iterate over them:

table.search('td').each do |td|
  # do something with the td...
  puts td.text
end

If you want the contents of the <td> tags grouped by their containing <tr>, iterate over the rows, then the cells:

table.search('tr').each do |tr|
  cells = tr.search('td').map(&:text)
  # do something with all the cells
end
