html - Ruby Nokogiri: extract a table that spans multiple pages / merge consecutive tables

Question

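I need to extract the first table from a section of an HTML document (the whole first table, not just the content of the first table tag). A table may be spread across multiple pages, so it may be split across several table tags, and the section may contain more than one table. My logic is: if there is a text node between two table tags, they are different tables; if there is no text node between them, they are parts of the same table. How can I implement this?

I am not using XPath to find the first table because I first need to identify the right section by checking each text node with a regular expression.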

html='<body>
<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>
<table border="1">
<tr>
<td>row 3, cell 1</td>   
<td>row 3, cell 2</td>
</tr>
<tr> 
<td>row 4, cell 1</td>
<td>row 4, cell 2</td>
</tr>
</table>
<p>text </p>                       <!-- split by text: the table below is a different table -->
<table border="1">
<tr>
<td>row 5, cell 1</td>
<td>row 5, cell 2</td>
</tr>
<tr>
<td>row 6, cell 1</td>
<td>row 6, cell 2</td>
</tr>
</table>
</body>'

This is my current code. It only selects the first table tag rather than the first table (rows 1-4 in my example). I use the table_parser gem to extract the table.

require 'nokogiri'
require 'table_parser'

doc = Nokogiri::HTML(html)
table = Array.new

i = 0
doc.traverse do |node|
    if node.name == 'table' && i == 0
        table = TableParser::Parser::extract_table(node, node.path)
        i +=1
    end
end

puts table

score 0 · Accepted Answer

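It sounds like you want to merge consecutive tables:

# Find each table that directly follows another table, then reverse the list
# so that you iterate from bottom to top.
doc.search('table + table').to_a.reverse.each do |table|
  # previous_element (rather than previous) skips any text node that the
  # parser may have kept between the two tables.
  target = table.previous_element
  # Move each of this table's rows into the preceding table.
  table.search('tr').each { |tr| target.add_child(tr) }
  # Then remove the now-empty table.
  table.remove
end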
