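I want to process several HTML pages that contain tables.
The pages:
- contain several tables without a class, so the only way to identify the right table is by its content
- the wanted table has the value "Content" in its first cell
Question: how can I find the right table based on a cell value, using Web::Scraper, Scrappy, or some other tool?
Example code: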
#!/usr/bin/env perl
use 5.014;
use warnings;
use Web::Scraper;
use YAML;
my $html = do { local $/; <DATA> };
my $table = scraper {
    # the easy way: a table with a class, an id, or any other attribute
    # process 'table.xxx > tr', 'rows[]' => scraper {
    # unfortunately, the table has no class='xxx', so :(
    process 'NEED_HELP_HERE > tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};
my $result = $table->scrape( $html );
say Dump($result);
__DATA__
<html>
<head><title>title</title></head>
<body>
<table><tr><th class="inverted">header</th><td>value</td></tr></table>
<!-- several other tables appear here (their number varies) -->
<table> <!-- would be easy with some class="xxx" -->
<tr>
<th class="inverted">Content</th> <!-- Need this table - 1st cell == "Content" -->
<td class="inverted">col-1</td>
<td class="inverted">col-n</td>
</tr>
<tr>
<th>Date</th>
<td>2012</td>
<td>2001</td>
</tr>
<tr>
<th>Banana</th>
<td>val-1</td>
<td>val-n</td>
</tr>
</table>
</body>
</html>
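One possible direction, offered as a sketch rather than a definitive answer: the `process` call in Web::Scraper accepts an XPath expression as well as a CSS selector (a selector starting with `/` is treated as XPath), so the table could be selected by the text of its first header cell. The XPath expression and the shortened sample HTML below are assumptions for illustration. The sketch also assumes that the `tr` elements are direct children of `table`, as in the sample page.

```perl
#!/usr/bin/env perl
use 5.014;
use warnings;
use Web::Scraper;

# Shortened copy of the sample page (illustrative only).
my $html = <<'HTML';
<html><body>
<table><tr><th>header</th><td>value</td></tr></table>
<table>
<tr><th>Content</th><td>col-1</td><td>col-n</td></tr>
<tr><th>Date</th><td>2012</td><td>2001</td></tr>
</table>
</body></html>
HTML

# Assumed XPath: pick every table whose first row starts with a
# header cell reading "Content", then take that table's rows.
my $xpath = '//table[tr[1]/th[1][normalize-space(.)="Content"]]/tr';

my $table = scraper {
    process $xpath, 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};

my $result = $table->scrape($html);

# Print the header cell of each row found in the matched table.
say $_->{header} for @{ $result->{rows} };
```

If the real pages wrap rows in `tbody`, or the "Content" cell is a `td` rather than a `th`, the expression would need adjusting.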