html-parsing - Extract text from a specific HTML location across multiple pages

Question

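I have been experimenting with Jericho HTML Parser and Selenium IDE in order to extract text from a specific location in the HTML across multiple pages.

I haven't found a simple example of how to do this, and I don't know Java.

For every HTML page in a folder, I want to find whatever text string sits in the first table, 4th row, 1st div: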

<table>
 <tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
 <tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
 <tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>    
 <tr class="abc"><td class="xyz"><div align="center">The Text I want</div></td></tr>
</table>

and print the selected text to a txt file as a list, like this:

    The Text I want
    Another Text I want

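All source files are stored locally and may contain malformed HTML, so I thought Jericho might be the best fit for this. However, I am happy to learn any method that achieves the desired result.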

score 0 · Accepted Answer

In the end I went with BeautifulSoup and used a Python script similar to this:

from bs4 import BeautifulSoup

# html_pathname and result_file are assumed to be defined by the surrounding script
# open the source HTML file
with open(html_pathname, 'r') as html_file:
    # parse the file into a tag tree with BeautifulSoup
    soup = BeautifulSoup(html_file, 'html.parser')
    # locate the target: 1st table, 6th tr, 1st td, 1st div
    # (index [4] is the 5th sibling after the first tr; use [2] for the 4th row)
    div = soup.html.body.table.tr.find_next_siblings('tr')[4].td.div
    # write the found text to the result txt
    print(' - writing to result txt')
    result_file.write(div.get_text() + '\n')
    print(' - ok!')
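
The snippet above handles a single file. A minimal sketch of the whole task from the question (every page in a folder, first table, 4th row, 1st div, one line of output per page) might look like the following. It assumes the `bs4` package is installed, that the pages end in `.html`, and that they can be read as UTF-8; the function names `extract_cell_text` and `extract_folder` are made up for this illustration and are not part of BeautifulSoup.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_cell_text(html, row_number=4):
    """Return the text of the first div in the given row of the first table, or None."""
    # 'html.parser' is the stdlib-backed parser; it tolerates unbalanced tags
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return None
    # note: find_all is recursive, so rows of nested tables are counted too
    rows = table.find_all("tr")
    if len(rows) < row_number:
        return None
    div = rows[row_number - 1].find("div")
    return div.get_text(strip=True) if div is not None else None


def extract_folder(folder, out_path, row_number=4):
    """Write one extracted line per HTML file in folder to out_path; return the lines."""
    lines = []
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        text = extract_cell_text(html, row_number)
        # pages without a matching table, row, or div are skipped
        if text is not None:
            lines.append(text)
    Path(out_path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```

With the pages in a hypothetical folder named `pages`, calling `extract_folder("pages", "result.txt")` would write one extracted string per line. Selecting the row by position with `find_all("tr")` avoids the `soup.html.body` chain, which fails on pages that lack those tags.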
