string - How to parse HTML with sed - extract two strings, each delimited by two strings - on different lines, in order

Question

I have a bash script:

v1='value="'
v2='" type'

do_parse_html_file() {
   sed -n "s/.*${v1}//;s/${v2}.*//p" "${_SCRIPT_PATH}/IBlockListLists.html"|egrep '^http' >${_tmp_file}
}

...which extracts only the URLs from the html file. I would like the output to be:

somename URL
somename URL

--- An example of the input html file:

</tr>
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>Bluetack</td>
<td><img style="border:0;" src="I-BlockList%20%7C%20Lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="bcoepfyewziejvcqyhqo" readonly="readonly" onclick="select_text('bcoepfyewziejvcqyhqo');" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
<tr class="alt02">
<td><b><a href="http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib">iana-private</a></b></td>
<td>Bluetack</td>
<td><img style="border:0;" src="I-BlockList%20%7C%20Lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="cslpybexmxyuacbyuvib" readonly="readonly" onclick="select_text('cslpybexmxyuacbyuvib');" value="http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>

--- The result should look like this:

iana-reserved http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&fileformat=p2p&archiveformat=gz
iana-private http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&fileformat=p2p&archiveformat=gz

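--- Is it possible to get this with a one-line sed command? If so, please help.

The first part of each entry, "somename", always comes first; the URL that belongs to it is on a following line, though not necessarily the very next one.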

>somename   ... is delimited by   'href="URL">'   and   '</a>'       on one line           
>URL ... is always delimited by   'value="'       and   '" type'     on any following line

Thank you,
Kind regards.
M.

score 2 · Accepted Answer

With my CLI HTML parser, Xidel, it is a one-liner:

xidel "${_SCRIPT_PATH}/IBlockListLists.html" -e '//a/concat(., " ", @href)'

score 1 · Accepted Answer

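The shell is not the right tool for this.

I can show you some scripts that do this with an HTML parser in python or perl (ruby, java, php). Those are the right tools for the job.

This is probably the most discussed issue on this site; see this excellent post

One of the people who made this site has also written about this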
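As an illustration of the parser-based approach, here is a minimal Python sketch using the standard library's html.parser module. The names ListRowParser and extract_pairs and the inline sample are made up for this example. It assumes that each name is the text of an <a> element and that its URL is the value attribute of the next <input> element whose value starts with http, as in the sample from the question.

```python
from html.parser import HTMLParser


class ListRowParser(HTMLParser):
    """Pair each link text with the next <input value="http..."> after it.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []        # collected (name, url) tuples
        self._in_link = False  # True while inside an <a> element
        self._name = None      # latest link text still waiting for its URL

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._name = ""
        elif tag == "input" and self._name:
            # Attribute values arrive with entities decoded (&amp; -> &).
            url = dict(attrs).get("value") or ""
            if url.startswith("http"):
                self.pairs.append((self._name.strip(), url))
                self._name = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link and self._name is not None:
            self._name += data


def extract_pairs(html_text):
    """Return a list of (name, url) tuples in document order."""
    parser = ListRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


# Shortened copy of the sample rows from the question.
sample = """
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>Bluetack</td>
<td><input id="bcoepfyewziejvcqyhqo" readonly="readonly" onclick="select_text('bcoepfyewziejvcqyhqo');" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
<tr class="alt02">
<td><b><a href="http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib">iana-private</a></b></td>
<td>Bluetack</td>
<td><input id="cslpybexmxyuacbyuvib" readonly="readonly" onclick="select_text('cslpybexmxyuacbyuvib');" value="http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
"""

if __name__ == "__main__":
    for name, url in extract_pairs(sample):
        print(name, url)
```

On the sample this prints one "somename URL" line per table row, with &amp; already decoded to &. To run it on the real file, read IBlockListLists.html into a string and pass it to extract_pairs instead of the sample.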

score 0 · Accepted Answer

Use a parser. There are many of them; here is one using HTML::TokeParser.

Contents of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TokeParser;

my $p = HTML::TokeParser->new( shift ) || die;

while ( my $tag = $p->get_tag( 'a' ) ) { 
    printf qq|%s %s\n|, $p->get_text, $tag->[1]{href};
}

Run it like this:

perl-5.14.2 script.pl htmlfile

This yields:

iana-reserved http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo
iana-private http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib
