Is it possible to use Web::Scraper in Perl to produce XML output from a web page? For example, my HTML looks like this (I extracted this part of the HTML from the URL):
> <table class="reference">
> <tr>
> <th width="23%" align="left">Property</th>
> <th width="71%" align="left">Description</th>
> <th style="text-align:center;">DOM</th>
> </tr>
> <tr>
> <td><a href="prop_node_attributes.asp">attributes</a></td>
> <td>Returns a collection of a node's attributes</td>
> <td style="text-align:center;">1</td>
> </tr>
>
> <tr>
> <td><a href="prop_node_baseuri.asp">baseURI</a></td>
> <td>Returns the absolute base URI of a node</td>
> <td style="text-align:center;">3</td>
> </tr>
> <tr>
> <td><a href="prop_node_childnodes.asp">childNodes</a></td>
> <td>Returns a NodeList of child nodes for a node</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_firstchild.asp">firstChild</a></td>
> <td>Returns the first child of a node</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_lastchild.asp">lastChild</a></td>
> <td>Returns the last child of a node</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_localname.asp">localName</a></td>
> <td>Returns the local part of the name of a node</td>
> <td style="text-align:center;">2</td>
> </tr>
> <tr>
> <td><a href="prop_node_namespaceuri.asp">namespaceURI</a></td>
> <td>Returns the namespace URI of a node</td>
> <td style="text-align:center;">2</td>
> </tr>
> <tr>
> <td><a href="prop_node_nextsibling.asp">nextSibling</a></td>
> <td>Returns the next node at the same node tree level</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_nodename.asp">nodeName</a></td>
> <td>Returns the name of a node, depending on its type</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_nodetype.asp">nodeType</a></td>
> <td>Returns the type of a node</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_nodevalue.asp">nodeValue</a></td>
> <td>Sets or returns the value of a node, depending on its
> type</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_ownerdocument.asp">ownerDocument</a></td>
> <td>Returns the root element (document object) for a node</td>
> <td style="text-align:center;">2</td>
> </tr>
> <tr>
> <td><a href="prop_node_parentnode.asp">parentNode</a></td>
> <td>Returns the parent node of a node</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_prefix.asp">prefix</a></td>
> <td>Sets or returns the namespace prefix of a node</td>
> <td style="text-align:center;">2</td>
> </tr>
> <tr>
> <td><a href="prop_node_previoussibling.asp">previousSibling</a></td>
> <td>Returns the previous node at the same node tree level</td>
> <td style="text-align:center;">1</td>
> </tr>
> <tr>
> <td><a href="prop_node_textcontent.asp">textContent</a></td>
> <td>Sets or returns the textual content of a node and its
> descendants</td>
> <td style="text-align:center;">3</td>
> </tr>
> </table>
My Perl code is as follows:
#!/usr/bin/perl
use warnings;
use strict;
use URI;
use Web::Scraper;

# website to scrape
my $urlToScrape = "http://www.w3schools.com/jsref/dom_obj_node.asp";

# collect each table column into its own array
my $rennersdata = scraper {
    process "table.reference > tr > td > a", 'renners[]' => 'TEXT';
    process "table.reference > tr > td:nth-of-type(2)", 'landrenner[]' => 'TEXT';
    process "table.reference > tr > td:nth-of-type(3)", 'dom[]' => 'TEXT';
};
my $res = $rennersdata->scrape(URI->new($urlToScrape));

# first loop: property names
for my $i (0 .. $#{$res->{renners}}) {
    print "<PropertyList>\n";
    print "<Property>\n";
    print "<Name> ";
    print $res->{renners}[$i];
    print "\n";
    print "</Name>";
    print "\n";
    print "</Property>\n";
    print "</PropertyList>\n";
}

# second loop: descriptions
for my $j (0 .. $#{$res->{landrenner}}) {
    print "<ReturnValue>\n";
    print $res->{landrenner}[$j];
    print "\n";
    print "</ReturnValue>\n";
}

# third loop: DOM versions
for my $k (0 .. $#{$res->{dom}}) {
    print "<domversion>\n";
    print $res->{dom}[$k];
    print "\n";
    print "</domversion>\n";
}
When I run the code above, all the output I get looks like this:
<PropertyList>
<Property>
<Name>attributes</Name>
</Property>
</PropertyList>
<PropertyList>
<Property>
<Name>baseURI</Name>
</Property>
</PropertyList>
...
<ReturnValue>
Returns a collection of a node's attributes
</ReturnValue>
....
<domversion>
1
</domversion>
....
Is it possible to get output like the following instead:
<PropertyList>
<Property>
<Name>attributes</Name>
<ReturnValue>Returns a collection of a node's attributes</ReturnValue>
<DOMVersion>1</DOMVersion>
</Property>
</PropertyList>
How can I combine the three for loops above to get this output?
Thanks a lot.
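For reference, one possible way to merge the loops is sketched below. It relies on the three arrays being parallel: index $i in renners, landrenner and dom should all come from the same table row, since the header row uses th cells and is skipped by all three selectors. The helper names xml_escape, tidy and properties_to_xml are made up for this illustration, and the $sample hash is stand-in data shaped like $res so the sketch runs without network access. Wrapping all properties in a single PropertyList is an assumption about the intended layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Trim the ends and collapse internal runs of whitespace (some table
# cells wrap across lines in the HTML source).
sub tidy {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/^\s+|\s+$//g;
    $text =~ s/\s+/ /g;
    return $text;
}

# Walk the three parallel arrays with one shared index and emit one
# <Property> element per table row.
sub properties_to_xml {
    my ($res) = @_;
    my $xml = "<PropertyList>\n";
    for my $i (0 .. $#{ $res->{renners} }) {
        $xml .= "<Property>\n";
        $xml .= "<Name>" . xml_escape(tidy($res->{renners}[$i])) . "</Name>\n";
        $xml .= "<ReturnValue>" . xml_escape(tidy($res->{landrenner}[$i])) . "</ReturnValue>\n";
        $xml .= "<DOMVersion>" . xml_escape(tidy($res->{dom}[$i])) . "</DOMVersion>\n";
        $xml .= "</Property>\n";
    }
    $xml .= "</PropertyList>\n";
    return $xml;
}

# Stand-in data with the same shape as the $res hash in the question.
my $sample = {
    renners    => [ 'attributes', 'baseURI' ],
    landrenner => [
        "Returns a collection of a node's attributes",
        'Returns the absolute base URI of a node',
    ],
    dom => [ '1', '3' ],
};

print properties_to_xml($sample);
```

In the original script, the three for loops would be replaced by a single call such as print properties_to_xml($res);. If a row were ever missing a link or a cell, the arrays would drift out of step, so scraping row by row may be more robust for less regular tables.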