perl - Using a regex with HTML::TreeBuilder to match multiple "id" values

Question

I have a list of URLs in an array:

http://www.site.sx/doc1.html
http://www.site.sx/doc2.html
http://www.site.sx/doc3.html
.
.
.

Let's look at the contents of the first page, doc1.html:

<?xml version = "1.0" encoding = "utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Birds</title>
   </head>

   <body>
      <p>Some bird's feather's aren't actually blue, they're clear.</p>
      <!--LOOK HERE--><p id = "abc123FACT1xyz789">There exists an insect that makes 100-decibel sounds.</p> 
   </body>
</html>

Now let's look at the contents of the second page, doc2.html:

<?xml version = "1.0" encoding = "utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Cats</title>
   </head>

   <body>
      <p>Moota goes from house to house.</p>
      <!--LOOK HERE--><p id = "abc123FACT2xyz789">Falling from a higher altitude might be better than a lower one.</p> 
   </body>
</html>

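doc3.html will have an id value with the same abc123.....xyz789 pattern, and so will the rest of the pages in my array. I want to capture the text content of each of these elements. Each document has only one id value that follows this specific pattern. In reality there are several other id values throughout each document, but for simplicity we can ignore them.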

Big picture: I want to grab each match with something like this:

$tree->look_down( _tag => 'p' , id => "abc123.*xyz789")->as_text; # NOT SURE HOW TO MAKE AN ARRAY OF MATCHES...

score 0 · Accepted Answer

my $match = $tree->look_down( _tag => 'p' , id => qr{abc123.*xyz789} )->as_text;

This gets what I was after.
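The one-liner above handles a single tree. To collect the matches from every page into an array, which is what the question asks for, something along these lines might work. This is only a sketch: the @pages strings are cut-down stand-ins for the fetched documents (downloading the URLs is left out), and the names @pages and @matches are illustrative, not from the original post. It relies on look_down returning all matching elements when called in list context, and it anchors the pattern with ^ and $, which is a slightly stricter choice than the regex in the answer.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-ins for the downloaded pages; in practice each string
# would be the content fetched from one URL in the array.
my @pages = (
    '<html><body><p>Filler.</p><p id="abc123FACT1xyz789">There exists an insect that makes 100-decibel sounds.</p></body></html>',
    '<html><body><p>Filler.</p><p id="abc123FACT2xyz789">Falling from a higher altitude might be better than a lower one.</p></body></html>',
);

my @matches;
for my $html (@pages) {
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In list context look_down returns every matching element, so a page
    # with no matching id simply contributes nothing instead of dying on undef.
    push @matches,
        map { $_->as_text }
        $tree->look_down( _tag => 'p', id => qr/^abc123.*xyz789$/ );

    $tree->delete;    # release the parse tree once we are done with it
}

print "$_\n" for @matches;
```

Calling ->as_text directly on the result of look_down, as in the one-liner, dies if a page has no matching element, because look_down returns undef in scalar context when nothing matches. Mapping over the list-context result avoids that case.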
