Given the following HTML:

...more html above...
<div class="any_name">
  <p>Element A goes here</p>
  <p>Element B goes here</p>
</div>
...more html below...

I need to get the XPath of any element that contains a given text (for example, "A go"), and end up with something like this:

/html/body/div[4]/div[2]/div/article/div/p

Note that the structure may be different in each case, so I need to search the whole document for the text every time...

I can fetch the web content successfully, but applying an expression like //element[text()="A goes"] in Web::Scraper does not seem to work.

How can I get this XPath from the element's content? Any ideas? Thanks!

1 Answer

You can use XML::Twig to get it. I changed the XPath you provided slightly and made it more modular.

use strict;
use warnings;
use feature 'say';
use XML::Twig;

# Parse the sample document; XML::Twig expects well-formed markup.
my $twig = XML::Twig->new();
$twig->parse(<<_HTML_
<html><body>
<div class="any_name">
  <p>Element A goes here</p>
  <p>Element B goes here</p>
</div>
</body></html>
_HTML_
);

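# For each letter, find the <p> elements whose text matches the regex
# and print the full XPath of every match.
for my $letter (qw(A B C)) {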
  foreach my $t ($twig->get_xpath("//p[string()=~/$letter goes/]")) {
    say $t->xpath;
  }
}

You can use a regular expression in the XPath to find the elements whose text matches your letter. The version with text()= did not work here because, with = instead of =~ //, XML::Twig compares against the complete text of the element, so a fragment such as "A goes" never matches. Also, in XML::Twig the syntax to use is string(), not text().
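To make the difference between = and =~ concrete, here is a minimal sketch that runs three expressions against the same sample markup. The helper count_matches is a hypothetical name introduced only for this illustration; it is not part of XML::Twig.

```perl
use strict;
use warnings;
use feature 'say';
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): number of elements
# selected by an expression passed to get_xpath.
sub count_matches {
    my ($twig, $xpath) = @_;
    my @found = $twig->get_xpath($xpath);
    return scalar @found;
}

my $twig = XML::Twig->new();
$twig->parse(
    '<html><body><div>'
  . '<p>Element A goes here</p><p>Element B goes here</p>'
  . '</div></body></html>'
);

# "=" compares against the complete text of the element,
# so a fragment of the text does not select anything.
say count_matches($twig, '//p[string()="A goes"]');

# "=" with the complete text selects the element.
say count_matches($twig, '//p[string()="Element A goes here"]');

# "=~" applies a regular expression, so a fragment is enough.
say count_matches($twig, '//p[string()=~/A goes/]');
```

Only the second and third expressions should select the first paragraph, which is why the regex form is the practical choice when only part of the text is known.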

The get_xpath method returns a list of elements. I call the xpath method on each of them, which returns the full XPath to that element. For the sample above the output is:

/html/body/div/p[1]
/html/body/div/p[2]

There is no match for C because the sample HTML contains no such text.
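The question asks for any element, not only p. A possible sketch for that case follows. Since string() covers the text of descendants too, a wildcard search would also select the enclosing div, body and html, so this version walks the tree and tests each element's own text with text_only instead. The helper name xpaths_for_text is hypothetical, and the sample markup is the one from the question.

```perl
use strict;
use warnings;
use feature 'say';
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): paths of all elements
# whose own text, excluding the text of child elements, matches $pattern.
sub xpaths_for_text {
    my ($twig, $pattern) = @_;
    return map  { $_->xpath }
           grep { $_->is_elt && $_->text_only =~ $pattern }  # skip text nodes
           $twig->root->descendants_or_self;
}

my $twig = XML::Twig->new();
$twig->parse(<<'_HTML_');
<html><body>
<div class="any_name">
  <p>Element A goes here</p>
  <p>Element B goes here</p>
</div>
</body></html>
_HTML_

# Print the path of every element that directly contains "A go".
say for xpaths_for_text($twig, qr/A go/);
```

For pages that are not well-formed XML, parse will fail; XML::Twig also provides parse_html, which converts the input through HTML::TreeBuilder and therefore needs that module installed. Paths obtained that way describe the converted tree, which may differ from what a browser reports.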

answered 2012-06-25T10:56:57.853