php - XPath to scrape two node values from HTML only when both exist

Question

I'm using Curl, XPath and PHP to scrape product names and prices from HTML source. Here is a sample similar to the source I'm working with:

<div class="Gamesdb">
  <p class="media-title">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
  </p>
  <p class="sub-title"> Console </p>
  <p class="rating star-50">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/ProductReviews.html">(1)</a>
  </p>
  <p class="mt5">
    <span class="price-preffix">
      <a href="/Games/Console/4-/105/Bluetooth-Headset/">1 New</a>
      from 
    </span>
    <a class="wt-link" href="/Games/Console/4-/105/Bluetooth-Headset/">
      <span class="price">
        <em>£34</em>
        .99
      </span>
      <span class="free-delivery"> FREE delivery</span>
    </a>
  </p>
  <p class="mt10">
    <a class="primary button" href="/Games/Console/4-/105/Bluetooth-Headset/">
      Product Details
      <span style="color: rgb(255, 255, 255); margin-left: 6px; font-size: 16px;">»</span>
    </a>
  </p>
</div>

I want to extract the media title, i.e.:

<p class="media-title">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
    </p>

only when the following price class is also present:

<span class="price">
    <em>£34</em>
    .99
    </span>

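Many of the other listed products don't include it. I need to extract both the product name and the price, or nothing at all, and then move on to the next product.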

Here is a sample of the code I'm currently using; it retrieves every result regardless of any other condition:

$results=file_get_contents('SCRAPEDHTML.txt');

$html = new DOMDocument();
@$html->loadHtml($results);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query('//p[@class="media-title"]|//span[@class="price"]');

foreach ($nodelist as $n){

$results2[]=$n->nodeValue;

}

I believe this can be done with the right XPath query, but so far I haven't managed it. Thanks in advance.

score 0 · Accepted Answer

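You can't have a single XPath that returns both the product name and its price and nothing else. My suggestion is to first fetch all the div nodes that contain both pieces of information: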

//div[p[@class='media-title'] and .//span[@class='price']]

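(that is, 'all div nodes that have a p child with class media-title and a span descendant with class price'). Then loop over the returned nodes and use two more XPaths, evaluated relative to each div, to extract the product name and the price: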

p[@class='media-title']

and

.//span[@class='price']
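
A minimal sketch of this two-step approach, assuming markup shaped like the question's. The sample HTML, the second product without a price, and the variable names are made up for illustration. It writes the price step as .//span so the search stays inside each div; a bare // would restart from the document root.

```php
<?php
// Hypothetical sample: the first product has a price, the second does not.
$source = <<<'HTML'
<html><body>
<div class="Gamesdb">
  <p class="media-title"><a href="#">Bluetooth Headset</a></p>
  <p class="mt5"><span class="price"><em>&pound;34</em> .99</span></p>
</div>
<div class="Gamesdb">
  <p class="media-title"><a href="#">Unpriced Item</a></p>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($source);
$xpath = new DOMXPath($doc);

// Step 1: only the divs with both a title child and a price descendant.
$containers = $xpath->query(
    "//div[p[@class='media-title'] and .//span[@class='price']]"
);

$products = [];
foreach ($containers as $div) {
    // Step 2: relative queries, evaluated with the div as the context node.
    $title = $xpath->query("p[@class='media-title']", $div)->item(0);
    $price = $xpath->query(".//span[@class='price']", $div)->item(0);
    $products[] = [
        'title' => trim($title->textContent),
        // Remove the whitespace between the two halves of the price.
        'price' => preg_replace('/\s+/u', '', $price->textContent),
    ];
}

print_r($products);
```

Because the filtering happens in step 1, step 2 can assume both nodes exist for every div it sees.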

score 0 · Accepted Answer

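I'm assuming every product is wrapped in its own div.Gamesdb. If not, the source HTML may not have enough structure to do this with XPath alone. You may have to index the product names and look for a price near the matching product name.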

You can do this with a single giant XPath, but I'd recommend using several XPaths instead. I'll show both ways.

First create your DOMXPath and register a helper for matching class names.

// This helper is the equivalent to the XPath:
// contains(concat(' ',normalize-space(@attr),' '), ' $token ')
// It's not necessary, but it's a bit easier to read and more
// bulletproof than @ATTR="TOKEN"
function has_token($attr, $token)
{
    // A missing attribute arrives as an empty array
    if (empty($attr)) {
        return false;
    }
    $attr = $attr[0];
    $regex = '/(?:^|\s)'.preg_quote($token,'/').'(?:\s|$)/Su';
    return (bool) preg_match($regex, $attr->value);
}

$xp = new DOMXPath($d); // $d is the DOMDocument holding the scraped HTML
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions("has_token");
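
If registering a PHP callback is not an option, the plain XPath 1.0 expression mentioned in the comment above gives the same token match. A small sketch, where class_token() is a hypothetical helper name (not part of DOMXPath) and the markup is made up:

```php
<?php
// Hypothetical helper: builds an XPath 1.0 test that matches one
// whitespace-separated token inside the class attribute.
// Assumes $token contains no quote characters.
function class_token(string $token): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' {$token} ')";
}

$doc = new DOMDocument();
@$doc->loadHTML(
    '<div><span class="price big">A</span>'
    . '<span class="price-preffix">B</span>'
    . '<span class="price">C</span></div>'
);
$xpath = new DOMXPath($doc);

// @class="price" would miss "price big"; contains(@class, "price")
// would wrongly match "price-preffix". The token test avoids both.
$matches = [];
foreach ($xpath->query('//span[' . class_token('price') . ']') as $span) {
    $matches[] = $span->textContent;
}

print_r($matches);
```

The trade-off is readability: the expression is longer than the php:function call, but it needs no namespace registration.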

Then you can use one giant XPath:

$xp_container = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
$xp_title = 'p[php:function("has_token", @class, "media-title")]';
// The leading "." keeps the search inside the current div;
// a bare "//" would restart from the document root
$xp_price = './/span[php:function("has_token", @class, "price")]';

// Containers that hold both a title and a price
$xp_both = "{$xp_container}[{$xp_title}][{$xp_price}]";
$xp_titles_prices = "{$xp_both}/{$xp_title} | {$xp_both}/{$xp_price}";

$nodes = $xp->query($xp_titles_prices);

$items = array();

$i = 0; // enumerator
foreach ($nodes as $node) {
    $key = ($node->nodeName==='p') ? 'title' : 'price';
    $value = '';
    switch ($key) {
        case 'price':
            // remove inner whitespace
            $value = preg_replace('/\s+/Su', '', trim($node->textContent));
            break;
        case 'title':
            $value = preg_replace('/\s+/Su', ' ', trim($node->textContent));
            break;
    }
    $items[(int) floor($i/2)][$key] = $value;
    $i += 1;
}

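However, the overall code is fragile and unclear. The XPath union operator (|) returns nodes in document order, so we can't simply split the list in half. The PHP code has to walk every item in the node list and use the DOM to work out which field that piece of data belongs to. Think about what you would have to change if you wanted to extend the code to collect a third item (for example, price). Now imagine making those changes three months from now, when this code is no longer fresh in your mind.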
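
To see the ordering issue in isolation, here is a small sketch with made-up markup and class names. The titles come first in the expression, yet the result interleaves titles and prices as they appear in the document:

```php
<?php
$doc = new DOMDocument();
@$doc->loadHTML(
    '<div><p class="t">T1</p><span class="p">P1</span></div>'
    . '<div><p class="t">T2</p><span class="p">P2</span></div>'
);
$xpath = new DOMXPath($doc);

// Collect the text of every node the union returns, in result order.
$order = [];
foreach ($xpath->query('//p[@class="t"] | //span[@class="p"]') as $node) {
    $order[] = $node->textContent;
}

print_r($order);
```

This is why the loop above has to inspect each node's name instead of treating the first half of the list as titles and the second half as prices.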

I suggest you use multiple XPath calls instead, and do the "do we have data for both price and title" check in PHP rather than in XPath:

$xpitems = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
// below use $xpitems context:
$xptitle = 'normalize-space(p[php:function("has_token", @class, "media-title")])';
$xpprice = 'normalize-space(.//span[php:function("has_token", @class, "price")])';

$nodeitems = $xp->query($xpitems);

$items = array();
foreach ($nodeitems as $nodeitem) {
    $item = array(
        'title' => $xp->evaluate($xptitle, $nodeitem),
        'price' => str_replace(' ', '', $xp->evaluate($xpprice, $nodeitem)),
    );
    // Only add this item if we have data for *all* fields:
    if (count(array_filter($item)) === count($item)) {
        $items[] = $item;
    }
}

This is easier to read and understand, and easier to extend in the future.
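
As a rough illustration of that extensibility, the sketch below collects a third field by adding one entry to a field map. It is an assumption that the sub-title paragraph from the question's markup is the extra field of interest, and the name 'platform' for it is invented. The sample HTML and the $fields array are also made up, and plain @class comparisons stand in for has_token to keep the example self-contained.

```php
<?php
// Hypothetical sample: only the first product has every field.
$source = <<<'HTML'
<html><body>
<div class="Gamesdb">
  <p class="media-title"><a href="#">Bluetooth Headset</a></p>
  <p class="sub-title"> Console </p>
  <p class="mt5"><span class="price"><em>34</em> .99</span></p>
</div>
<div class="Gamesdb">
  <p class="media-title"><a href="#">Unpriced Item</a></p>
  <p class="sub-title"> Console </p>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($source);
$xp = new DOMXPath($doc);

// One XPath string expression per field, all relative to the product div.
$fields = [
    'title'    => 'normalize-space(p[@class="media-title"])',
    'platform' => 'normalize-space(p[@class="sub-title"])',
    'price'    => 'normalize-space(.//span[@class="price"])',
];

$items = [];
foreach ($xp->query('//div[@class="Gamesdb"]') as $div) {
    $item = [];
    foreach ($fields as $name => $expr) {
        // evaluate() yields an empty string when the node is missing.
        $item[$name] = $xp->evaluate($expr, $div);
    }
    $item['price'] = str_replace(' ', '', $item['price']);
    // Keep the product only if every field is non-empty.
    if (count(array_filter($item)) === count($item)) {
        $items[] = $item;
    }
}

print_r($items);
```

Adding a field touches one line of the map; the completeness check stays unchanged.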
